---
language:
- zh
pipeline_tag: sentence-similarity
tags:
- PEG
- feature-extraction
- sentence-similarity
- transformers
- mteb
model-index:
- name: PEG
  results:
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/CMedQAv1-reranking
      name: MTEB CMedQAv1
      config: default
      split: test
      revision: None
    metrics:
    - type: map
      value: 84.09137463267582
    - type: mrr
      value: 86.6288888888889
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/CMedQAv2-reranking
      name: MTEB CMedQAv2
      config: default
      split: test
      revision: None
    metrics:
    - type: map
      value: 86.55765031914974
    - type: mrr
      value: 89.4325396825397
  - task:
      type: Retrieval
    dataset:
      type: C_MTEB/CmedqaRetrieval
      name: MTEB CmedqaRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 26.101000000000003
    - type: map_at_10
      value: 38.239000000000004
    - type: map_at_100
      value: 40.083
    - type: map_at_1000
      value: 40.205
    - type: map_at_3
      value: 34.386
    - type: map_at_5
      value: 36.425999999999995
    - type: mrr_at_1
      value: 39.434999999999995
    - type: mrr_at_10
      value: 46.967999999999996
    - type: mrr_at_100
      value: 47.946
    - type: mrr_at_1000
      value: 47.997
    - type: mrr_at_3
      value: 44.803
    - type: mrr_at_5
      value: 45.911
    - type: ndcg_at_1
      value: 39.434999999999995
    - type: ndcg_at_10
      value: 44.416
    - type: ndcg_at_100
      value: 51.773
    - type: ndcg_at_1000
      value: 53.888000000000005
    - type: ndcg_at_3
      value: 39.816
    - type: ndcg_at_5
      value: 41.467999999999996
    - type: precision_at_1
      value: 39.434999999999995
    - type: precision_at_10
      value: 9.786999999999999
    - type: precision_at_100
      value: 1.5810000000000002
    - type: precision_at_1000
      value: 0.184
    - type: precision_at_3
      value: 22.414
    - type: precision_at_5
      value: 15.943999999999999
    - type: recall_at_1
      value: 26.101000000000003
    - type: recall_at_10
      value: 53.82900000000001
    - type: recall_at_100
      value: 84.63199999999999
    - type: recall_at_1000
      value: 98.782
    - type: recall_at_3
      value: 39.585
    - type: recall_at_5
      value: 45.141
  - task:
      type: Retrieval
    dataset:
      type: C_MTEB/CovidRetrieval
      name: MTEB CovidRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 70.39
    - type: map_at_10
      value: 78.93599999999999
    - type: map_at_100
      value: 79.202
    - type: map_at_1000
      value: 79.205
    - type: map_at_3
      value: 77.538
    - type: map_at_5
      value: 78.312
    - type: mrr_at_1
      value: 70.706
    - type: mrr_at_10
      value: 79.018
    - type: mrr_at_100
      value: 79.28399999999999
    - type: mrr_at_1000
      value: 79.288
    - type: mrr_at_3
      value: 77.713
    - type: mrr_at_5
      value: 78.462
    - type: ndcg_at_1
      value: 70.601
    - type: ndcg_at_10
      value: 82.555
    - type: ndcg_at_100
      value: 83.718
    - type: ndcg_at_1000
      value: 83.855
    - type: ndcg_at_3
      value: 79.779
    - type: ndcg_at_5
      value: 81.149
    - type: precision_at_1
      value: 70.601
    - type: precision_at_10
      value: 9.463000000000001
    - type: precision_at_100
      value: 0.9979999999999999
    - type: precision_at_1000
      value: 0.101
    - type: precision_at_3
      value: 28.871999999999996
    - type: precision_at_5
      value: 18.019
    - type: recall_at_1
      value: 70.39
    - type: recall_at_10
      value: 93.572
    - type: recall_at_100
      value: 98.736
    - type: recall_at_1000
      value: 99.895
    - type: recall_at_3
      value: 86.091
    - type: recall_at_5
      value: 89.384
  - task:
      type: Retrieval
    dataset:
      type: C_MTEB/DuRetrieval
      name: MTEB DuRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 26.147
    - type: map_at_10
      value: 80.205
    - type: map_at_100
      value: 82.96
    - type: map_at_1000
      value: 82.999
    - type: map_at_3
      value: 55.16799999999999
    - type: map_at_5
      value: 69.798
    - type: mrr_at_1
      value: 89.8
    - type: mrr_at_10
      value: 93.16799999999999
    - type: mrr_at_100
      value: 93.22500000000001
    - type: mrr_at_1000
      value: 93.228
    - type: mrr_at_3
      value: 92.85
    - type: mrr_at_5
      value: 93.067
    - type: ndcg_at_1
      value: 89.8
    - type: ndcg_at_10
      value: 87.668
    - type: ndcg_at_100
      value: 90.16
    - type: ndcg_at_1000
      value: 90.505
    - type: ndcg_at_3
      value: 85.842
    - type: ndcg_at_5
      value: 85.101
    - type: precision_at_1
      value: 89.8
    - type: precision_at_10
      value: 42.225
    - type: precision_at_100
      value: 4.8149999999999995
    - type: precision_at_1000
      value: 0.48900000000000005
    - type: precision_at_3
      value: 76.967
    - type: precision_at_5
      value: 65.32
    - type: recall_at_1
      value: 26.147
    - type: recall_at_10
      value: 89.30399999999999
    - type: recall_at_100
      value: 97.609
    - type: recall_at_1000
      value: 99.409
    - type: recall_at_3
      value: 57.56
    - type: recall_at_5
      value: 74.78200000000001
  - task:
      type: Retrieval
    dataset:
      type: C_MTEB/EcomRetrieval
      name: MTEB EcomRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 53.300000000000004
    - type: map_at_10
      value: 62.507000000000005
    - type: map_at_100
      value: 63.068000000000005
    - type: map_at_1000
      value: 63.08200000000001
    - type: map_at_3
      value: 60.050000000000004
    - type: map_at_5
      value: 61.41
    - type: mrr_at_1
      value: 53.300000000000004
    - type: mrr_at_10
      value: 62.507000000000005
    - type: mrr_at_100
      value: 63.068000000000005
    - type: mrr_at_1000
      value: 63.08200000000001
    - type: mrr_at_3
      value: 60.050000000000004
    - type: mrr_at_5
      value: 61.41
    - type: ndcg_at_1
      value: 53.300000000000004
    - type: ndcg_at_10
      value: 67.31700000000001
    - type: ndcg_at_100
      value: 69.862
    - type: ndcg_at_1000
      value: 70.231
    - type: ndcg_at_3
      value: 62.222
    - type: ndcg_at_5
      value: 64.66300000000001
    - type: precision_at_1
      value: 53.300000000000004
    - type: precision_at_10
      value: 8.260000000000002
    - type: precision_at_100
      value: 0.941
    - type: precision_at_1000
      value: 0.097
    - type: precision_at_3
      value: 22.833000000000002
    - type: precision_at_5
      value: 14.879999999999999
    - type: recall_at_1
      value: 53.300000000000004
    - type: recall_at_10
      value: 82.6
    - type: recall_at_100
      value: 94.1
    - type: recall_at_1000
      value: 97.0
    - type: recall_at_3
      value: 68.5
    - type: recall_at_5
      value: 74.4
  - task:
      type: Retrieval
    dataset:
      type: C_MTEB/MMarcoRetrieval
      name: MTEB MMarcoRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 70.68799999999999
    - type: map_at_10
      value: 79.28399999999999
    - type: map_at_100
      value: 79.537
    - type: map_at_1000
      value: 79.545
    - type: map_at_3
      value: 77.643
    - type: map_at_5
      value: 78.694
    - type: mrr_at_1
      value: 73.05199999999999
    - type: mrr_at_10
      value: 79.794
    - type: mrr_at_100
      value: 80.024
    - type: mrr_at_1000
      value: 80.03099999999999
    - type: mrr_at_3
      value: 78.441
    - type: mrr_at_5
      value: 79.29
    - type: ndcg_at_1
      value: 73.05199999999999
    - type: ndcg_at_10
      value: 82.627
    - type: ndcg_at_100
      value: 83.737
    - type: ndcg_at_1000
      value: 83.946
    - type: ndcg_at_3
      value: 79.585
    - type: ndcg_at_5
      value: 81.306
    - type: precision_at_1
      value: 73.05199999999999
    - type: precision_at_10
      value: 9.835
    - type: precision_at_100
      value: 1.038
    - type: precision_at_1000
      value: 0.106
    - type: precision_at_3
      value: 29.756
    - type: precision_at_5
      value: 18.788
    - type: recall_at_1
      value: 70.68799999999999
    - type: recall_at_10
      value: 92.38300000000001
    - type: recall_at_100
      value: 97.347
    - type: recall_at_1000
      value: 98.992
    - type: recall_at_3
      value: 84.37
    - type: recall_at_5
      value: 88.434
  - task:
      type: Retrieval
    dataset:
      type: C_MTEB/MedicalRetrieval
      name: MTEB MedicalRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 53.1
    - type: map_at_10
      value: 58.36599999999999
    - type: map_at_100
      value: 58.939
    - type: map_at_1000
      value: 58.99100000000001
    - type: map_at_3
      value: 57.15
    - type: map_at_5
      value: 57.794999999999995
    - type: mrr_at_1
      value: 53.2
    - type: mrr_at_10
      value: 58.416000000000004
    - type: mrr_at_100
      value: 58.989999999999995
    - type: mrr_at_1000
      value: 59.041
    - type: mrr_at_3
      value: 57.199999999999996
    - type: mrr_at_5
      value: 57.845
    - type: ndcg_at_1
      value: 53.1
    - type: ndcg_at_10
      value: 60.989000000000004
    - type: ndcg_at_100
      value: 63.967
    - type: ndcg_at_1000
      value: 65.436
    - type: ndcg_at_3
      value: 58.425000000000004
    - type: ndcg_at_5
      value: 59.583
    - type: precision_at_1
      value: 53.1
    - type: precision_at_10
      value: 6.93
    - type: precision_at_100
      value: 0.8370000000000001
    - type: precision_at_1000
      value: 0.096
    - type: precision_at_3
      value: 20.7
    - type: precision_at_5
      value: 12.98
    - type: recall_at_1
      value: 53.1
    - type: recall_at_10
      value: 69.3
    - type: recall_at_100
      value: 83.7
    - type: recall_at_1000
      value: 95.5
    - type: recall_at_3
      value: 62.1
    - type: recall_at_5
      value: 64.9
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/Mmarco-reranking
      name: MTEB MMarcoReranking
      config: default
      split: dev
      revision: None
    metrics:
    - type: map
      value: 33.548800108363665
    - type: mrr
      value: 32.529761904761905
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/T2Reranking
      name: MTEB T2Reranking
      config: default
      split: dev
      revision: None
    metrics:
    - type: map
      value: 69.43381583724414
    - type: mrr
      value: 80.47879657392181
  - task:
      type: Retrieval
    dataset:
      type: C_MTEB/T2Retrieval
      name: MTEB T2Retrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 28.116000000000003
    - type: map_at_10
      value: 80.026
    - type: map_at_100
      value: 83.541
    - type: map_at_1000
      value: 83.592
    - type: map_at_3
      value: 56.092
    - type: map_at_5
      value: 69.114
    - type: mrr_at_1
      value: 91.557
    - type: mrr_at_10
      value: 93.73700000000001
    - type: mrr_at_100
      value: 93.808
    - type: mrr_at_1000
      value: 93.811
    - type: mrr_at_3
      value: 93.384
    - type: mrr_at_5
      value: 93.614
    - type: ndcg_at_1
      value: 91.553
    - type: ndcg_at_10
      value: 87.003
    - type: ndcg_at_100
      value: 90.128
    - type: ndcg_at_1000
      value: 90.615
    - type: ndcg_at_3
      value: 88.205
    - type: ndcg_at_5
      value: 86.978
    - type: precision_at_1
      value: 91.553
    - type: precision_at_10
      value: 43.25
    - type: precision_at_100
      value: 5.067
    - type: precision_at_1000
      value: 0.518
    - type: precision_at_3
      value: 77.25
    - type: precision_at_5
      value: 64.902
    - type: recall_at_1
      value: 28.116000000000003
    - type: recall_at_10
      value: 85.994
    - type: recall_at_100
      value: 96.345
    - type: recall_at_1000
      value: 98.867
    - type: recall_at_3
      value: 57.67099999999999
    - type: recall_at_5
      value: 72.26
  - task:
      type: Retrieval
    dataset:
      type: C_MTEB/VideoRetrieval
      name: MTEB VideoRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 64.9
    - type: map_at_10
      value: 73.763
    - type: map_at_100
      value: 74.116
    - type: map_at_1000
      value: 74.12100000000001
    - type: map_at_3
      value: 72.15
    - type: map_at_5
      value: 73.25
    - type: mrr_at_1
      value: 64.9
    - type: mrr_at_10
      value: 73.763
    - type: mrr_at_100
      value: 74.116
    - type: mrr_at_1000
      value: 74.12100000000001
    - type: mrr_at_3
      value: 72.15
    - type: mrr_at_5
      value: 73.25
    - type: ndcg_at_1
      value: 64.9
    - type: ndcg_at_10
      value: 77.639
    - type: ndcg_at_100
      value: 79.396
    - type: ndcg_at_1000
      value: 79.554
    - type: ndcg_at_3
      value: 74.406
    - type: ndcg_at_5
      value: 76.385
    - type: precision_at_1
      value: 64.9
    - type: precision_at_10
      value: 8.959999999999999
    - type: precision_at_100
      value: 0.979
    - type: precision_at_1000
      value: 0.099
    - type: precision_at_3
      value: 26.967000000000002
    - type: precision_at_5
      value: 17.14
    - type: recall_at_1
      value: 64.9
    - type: recall_at_10
      value: 89.60000000000001
    - type: recall_at_100
      value: 97.89999999999999
    - type: recall_at_1000
      value: 99.2
    - type: recall_at_3
      value: 80.9
    - type: recall_at_5
      value: 85.7
license: apache-2.0
library_name: transformers
---

<h1 align="center">PEG: Towards Robust Text Retrieval with Progressive Learning</h1>

## Model Details

We propose the PEG model (a Progressively Learned Textual Embedding), which progressively adjusts the weights of samples contributing to the loss within an extremely large batch, based on the difficulty levels of negative samples.
We have collected more than 110 million data samples spanning a wide range of fields, such as general knowledge, finance, tourism, and medicine.

Our technical report is available on [arXiv](https://arxiv.org/pdf/2311.11691.pdf).

## Usage (HuggingFace Transformers)

Install transformers and PyTorch (the example imports `torch`):

```
pip install transformers torch
```

Then load the model and compute sentence embeddings:
```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load the tokenizer and model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('TownsWu/PEG')
model = AutoModel.from_pretrained('TownsWu/PEG')

# Two paraphrased queries about changing the bank card linked to Huabei
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

# Tokenize sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without tracking gradients
with torch.no_grad():
    last_hidden_state = model(**inputs, return_dict=True).last_hidden_state
    # Use the hidden state of the first ([CLS]) token as the sentence embedding
    embeddings = last_hidden_state[:, 0]
print("embeddings:")
print(embeddings)
```
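
Since the model is tagged for sentence similarity, a natural follow-up is to compare the resulting embeddings. The following is a minimal sketch that assumes cosine similarity as the comparison measure; the model card does not state which measure the authors use. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical and not part of any library. The helpers work on plain Python lists, so real embeddings could be passed as `embeddings.tolist()`; toy vectors are used here so the example runs without downloading the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs for all candidates, best match first."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real sentence embeddings (assumption)
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

ranking = rank_by_similarity(query, candidates)
for index, score in ranking:
    print(f"candidate {index}: {score:.3f}")
```

Cosine similarity ignores vector length, so the second candidate, which points in the same direction as the query, ranks first even though its magnitude differs.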

## Contact

If you have any questions or suggestions about this project, feel free to open an issue or pull request.
You can also email Tong Wu (townswu@tencent.com).

## Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry:

```
@article{wu2023towards,
  title={Towards Robust Text Retrieval with Progressive Learning},
  author={Wu, Tong and Qin, Yulei and Zhang, Enwei and Xu, Zihan and Gao, Yuting and Li, Ke and Sun, Xing},
  journal={arXiv preprint arXiv:2311.11691},
  year={2023}
}
```