|
--- |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
- transformers |
|
- semantic-search |
|
- chinese |
|
- mteb |
|
model-index: |
|
- name: sbert-chinese-general-v1 |
|
results: |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/AFQMC |
|
name: MTEB AFQMC |
|
config: default |
|
split: validation |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 22.293919432958074 |
|
- type: cos_sim_spearman |
|
value: 22.56718923553609 |
|
- type: euclidean_pearson |
|
value: 22.525656322797026 |
|
- type: euclidean_spearman |
|
value: 22.56718923553609 |
|
- type: manhattan_pearson |
|
value: 22.501773028824065 |
|
- type: manhattan_spearman |
|
value: 22.536992587828397 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/ATEC |
|
name: MTEB ATEC |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 30.33575274463879 |
|
- type: cos_sim_spearman |
|
value: 30.298708742167772 |
|
- type: euclidean_pearson |
|
value: 32.33094743729218 |
|
- type: euclidean_spearman |
|
value: 30.298710993858734 |
|
- type: manhattan_pearson |
|
value: 32.31155376195945 |
|
- type: manhattan_spearman |
|
value: 30.267669681690744 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: mteb/amazon_reviews_multi |
|
name: MTEB AmazonReviewsClassification (zh) |
|
config: zh |
|
split: test |
|
revision: 1399c76144fd37290681b995c656ef9b2e06e26d |
|
metrics: |
|
- type: accuracy |
|
value: 37.507999999999996 |
|
- type: f1 |
|
value: 36.436808400753286 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/BQ |
|
name: MTEB BQ |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 41.493256724214255 |
|
- type: cos_sim_spearman |
|
value: 40.98395961967895 |
|
- type: euclidean_pearson |
|
value: 41.12345737966565 |
|
- type: euclidean_spearman |
|
value: 40.983959619555996 |
|
- type: manhattan_pearson |
|
value: 41.02584539471014 |
|
- type: manhattan_spearman |
|
value: 40.87549513383032 |
|
- task: |
|
type: BitextMining |
|
dataset: |
|
type: mteb/bucc-bitext-mining |
|
name: MTEB BUCC (zh-en) |
|
config: zh-en |
|
split: test |
|
revision: d51519689f32196a32af33b075a01d0e7c51e252 |
|
metrics: |
|
- type: accuracy |
|
value: 9.794628751974724 |
|
- type: f1 |
|
value: 9.350535369492716 |
|
- type: precision |
|
value: 9.179392662804986 |
|
- type: recall |
|
value: 9.794628751974724 |
|
- task: |
|
type: Clustering |
|
dataset: |
|
type: C-MTEB/CLSClusteringP2P |
|
name: MTEB CLSClusteringP2P |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: v_measure |
|
value: 34.984726547788284 |
|
- task: |
|
type: Clustering |
|
dataset: |
|
type: C-MTEB/CLSClusteringS2S |
|
name: MTEB CLSClusteringS2S |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: v_measure |
|
value: 27.81945732281589 |
|
- task: |
|
type: Reranking |
|
dataset: |
|
type: C-MTEB/CMedQAv1-reranking |
|
name: MTEB CMedQAv1 |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: map |
|
value: 53.06586280826805 |
|
- type: mrr |
|
value: 59.58781746031746 |
|
- task: |
|
type: Reranking |
|
dataset: |
|
type: C-MTEB/CMedQAv2-reranking |
|
name: MTEB CMedQAv2 |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: map |
|
value: 52.83635946154306 |
|
- type: mrr |
|
value: 59.315079365079356 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/CmedqaRetrieval |
|
name: MTEB CmedqaRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 5.721 |
|
- type: map_at_10 |
|
value: 8.645 |
|
- type: map_at_100 |
|
value: 9.434 |
|
- type: map_at_1000 |
|
value: 9.586 |
|
- type: map_at_3 |
|
value: 7.413 |
|
- type: map_at_5 |
|
value: 8.05 |
|
- type: mrr_at_1 |
|
value: 9.626999999999999 |
|
- type: mrr_at_10 |
|
value: 13.094 |
|
- type: mrr_at_100 |
|
value: 13.854 |
|
- type: mrr_at_1000 |
|
value: 13.958 |
|
- type: mrr_at_3 |
|
value: 11.724 |
|
- type: mrr_at_5 |
|
value: 12.409 |
|
- type: ndcg_at_1 |
|
value: 9.626999999999999 |
|
- type: ndcg_at_10 |
|
value: 11.35 |
|
- type: ndcg_at_100 |
|
value: 15.593000000000002 |
|
- type: ndcg_at_1000 |
|
value: 19.619 |
|
- type: ndcg_at_3 |
|
value: 9.317 |
|
- type: ndcg_at_5 |
|
value: 10.049 |
|
- type: precision_at_1 |
|
value: 9.626999999999999 |
|
- type: precision_at_10 |
|
value: 2.796 |
|
- type: precision_at_100 |
|
value: 0.629 |
|
- type: precision_at_1000 |
|
value: 0.11800000000000001 |
|
- type: precision_at_3 |
|
value: 5.476 |
|
- type: precision_at_5 |
|
value: 4.1209999999999996 |
|
- type: recall_at_1 |
|
value: 5.721 |
|
- type: recall_at_10 |
|
value: 15.190000000000001 |
|
- type: recall_at_100 |
|
value: 33.633 |
|
- type: recall_at_1000 |
|
value: 62.019999999999996 |
|
- type: recall_at_3 |
|
value: 9.099 |
|
- type: recall_at_5 |
|
value: 11.423 |
|
- task: |
|
type: PairClassification |
|
dataset: |
|
type: C-MTEB/CMNLI |
|
name: MTEB Cmnli |
|
config: default |
|
split: validation |
|
revision: None |
|
metrics: |
|
- type: cos_sim_accuracy |
|
value: 77.36620565243535 |
|
- type: cos_sim_ap |
|
value: 85.92291866877001 |
|
- type: cos_sim_f1 |
|
value: 78.19390231037029 |
|
- type: cos_sim_precision |
|
value: 71.24183006535948 |
|
- type: cos_sim_recall |
|
value: 86.64952069207388 |
|
- type: dot_accuracy |
|
value: 77.36620565243535 |
|
- type: dot_ap |
|
value: 85.94113738490068 |
|
- type: dot_f1 |
|
value: 78.19390231037029 |
|
- type: dot_precision |
|
value: 71.24183006535948 |
|
- type: dot_recall |
|
value: 86.64952069207388 |
|
- type: euclidean_accuracy |
|
value: 77.36620565243535 |
|
- type: euclidean_ap |
|
value: 85.92291893444687 |
|
- type: euclidean_f1 |
|
value: 78.19390231037029 |
|
- type: euclidean_precision |
|
value: 71.24183006535948 |
|
- type: euclidean_recall |
|
value: 86.64952069207388 |
|
- type: manhattan_accuracy |
|
value: 77.29404690318701 |
|
- type: manhattan_ap |
|
value: 85.88284362100919 |
|
- type: manhattan_f1 |
|
value: 78.17836812144213 |
|
- type: manhattan_precision |
|
value: 71.18448838548666 |
|
- type: manhattan_recall |
|
value: 86.69628244096329 |
|
- type: max_accuracy |
|
value: 77.36620565243535 |
|
- type: max_ap |
|
value: 85.94113738490068 |
|
- type: max_f1 |
|
value: 78.19390231037029 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/CovidRetrieval |
|
name: MTEB CovidRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 26.976 |
|
- type: map_at_10 |
|
value: 35.18 |
|
- type: map_at_100 |
|
value: 35.921 |
|
- type: map_at_1000 |
|
value: 35.998999999999995 |
|
- type: map_at_3 |
|
value: 32.763 |
|
- type: map_at_5 |
|
value: 34.165 |
|
- type: mrr_at_1 |
|
value: 26.976 |
|
- type: mrr_at_10 |
|
value: 35.234 |
|
- type: mrr_at_100 |
|
value: 35.939 |
|
- type: mrr_at_1000 |
|
value: 36.016 |
|
- type: mrr_at_3 |
|
value: 32.771 |
|
- type: mrr_at_5 |
|
value: 34.172999999999995 |
|
- type: ndcg_at_1 |
|
value: 26.976 |
|
- type: ndcg_at_10 |
|
value: 39.635 |
|
- type: ndcg_at_100 |
|
value: 43.54 |
|
- type: ndcg_at_1000 |
|
value: 45.723 |
|
- type: ndcg_at_3 |
|
value: 34.652 |
|
- type: ndcg_at_5 |
|
value: 37.186 |
|
- type: precision_at_1 |
|
value: 26.976 |
|
- type: precision_at_10 |
|
value: 5.406 |
|
- type: precision_at_100 |
|
value: 0.736 |
|
- type: precision_at_1000 |
|
value: 0.091 |
|
- type: precision_at_3 |
|
value: 13.418 |
|
- type: precision_at_5 |
|
value: 9.293999999999999 |
|
- type: recall_at_1 |
|
value: 26.976 |
|
- type: recall_at_10 |
|
value: 53.766999999999996 |
|
- type: recall_at_100 |
|
value: 72.761 |
|
- type: recall_at_1000 |
|
value: 90.148 |
|
- type: recall_at_3 |
|
value: 40.095 |
|
- type: recall_at_5 |
|
value: 46.233000000000004 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/DuRetrieval |
|
name: MTEB DuRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 11.285 |
|
- type: map_at_10 |
|
value: 30.259000000000004 |
|
- type: map_at_100 |
|
value: 33.772000000000006 |
|
- type: map_at_1000 |
|
value: 34.037 |
|
- type: map_at_3 |
|
value: 21.038999999999998 |
|
- type: map_at_5 |
|
value: 25.939 |
|
- type: mrr_at_1 |
|
value: 45.1 |
|
- type: mrr_at_10 |
|
value: 55.803999999999995 |
|
- type: mrr_at_100 |
|
value: 56.301 |
|
- type: mrr_at_1000 |
|
value: 56.330999999999996 |
|
- type: mrr_at_3 |
|
value: 53.333 |
|
- type: mrr_at_5 |
|
value: 54.798 |
|
- type: ndcg_at_1 |
|
value: 45.1 |
|
- type: ndcg_at_10 |
|
value: 41.156 |
|
- type: ndcg_at_100 |
|
value: 49.518 |
|
- type: ndcg_at_1000 |
|
value: 52.947 |
|
- type: ndcg_at_3 |
|
value: 39.708 |
|
- type: ndcg_at_5 |
|
value: 38.704 |
|
- type: precision_at_1 |
|
value: 45.1 |
|
- type: precision_at_10 |
|
value: 20.75 |
|
- type: precision_at_100 |
|
value: 3.424 |
|
- type: precision_at_1000 |
|
value: 0.42700000000000005 |
|
- type: precision_at_3 |
|
value: 35.632999999999996 |
|
- type: precision_at_5 |
|
value: 30.080000000000002 |
|
- type: recall_at_1 |
|
value: 11.285 |
|
- type: recall_at_10 |
|
value: 43.242000000000004 |
|
- type: recall_at_100 |
|
value: 68.604 |
|
- type: recall_at_1000 |
|
value: 85.904 |
|
- type: recall_at_3 |
|
value: 24.404 |
|
- type: recall_at_5 |
|
value: 32.757 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/EcomRetrieval |
|
name: MTEB EcomRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 21 |
|
- type: map_at_10 |
|
value: 28.364 |
|
- type: map_at_100 |
|
value: 29.199 |
|
- type: map_at_1000 |
|
value: 29.265 |
|
- type: map_at_3 |
|
value: 25.717000000000002 |
|
- type: map_at_5 |
|
value: 27.311999999999998 |
|
- type: mrr_at_1 |
|
value: 21 |
|
- type: mrr_at_10 |
|
value: 28.364 |
|
- type: mrr_at_100 |
|
value: 29.199 |
|
- type: mrr_at_1000 |
|
value: 29.265 |
|
- type: mrr_at_3 |
|
value: 25.717000000000002 |
|
- type: mrr_at_5 |
|
value: 27.311999999999998 |
|
- type: ndcg_at_1 |
|
value: 21 |
|
- type: ndcg_at_10 |
|
value: 32.708 |
|
- type: ndcg_at_100 |
|
value: 37.184 |
|
- type: ndcg_at_1000 |
|
value: 39.273 |
|
- type: ndcg_at_3 |
|
value: 27.372000000000003 |
|
- type: ndcg_at_5 |
|
value: 30.23 |
|
- type: precision_at_1 |
|
value: 21 |
|
- type: precision_at_10 |
|
value: 4.66 |
|
- type: precision_at_100 |
|
value: 0.685 |
|
- type: precision_at_1000 |
|
value: 0.086 |
|
- type: precision_at_3 |
|
value: 10.732999999999999 |
|
- type: precision_at_5 |
|
value: 7.82 |
|
- type: recall_at_1 |
|
value: 21 |
|
- type: recall_at_10 |
|
value: 46.6 |
|
- type: recall_at_100 |
|
value: 68.5 |
|
- type: recall_at_1000 |
|
value: 85.6 |
|
- type: recall_at_3 |
|
value: 32.2 |
|
- type: recall_at_5 |
|
value: 39.1 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: C-MTEB/IFlyTek-classification |
|
name: MTEB IFlyTek |
|
config: default |
|
split: validation |
|
revision: None |
|
metrics: |
|
- type: accuracy |
|
value: 44.878799538283964 |
|
- type: f1 |
|
value: 33.84678310261366 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: C-MTEB/JDReview-classification |
|
name: MTEB JDReview |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: accuracy |
|
value: 82.1951219512195 |
|
- type: ap |
|
value: 46.78292030042397 |
|
- type: f1 |
|
value: 76.20482468514128 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/LCQMC |
|
name: MTEB LCQMC |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 62.84331627244547 |
|
- type: cos_sim_spearman |
|
value: 68.39990265073726 |
|
- type: euclidean_pearson |
|
value: 66.87431827169324 |
|
- type: euclidean_spearman |
|
value: 68.39990264979167 |
|
- type: manhattan_pearson |
|
value: 66.89702078900328 |
|
- type: manhattan_spearman |
|
value: 68.42107302159141 |
|
- task: |
|
type: Reranking |
|
dataset: |
|
type: C-MTEB/Mmarco-reranking |
|
name: MTEB MMarcoReranking |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map |
|
value: 9.28600891904827 |
|
- type: mrr |
|
value: 8.057936507936509 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/MMarcoRetrieval |
|
name: MTEB MMarcoRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 22.820999999999998 |
|
- type: map_at_10 |
|
value: 30.44 |
|
- type: map_at_100 |
|
value: 31.35 |
|
- type: map_at_1000 |
|
value: 31.419000000000004 |
|
- type: map_at_3 |
|
value: 28.134999999999998 |
|
- type: map_at_5 |
|
value: 29.482000000000003 |
|
- type: mrr_at_1 |
|
value: 23.782 |
|
- type: mrr_at_10 |
|
value: 31.141999999999996 |
|
- type: mrr_at_100 |
|
value: 32.004 |
|
- type: mrr_at_1000 |
|
value: 32.068000000000005 |
|
- type: mrr_at_3 |
|
value: 28.904000000000003 |
|
- type: mrr_at_5 |
|
value: 30.214999999999996 |
|
- type: ndcg_at_1 |
|
value: 23.782 |
|
- type: ndcg_at_10 |
|
value: 34.625 |
|
- type: ndcg_at_100 |
|
value: 39.226 |
|
- type: ndcg_at_1000 |
|
value: 41.128 |
|
- type: ndcg_at_3 |
|
value: 29.968 |
|
- type: ndcg_at_5 |
|
value: 32.35 |
|
- type: precision_at_1 |
|
value: 23.782 |
|
- type: precision_at_10 |
|
value: 4.994 |
|
- type: precision_at_100 |
|
value: 0.736 |
|
- type: precision_at_1000 |
|
value: 0.09 |
|
- type: precision_at_3 |
|
value: 12.13 |
|
- type: precision_at_5 |
|
value: 8.495999999999999 |
|
- type: recall_at_1 |
|
value: 22.820999999999998 |
|
- type: recall_at_10 |
|
value: 47.141 |
|
- type: recall_at_100 |
|
value: 68.952 |
|
- type: recall_at_1000 |
|
value: 83.985 |
|
- type: recall_at_3 |
|
value: 34.508 |
|
- type: recall_at_5 |
|
value: 40.232 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: mteb/amazon_massive_intent |
|
name: MTEB MassiveIntentClassification (zh-CN) |
|
config: zh-CN |
|
split: test |
|
revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7 |
|
metrics: |
|
- type: accuracy |
|
value: 57.343644922663074 |
|
- type: f1 |
|
value: 56.744802953803486 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: mteb/amazon_massive_scenario |
|
name: MTEB MassiveScenarioClassification (zh-CN) |
|
config: zh-CN |
|
split: test |
|
revision: 7d571f92784cd94a019292a1f45445077d0ef634 |
|
metrics: |
|
- type: accuracy |
|
value: 62.363819771351714 |
|
- type: f1 |
|
value: 62.15920863434656 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/MedicalRetrieval |
|
name: MTEB MedicalRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 14.6 |
|
- type: map_at_10 |
|
value: 18.231 |
|
- type: map_at_100 |
|
value: 18.744 |
|
- type: map_at_1000 |
|
value: 18.811 |
|
- type: map_at_3 |
|
value: 17.133000000000003 |
|
- type: map_at_5 |
|
value: 17.663 |
|
- type: mrr_at_1 |
|
value: 14.6 |
|
- type: mrr_at_10 |
|
value: 18.231 |
|
- type: mrr_at_100 |
|
value: 18.744 |
|
- type: mrr_at_1000 |
|
value: 18.811 |
|
- type: mrr_at_3 |
|
value: 17.133000000000003 |
|
- type: mrr_at_5 |
|
value: 17.663 |
|
- type: ndcg_at_1 |
|
value: 14.6 |
|
- type: ndcg_at_10 |
|
value: 20.349 |
|
- type: ndcg_at_100 |
|
value: 23.204 |
|
- type: ndcg_at_1000 |
|
value: 25.44 |
|
- type: ndcg_at_3 |
|
value: 17.995 |
|
- type: ndcg_at_5 |
|
value: 18.945999999999998 |
|
- type: precision_at_1 |
|
value: 14.6 |
|
- type: precision_at_10 |
|
value: 2.7199999999999998 |
|
- type: precision_at_100 |
|
value: 0.414 |
|
- type: precision_at_1000 |
|
value: 0.06 |
|
- type: precision_at_3 |
|
value: 6.833 |
|
- type: precision_at_5 |
|
value: 4.5600000000000005 |
|
- type: recall_at_1 |
|
value: 14.6 |
|
- type: recall_at_10 |
|
value: 27.200000000000003 |
|
- type: recall_at_100 |
|
value: 41.4 |
|
- type: recall_at_1000 |
|
value: 60 |
|
- type: recall_at_3 |
|
value: 20.5 |
|
- type: recall_at_5 |
|
value: 22.8 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: C-MTEB/MultilingualSentiment-classification |
|
name: MTEB MultilingualSentiment |
|
config: default |
|
split: validation |
|
revision: None |
|
metrics: |
|
- type: accuracy |
|
value: 66.58333333333333 |
|
- type: f1 |
|
value: 66.26700927460007 |
|
- task: |
|
type: PairClassification |
|
dataset: |
|
type: C-MTEB/OCNLI |
|
name: MTEB Ocnli |
|
config: default |
|
split: validation |
|
revision: None |
|
metrics: |
|
- type: cos_sim_accuracy |
|
value: 72.00866269626421 |
|
- type: cos_sim_ap |
|
value: 77.00520104243304 |
|
- type: cos_sim_f1 |
|
value: 74.39303710490151 |
|
- type: cos_sim_precision |
|
value: 65.69579288025889 |
|
- type: cos_sim_recall |
|
value: 85.74445617740233 |
|
- type: dot_accuracy |
|
value: 72.00866269626421 |
|
- type: dot_ap |
|
value: 77.00520104243304 |
|
- type: dot_f1 |
|
value: 74.39303710490151 |
|
- type: dot_precision |
|
value: 65.69579288025889 |
|
- type: dot_recall |
|
value: 85.74445617740233 |
|
- type: euclidean_accuracy |
|
value: 72.00866269626421 |
|
- type: euclidean_ap |
|
value: 77.00520104243304 |
|
- type: euclidean_f1 |
|
value: 74.39303710490151 |
|
- type: euclidean_precision |
|
value: 65.69579288025889 |
|
- type: euclidean_recall |
|
value: 85.74445617740233 |
|
- type: manhattan_accuracy |
|
value: 72.1710882512182 |
|
- type: manhattan_ap |
|
value: 77.00551017913976 |
|
- type: manhattan_f1 |
|
value: 74.23423423423424 |
|
- type: manhattan_precision |
|
value: 64.72898664571878 |
|
- type: manhattan_recall |
|
value: 87.0116156282999 |
|
- type: max_accuracy |
|
value: 72.1710882512182 |
|
- type: max_ap |
|
value: 77.00551017913976 |
|
- type: max_f1 |
|
value: 74.39303710490151 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: C-MTEB/OnlineShopping-classification |
|
name: MTEB OnlineShopping |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: accuracy |
|
value: 88.19000000000001 |
|
- type: ap |
|
value: 85.13415594781077 |
|
- type: f1 |
|
value: 88.17344156114062 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/PAWSX |
|
name: MTEB PAWSX |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 13.70522140998517 |
|
- type: cos_sim_spearman |
|
value: 15.07546667334743 |
|
- type: euclidean_pearson |
|
value: 17.49511420225285 |
|
- type: euclidean_spearman |
|
value: 15.093970931789618 |
|
- type: manhattan_pearson |
|
value: 17.44069961390521 |
|
- type: manhattan_spearman |
|
value: 15.076029291596962 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/QBQTC |
|
name: MTEB QBQTC |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 26.835294224547155 |
|
- type: cos_sim_spearman |
|
value: 27.920204597498856 |
|
- type: euclidean_pearson |
|
value: 26.153796707702803 |
|
- type: euclidean_spearman |
|
value: 27.920971379720548 |
|
- type: manhattan_pearson |
|
value: 26.21954147857523 |
|
- type: manhattan_spearman |
|
value: 27.996860049937478 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: mteb/sts22-crosslingual-sts |
|
name: MTEB STS22 (zh) |
|
config: zh |
|
split: test |
|
revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80 |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 55.15901259718581 |
|
- type: cos_sim_spearman |
|
value: 61.57967880874167 |
|
- type: euclidean_pearson |
|
value: 53.83523291596683 |
|
- type: euclidean_spearman |
|
value: 61.57967880874167 |
|
- type: manhattan_pearson |
|
value: 54.99971428907956 |
|
- type: manhattan_spearman |
|
value: 61.61229543613867 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: mteb/sts22-crosslingual-sts |
|
name: MTEB STS22 (zh-en) |
|
config: zh-en |
|
split: test |
|
revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80 |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 34.20930208460845 |
|
- type: cos_sim_spearman |
|
value: 33.879011104224524 |
|
- type: euclidean_pearson |
|
value: 35.08526425284862 |
|
- type: euclidean_spearman |
|
value: 33.879011104224524 |
|
- type: manhattan_pearson |
|
value: 35.509419089701275 |
|
- type: manhattan_spearman |
|
value: 33.30035487147621 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/STSB |
|
name: MTEB STSB |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 82.30068282185835 |
|
- type: cos_sim_spearman |
|
value: 82.16763221361724 |
|
- type: euclidean_pearson |
|
value: 80.52772752433374 |
|
- type: euclidean_spearman |
|
value: 82.16797037220333 |
|
- type: manhattan_pearson |
|
value: 80.51093859500105 |
|
- type: manhattan_spearman |
|
value: 82.17643310049654 |
|
- task: |
|
type: Reranking |
|
dataset: |
|
type: C-MTEB/T2Reranking |
|
name: MTEB T2Reranking |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map |
|
value: 65.14113035189213 |
|
- type: mrr |
|
value: 74.9589270937443 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/T2Retrieval |
|
name: MTEB T2Retrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 12.013 |
|
- type: map_at_10 |
|
value: 30.885 |
|
- type: map_at_100 |
|
value: 34.643 |
|
- type: map_at_1000 |
|
value: 34.927 |
|
- type: map_at_3 |
|
value: 21.901 |
|
- type: map_at_5 |
|
value: 26.467000000000002 |
|
- type: mrr_at_1 |
|
value: 49.623 |
|
- type: mrr_at_10 |
|
value: 58.05200000000001 |
|
- type: mrr_at_100 |
|
value: 58.61300000000001 |
|
- type: mrr_at_1000 |
|
value: 58.643 |
|
- type: mrr_at_3 |
|
value: 55.947 |
|
- type: mrr_at_5 |
|
value: 57.229 |
|
- type: ndcg_at_1 |
|
value: 49.623 |
|
- type: ndcg_at_10 |
|
value: 41.802 |
|
- type: ndcg_at_100 |
|
value: 49.975 |
|
- type: ndcg_at_1000 |
|
value: 53.504 |
|
- type: ndcg_at_3 |
|
value: 43.515 |
|
- type: ndcg_at_5 |
|
value: 41.576 |
|
- type: precision_at_1 |
|
value: 49.623 |
|
- type: precision_at_10 |
|
value: 22.052 |
|
- type: precision_at_100 |
|
value: 3.6450000000000005 |
|
- type: precision_at_1000 |
|
value: 0.45399999999999996 |
|
- type: precision_at_3 |
|
value: 38.616 |
|
- type: precision_at_5 |
|
value: 31.966 |
|
- type: recall_at_1 |
|
value: 12.013 |
|
- type: recall_at_10 |
|
value: 41.891 |
|
- type: recall_at_100 |
|
value: 67.096 |
|
- type: recall_at_1000 |
|
value: 84.756 |
|
- type: recall_at_3 |
|
value: 24.695 |
|
- type: recall_at_5 |
|
value: 32.09 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: C-MTEB/TNews-classification |
|
name: MTEB TNews |
|
config: default |
|
split: validation |
|
revision: None |
|
metrics: |
|
- type: accuracy |
|
value: 39.800999999999995 |
|
- type: f1 |
|
value: 38.5345899934575 |
|
- task: |
|
type: Clustering |
|
dataset: |
|
type: C-MTEB/ThuNewsClusteringP2P |
|
name: MTEB ThuNewsClusteringP2P |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: v_measure |
|
value: 40.16574242797479 |
|
- task: |
|
type: Clustering |
|
dataset: |
|
type: C-MTEB/ThuNewsClusteringS2S |
|
name: MTEB ThuNewsClusteringS2S |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: v_measure |
|
value: 24.232617974671754 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/VideoRetrieval |
|
name: MTEB VideoRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 24.6 |
|
- type: map_at_10 |
|
value: 31.328 |
|
- type: map_at_100 |
|
value: 32.088 |
|
- type: map_at_1000 |
|
value: 32.164 |
|
- type: map_at_3 |
|
value: 29.133 |
|
- type: map_at_5 |
|
value: 30.358 |
|
- type: mrr_at_1 |
|
value: 24.6 |
|
- type: mrr_at_10 |
|
value: 31.328 |
|
- type: mrr_at_100 |
|
value: 32.088 |
|
- type: mrr_at_1000 |
|
value: 32.164 |
|
- type: mrr_at_3 |
|
value: 29.133 |
|
- type: mrr_at_5 |
|
value: 30.358 |
|
- type: ndcg_at_1 |
|
value: 24.6 |
|
- type: ndcg_at_10 |
|
value: 35.150999999999996 |
|
- type: ndcg_at_100 |
|
value: 39.024 |
|
- type: ndcg_at_1000 |
|
value: 41.157 |
|
- type: ndcg_at_3 |
|
value: 30.637999999999998 |
|
- type: ndcg_at_5 |
|
value: 32.833 |
|
- type: precision_at_1 |
|
value: 24.6 |
|
- type: precision_at_10 |
|
value: 4.74 |
|
- type: precision_at_100 |
|
value: 0.66 |
|
- type: precision_at_1000 |
|
value: 0.083 |
|
- type: precision_at_3 |
|
value: 11.667 |
|
- type: precision_at_5 |
|
value: 8.06 |
|
- type: recall_at_1 |
|
value: 24.6 |
|
- type: recall_at_10 |
|
value: 47.4 |
|
- type: recall_at_100 |
|
value: 66 |
|
- type: recall_at_1000 |
|
value: 83 |
|
- type: recall_at_3 |
|
value: 35 |
|
- type: recall_at_5 |
|
value: 40.300000000000004 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: C-MTEB/waimai-classification |
|
name: MTEB Waimai |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: accuracy |
|
value: 83.96000000000001 |
|
- type: ap |
|
value: 65.11027167433211 |
|
- type: f1 |
|
value: 82.03549710974653 |
|
license: apache-2.0 |
|
language: |
|
- zh |
|
--- |
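
# DMetaSoul/sbert-chinese-general-v1

This model is based on the [bert-base-chinese](https://huggingface.co/bert-base-chinese) version of BERT and was trained on semantic-similarity datasets such as NLI, PAWS-X, PKU-Paraphrase-Bank, and STS. It is intended for **general-purpose semantic matching** scenarios, for example text feature extraction, text-vector clustering, and semantic text search. Note that the model performs well on Chinese-STS tasks but is not the best choice on other tasks, and it carries some risk of overfitting.

Note: a [lightweight version](https://huggingface.co/DMetaSoul/sbert-chinese-general-v1-distill) of this model has also been open-sourced.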

# Usage

## 1. Sentence-Transformers

To use this model through the [sentence-transformers](https://www.SBERT.net) framework, first install it:

```
pip install -U sentence-transformers
```

Then use the following code to load the model and extract text embedding vectors:

```python
from sentence_transformers import SentenceTransformer

sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]

# Load the model from the Hugging Face Hub and encode the sentences
model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v1')
embeddings = model.encode(sentences)
print(embeddings)
```
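
Embeddings like these are commonly compared with cosine similarity for semantic matching. The following is a minimal sketch rather than part of the original model card: the `cosine_similarity` helper and the toy vectors are illustrative assumptions, written in plain Python so it runs without downloading the model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a
emb_c = [-3.0, 0.0, 1.0]  # orthogonal to emb_a

score_same = cosine_similarity(emb_a, emb_b)
score_orth = cosine_similarity(emb_a, emb_c)
print(score_same, score_orth)
```

With real embeddings, `cosine_similarity(embeddings[0], embeddings[1])` would score the two example sentences. The sentence-transformers package also ships `util.cos_sim`, which serves the same purpose on tensors.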

## 2. HuggingFace Transformers

If you prefer not to use [sentence-transformers](https://www.SBERT.net), you can also load the model with HuggingFace Transformers and extract text vectors:

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-general-v1')
model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-general-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
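
To see why `mean_pooling` weights tokens by the attention mask, the sketch below reproduces the same computation in plain Python for a single sentence. The function name `masked_mean` and the toy numbers are hypothetical and only for illustration. Padding positions (mask value 0) are excluded from the average, which a naive mean over all positions would not do.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            summed[i] += vector[i] * keep
    # Mirror torch.clamp(..., min=1e-9) to avoid division by zero.
    count = max(float(sum(attention_mask)), 1e-9)
    return [value / count for value in summed]


# One sentence with two real tokens followed by two padding tokens.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

pooled = masked_mean(tokens, mask)                          # padding ignored
naive = [sum(col) / len(tokens) for col in zip(*tokens)]    # padding included
print(pooled)
print(naive)
```

The two printed vectors differ because the naive average lets the padding vectors pull the result away from the real tokens.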

## Evaluation

The model was evaluated on several public semantic-matching datasets by computing the correlation coefficient between vector similarity and the ground-truth labels:

| | **csts_dev** | **csts_test** | **afqmc** | **lcqmc** | **bqcorpus** | **pawsx** | **xiaobu** |
| ------------ | ------------ | ------------- | --------- | --------- | ------------ | --------- | ---------- |
| **spearman** | 84.54% | 82.17% | 23.80% | 65.94% | 45.52% | 11.52% | 48.51% |
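
As a rough illustration of the metric in the table, the sketch below computes a Spearman rank correlation between similarity scores and gold labels. It is not the evaluation script used for the numbers above: the helper names and the five sample pairs are hypothetical, and the code is plain Python so it runs without extra dependencies.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical cosine similarities and gold labels for five sentence pairs.
similarities = [0.91, 0.15, 0.62, 0.40, 0.78]
labels = [5, 0, 2, 3, 4]

rho = spearman(similarities, labels)
print(f"{rho * 100:.2f}%")
```

For real evaluations, `scipy.stats.spearmanr` provides the same statistic and is the more usual choice.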

## Citing & Authors

E-mail: xiaowenbin@dmetasoul.com