pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
  - semantic-search
  - chinese
  - mteb
model-index:
  - name: sbert-chinese-general-v1
    results:
      - task:
          type: STS
        dataset:
          type: C-MTEB/AFQMC
          name: MTEB AFQMC
          config: default
          split: validation
          revision: None
        metrics:
          - type: cos_sim_pearson
            value: 22.293919432958074
          - type: cos_sim_spearman
            value: 22.56718923553609
          - type: euclidean_pearson
            value: 22.525656322797026
          - type: euclidean_spearman
            value: 22.56718923553609
          - type: manhattan_pearson
            value: 22.501773028824065
          - type: manhattan_spearman
            value: 22.536992587828397
      - task:
          type: STS
        dataset:
          type: C-MTEB/ATEC
          name: MTEB ATEC
          config: default
          split: test
          revision: None
        metrics:
          - type: cos_sim_pearson
            value: 30.33575274463879
          - type: cos_sim_spearman
            value: 30.298708742167772
          - type: euclidean_pearson
            value: 32.33094743729218
          - type: euclidean_spearman
            value: 30.298710993858734
          - type: manhattan_pearson
            value: 32.31155376195945
          - type: manhattan_spearman
            value: 30.267669681690744
      - task:
          type: Classification
        dataset:
          type: mteb/amazon_reviews_multi
          name: MTEB AmazonReviewsClassification (zh)
          config: zh
          split: test
          revision: 1399c76144fd37290681b995c656ef9b2e06e26d
        metrics:
          - type: accuracy
            value: 37.507999999999996
          - type: f1
            value: 36.436808400753286
      - task:
          type: STS
        dataset:
          type: C-MTEB/BQ
          name: MTEB BQ
          config: default
          split: test
          revision: None
        metrics:
          - type: cos_sim_pearson
            value: 41.493256724214255
          - type: cos_sim_spearman
            value: 40.98395961967895
          - type: euclidean_pearson
            value: 41.12345737966565
          - type: euclidean_spearman
            value: 40.983959619555996
          - type: manhattan_pearson
            value: 41.02584539471014
          - type: manhattan_spearman
            value: 40.87549513383032
      - task:
          type: BitextMining
        dataset:
          type: mteb/bucc-bitext-mining
          name: MTEB BUCC (zh-en)
          config: zh-en
          split: test
          revision: d51519689f32196a32af33b075a01d0e7c51e252
        metrics:
          - type: accuracy
            value: 9.794628751974724
          - type: f1
            value: 9.350535369492716
          - type: precision
            value: 9.179392662804986
          - type: recall
            value: 9.794628751974724
      - task:
          type: Clustering
        dataset:
          type: C-MTEB/CLSClusteringP2P
          name: MTEB CLSClusteringP2P
          config: default
          split: test
          revision: None
        metrics:
          - type: v_measure
            value: 34.984726547788284
      - task:
          type: Clustering
        dataset:
          type: C-MTEB/CLSClusteringS2S
          name: MTEB CLSClusteringS2S
          config: default
          split: test
          revision: None
        metrics:
          - type: v_measure
            value: 27.81945732281589
      - task:
          type: Reranking
        dataset:
          type: C-MTEB/CMedQAv1-reranking
          name: MTEB CMedQAv1
          config: default
          split: test
          revision: None
        metrics:
          - type: map
            value: 53.06586280826805
          - type: mrr
            value: 59.58781746031746
      - task:
          type: Reranking
        dataset:
          type: C-MTEB/CMedQAv2-reranking
          name: MTEB CMedQAv2
          config: default
          split: test
          revision: None
        metrics:
          - type: map
            value: 52.83635946154306
          - type: mrr
            value: 59.315079365079356
      - task:
          type: Retrieval
        dataset:
          type: C-MTEB/CmedqaRetrieval
          name: MTEB CmedqaRetrieval
          config: default
          split: dev
          revision: None
        metrics:
          - type: map_at_1
            value: 5.721
          - type: map_at_10
            value: 8.645
          - type: map_at_100
            value: 9.434
          - type: map_at_1000
            value: 9.586
          - type: map_at_3
            value: 7.413
          - type: map_at_5
            value: 8.05
          - type: mrr_at_1
            value: 9.626999999999999
          - type: mrr_at_10
            value: 13.094
          - type: mrr_at_100
            value: 13.854
          - type: mrr_at_1000
            value: 13.958
          - type: mrr_at_3
            value: 11.724
          - type: mrr_at_5
            value: 12.409
          - type: ndcg_at_1
            value: 9.626999999999999
          - type: ndcg_at_10
            value: 11.35
          - type: ndcg_at_100
            value: 15.593000000000002
          - type: ndcg_at_1000
            value: 19.619
          - type: ndcg_at_3
            value: 9.317
          - type: ndcg_at_5
            value: 10.049
          - type: precision_at_1
            value: 9.626999999999999
          - type: precision_at_10
            value: 2.796
          - type: precision_at_100
            value: 0.629
          - type: precision_at_1000
            value: 0.11800000000000001
          - type: precision_at_3
            value: 5.476
          - type: precision_at_5
            value: 4.1209999999999996
          - type: recall_at_1
            value: 5.721
          - type: recall_at_10
            value: 15.190000000000001
          - type: recall_at_100
            value: 33.633
          - type: recall_at_1000
            value: 62.019999999999996
          - type: recall_at_3
            value: 9.099
          - type: recall_at_5
            value: 11.423
      - task:
          type: PairClassification
        dataset:
          type: C-MTEB/CMNLI
          name: MTEB Cmnli
          config: default
          split: validation
          revision: None
        metrics:
          - type: cos_sim_accuracy
            value: 77.36620565243535
          - type: cos_sim_ap
            value: 85.92291866877001
          - type: cos_sim_f1
            value: 78.19390231037029
          - type: cos_sim_precision
            value: 71.24183006535948
          - type: cos_sim_recall
            value: 86.64952069207388
          - type: dot_accuracy
            value: 77.36620565243535
          - type: dot_ap
            value: 85.94113738490068
          - type: dot_f1
            value: 78.19390231037029
          - type: dot_precision
            value: 71.24183006535948
          - type: dot_recall
            value: 86.64952069207388
          - type: euclidean_accuracy
            value: 77.36620565243535
          - type: euclidean_ap
            value: 85.92291893444687
          - type: euclidean_f1
            value: 78.19390231037029
          - type: euclidean_precision
            value: 71.24183006535948
          - type: euclidean_recall
            value: 86.64952069207388
          - type: manhattan_accuracy
            value: 77.29404690318701
          - type: manhattan_ap
            value: 85.88284362100919
          - type: manhattan_f1
            value: 78.17836812144213
          - type: manhattan_precision
            value: 71.18448838548666
          - type: manhattan_recall
            value: 86.69628244096329
          - type: max_accuracy
            value: 77.36620565243535
          - type: max_ap
            value: 85.94113738490068
          - type: max_f1
            value: 78.19390231037029
      - task:
          type: Retrieval
        dataset:
          type: C-MTEB/CovidRetrieval
          name: MTEB CovidRetrieval
          config: default
          split: dev
          revision: None
        metrics:
          - type: map_at_1
            value: 26.976
          - type: map_at_10
            value: 35.18
          - type: map_at_100
            value: 35.921
          - type: map_at_1000
            value: 35.998999999999995
          - type: map_at_3
            value: 32.763
          - type: map_at_5
            value: 34.165
          - type: mrr_at_1
            value: 26.976
          - type: mrr_at_10
            value: 35.234
          - type: mrr_at_100
            value: 35.939
          - type: mrr_at_1000
            value: 36.016
          - type: mrr_at_3
            value: 32.771
          - type: mrr_at_5
            value: 34.172999999999995
          - type: ndcg_at_1
            value: 26.976
          - type: ndcg_at_10
            value: 39.635
          - type: ndcg_at_100
            value: 43.54
          - type: ndcg_at_1000
            value: 45.723
          - type: ndcg_at_3
            value: 34.652
          - type: ndcg_at_5
            value: 37.186
          - type: precision_at_1
            value: 26.976
          - type: precision_at_10
            value: 5.406
          - type: precision_at_100
            value: 0.736
          - type: precision_at_1000
            value: 0.091
          - type: precision_at_3
            value: 13.418
          - type: precision_at_5
            value: 9.293999999999999
          - type: recall_at_1
            value: 26.976
          - type: recall_at_10
            value: 53.766999999999996
          - type: recall_at_100
            value: 72.761
          - type: recall_at_1000
            value: 90.148
          - type: recall_at_3
            value: 40.095
          - type: recall_at_5
            value: 46.233000000000004
      - task:
          type: Retrieval
        dataset:
          type: C-MTEB/DuRetrieval
          name: MTEB DuRetrieval
          config: default
          split: dev
          revision: None
        metrics:
          - type: map_at_1
            value: 11.285
          - type: map_at_10
            value: 30.259000000000004
          - type: map_at_100
            value: 33.772000000000006
          - type: map_at_1000
            value: 34.037
          - type: map_at_3
            value: 21.038999999999998
          - type: map_at_5
            value: 25.939
          - type: mrr_at_1
            value: 45.1
          - type: mrr_at_10
            value: 55.803999999999995
          - type: mrr_at_100
            value: 56.301
          - type: mrr_at_1000
            value: 56.330999999999996
          - type: mrr_at_3
            value: 53.333
          - type: mrr_at_5
            value: 54.798
          - type: ndcg_at_1
            value: 45.1
          - type: ndcg_at_10
            value: 41.156
          - type: ndcg_at_100
            value: 49.518
          - type: ndcg_at_1000
            value: 52.947
          - type: ndcg_at_3
            value: 39.708
          - type: ndcg_at_5
            value: 38.704
          - type: precision_at_1
            value: 45.1
          - type: precision_at_10
            value: 20.75
          - type: precision_at_100
            value: 3.424
          - type: precision_at_1000
            value: 0.42700000000000005
          - type: precision_at_3
            value: 35.632999999999996
          - type: precision_at_5
            value: 30.080000000000002
          - type: recall_at_1
            value: 11.285
          - type: recall_at_10
            value: 43.242000000000004
          - type: recall_at_100
            value: 68.604
          - type: recall_at_1000
            value: 85.904
          - type: recall_at_3
            value: 24.404
          - type: recall_at_5
            value: 32.757
      - task:
          type: Retrieval
        dataset:
          type: C-MTEB/EcomRetrieval
          name: MTEB EcomRetrieval
          config: default
          split: dev
          revision: None
        metrics:
          - type: map_at_1
            value: 21
          - type: map_at_10
            value: 28.364
          - type: map_at_100
            value: 29.199
          - type: map_at_1000
            value: 29.265
          - type: map_at_3
            value: 25.717000000000002
          - type: map_at_5
            value: 27.311999999999998
          - type: mrr_at_1
            value: 21
          - type: mrr_at_10
            value: 28.364
          - type: mrr_at_100
            value: 29.199
          - type: mrr_at_1000
            value: 29.265
          - type: mrr_at_3
            value: 25.717000000000002
          - type: mrr_at_5
            value: 27.311999999999998
          - type: ndcg_at_1
            value: 21
          - type: ndcg_at_10
            value: 32.708
          - type: ndcg_at_100
            value: 37.184
          - type: ndcg_at_1000
            value: 39.273
          - type: ndcg_at_3
            value: 27.372000000000003
          - type: ndcg_at_5
            value: 30.23
          - type: precision_at_1
            value: 21
          - type: precision_at_10
            value: 4.66
          - type: precision_at_100
            value: 0.685
          - type: precision_at_1000
            value: 0.086
          - type: precision_at_3
            value: 10.732999999999999
          - type: precision_at_5
            value: 7.82
          - type: recall_at_1
            value: 21
          - type: recall_at_10
            value: 46.6
          - type: recall_at_100
            value: 68.5
          - type: recall_at_1000
            value: 85.6
          - type: recall_at_3
            value: 32.2
          - type: recall_at_5
            value: 39.1
      - task:
          type: Classification
        dataset:
          type: C-MTEB/IFlyTek-classification
          name: MTEB IFlyTek
          config: default
          split: validation
          revision: None
        metrics:
          - type: accuracy
            value: 44.878799538283964
          - type: f1
            value: 33.84678310261366
      - task:
          type: Classification
        dataset:
          type: C-MTEB/JDReview-classification
          name: MTEB JDReview
          config: default
          split: test
          revision: None
        metrics:
          - type: accuracy
            value: 82.1951219512195
          - type: ap
            value: 46.78292030042397
          - type: f1
            value: 76.20482468514128
      - task:
          type: STS
        dataset:
          type: C-MTEB/LCQMC
          name: MTEB LCQMC
          config: default
          split: test
          revision: None
        metrics:
          - type: cos_sim_pearson
            value: 62.84331627244547
          - type: cos_sim_spearman
            value: 68.39990265073726
          - type: euclidean_pearson
            value: 66.87431827169324
          - type: euclidean_spearman
            value: 68.39990264979167
          - type: manhattan_pearson
            value: 66.89702078900328
          - type: manhattan_spearman
            value: 68.42107302159141
      - task:
          type: Reranking
        dataset:
          type: C-MTEB/Mmarco-reranking
          name: MTEB MMarcoReranking
          config: default
          split: dev
          revision: None
        metrics:
          - type: map
            value: 9.28600891904827
          - type: mrr
            value: 8.057936507936509
      - task:
          type: Retrieval
        dataset:
          type: C-MTEB/MMarcoRetrieval
          name: MTEB MMarcoRetrieval
          config: default
          split: dev
          revision: None
        metrics:
          - type: map_at_1
            value: 22.820999999999998
          - type: map_at_10
            value: 30.44
          - type: map_at_100
            value: 31.35
          - type: map_at_1000
            value: 31.419000000000004
          - type: map_at_3
            value: 28.134999999999998
          - type: map_at_5
            value: 29.482000000000003
          - type: mrr_at_1
            value: 23.782
          - type: mrr_at_10
            value: 31.141999999999996
          - type: mrr_at_100
            value: 32.004
          - type: mrr_at_1000
            value: 32.068000000000005
          - type: mrr_at_3
            value: 28.904000000000003
          - type: mrr_at_5
            value: 30.214999999999996
          - type: ndcg_at_1
            value: 23.782
          - type: ndcg_at_10
            value: 34.625
          - type: ndcg_at_100
            value: 39.226
          - type: ndcg_at_1000
            value: 41.128
          - type: ndcg_at_3
            value: 29.968
          - type: ndcg_at_5
            value: 32.35
          - type: precision_at_1
            value: 23.782
          - type: precision_at_10
            value: 4.994
          - type: precision_at_100
            value: 0.736
          - type: precision_at_1000
            value: 0.09
          - type: precision_at_3
            value: 12.13
          - type: precision_at_5
            value: 8.495999999999999
          - type: recall_at_1
            value: 22.820999999999998
          - type: recall_at_10
            value: 47.141
          - type: recall_at_100
            value: 68.952
          - type: recall_at_1000
            value: 83.985
          - type: recall_at_3
            value: 34.508
          - type: recall_at_5
            value: 40.232
      - task:
          type: Classification
        dataset:
          type: mteb/amazon_massive_intent
          name: MTEB MassiveIntentClassification (zh-CN)
          config: zh-CN
          split: test
          revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7
        metrics:
          - type: accuracy
            value: 57.343644922663074
          - type: f1
            value: 56.744802953803486
      - task:
          type: Classification
        dataset:
          type: mteb/amazon_massive_scenario
          name: MTEB MassiveScenarioClassification (zh-CN)
          config: zh-CN
          split: test
          revision: 7d571f92784cd94a019292a1f45445077d0ef634
        metrics:
          - type: accuracy
            value: 62.363819771351714
          - type: f1
            value: 62.15920863434656
      - task:
          type: Retrieval
        dataset:
          type: C-MTEB/MedicalRetrieval
          name: MTEB MedicalRetrieval
          config: default
          split: dev
          revision: None
        metrics:
          - type: map_at_1
            value: 14.6
          - type: map_at_10
            value: 18.231
          - type: map_at_100
            value: 18.744
          - type: map_at_1000
            value: 18.811
          - type: map_at_3
            value: 17.133000000000003
          - type: map_at_5
            value: 17.663
          - type: mrr_at_1
            value: 14.6
          - type: mrr_at_10
            value: 18.231
          - type: mrr_at_100
            value: 18.744
          - type: mrr_at_1000
            value: 18.811
          - type: mrr_at_3
            value: 17.133000000000003
          - type: mrr_at_5
            value: 17.663
          - type: ndcg_at_1
            value: 14.6
          - type: ndcg_at_10
            value: 20.349
          - type: ndcg_at_100
            value: 23.204
          - type: ndcg_at_1000
            value: 25.44
          - type: ndcg_at_3
            value: 17.995
          - type: ndcg_at_5
            value: 18.945999999999998
          - type: precision_at_1
            value: 14.6
          - type: precision_at_10
            value: 2.7199999999999998
          - type: precision_at_100
            value: 0.414
          - type: precision_at_1000
            value: 0.06
          - type: precision_at_3
            value: 6.833
          - type: precision_at_5
            value: 4.5600000000000005
          - type: recall_at_1
            value: 14.6
          - type: recall_at_10
            value: 27.200000000000003
          - type: recall_at_100
            value: 41.4
          - type: recall_at_1000
            value: 60
          - type: recall_at_3
            value: 20.5
          - type: recall_at_5
            value: 22.8
      - task:
          type: Classification
        dataset:
          type: C-MTEB/MultilingualSentiment-classification
          name: MTEB MultilingualSentiment
          config: default
          split: validation
          revision: None
        metrics:
          - type: accuracy
            value: 66.58333333333333
          - type: f1
            value: 66.26700927460007
      - task:
          type: PairClassification
        dataset:
          type: C-MTEB/OCNLI
          name: MTEB Ocnli
          config: default
          split: validation
          revision: None
        metrics:
          - type: cos_sim_accuracy
            value: 72.00866269626421
          - type: cos_sim_ap
            value: 77.00520104243304
          - type: cos_sim_f1
            value: 74.39303710490151
          - type: cos_sim_precision
            value: 65.69579288025889
          - type: cos_sim_recall
            value: 85.74445617740233
          - type: dot_accuracy
            value: 72.00866269626421
          - type: dot_ap
            value: 77.00520104243304
          - type: dot_f1
            value: 74.39303710490151
          - type: dot_precision
            value: 65.69579288025889
          - type: dot_recall
            value: 85.74445617740233
          - type: euclidean_accuracy
            value: 72.00866269626421
          - type: euclidean_ap
            value: 77.00520104243304
          - type: euclidean_f1
            value: 74.39303710490151
          - type: euclidean_precision
            value: 65.69579288025889
          - type: euclidean_recall
            value: 85.74445617740233
          - type: manhattan_accuracy
            value: 72.1710882512182
          - type: manhattan_ap
            value: 77.00551017913976
          - type: manhattan_f1
            value: 74.23423423423424
          - type: manhattan_precision
            value: 64.72898664571878
          - type: manhattan_recall
            value: 87.0116156282999
          - type: max_accuracy
            value: 72.1710882512182
          - type: max_ap
            value: 77.00551017913976
          - type: max_f1
            value: 74.39303710490151
      - task:
          type: Classification
        dataset:
          type: C-MTEB/OnlineShopping-classification
          name: MTEB OnlineShopping
          config: default
          split: test
          revision: None
        metrics:
          - type: accuracy
            value: 88.19000000000001
          - type: ap
            value: 85.13415594781077
          - type: f1
            value: 88.17344156114062
      - task:
          type: STS
        dataset:
          type: C-MTEB/PAWSX
          name: MTEB PAWSX
          config: default
          split: test
          revision: None
        metrics:
          - type: cos_sim_pearson
            value: 13.70522140998517
          - type: cos_sim_spearman
            value: 15.07546667334743
          - type: euclidean_pearson
            value: 17.49511420225285
          - type: euclidean_spearman
            value: 15.093970931789618
          - type: manhattan_pearson
            value: 17.44069961390521
          - type: manhattan_spearman
            value: 15.076029291596962
      - task:
          type: STS
        dataset:
          type: C-MTEB/QBQTC
          name: MTEB QBQTC
          config: default
          split: test
          revision: None
        metrics:
          - type: cos_sim_pearson
            value: 26.835294224547155
          - type: cos_sim_spearman
            value: 27.920204597498856
          - type: euclidean_pearson
            value: 26.153796707702803
          - type: euclidean_spearman
            value: 27.920971379720548
          - type: manhattan_pearson
            value: 26.21954147857523
          - type: manhattan_spearman
            value: 27.996860049937478
      - task:
          type: STS
        dataset:
          type: mteb/sts22-crosslingual-sts
          name: MTEB STS22 (zh)
          config: zh
          split: test
          revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80
        metrics:
          - type: cos_sim_pearson
            value: 55.15901259718581
          - type: cos_sim_spearman
            value: 61.57967880874167
          - type: euclidean_pearson
            value: 53.83523291596683
          - type: euclidean_spearman
            value: 61.57967880874167
          - type: manhattan_pearson
            value: 54.99971428907956
          - type: manhattan_spearman
            value: 61.61229543613867
      - task:
          type: STS
        dataset:
          type: mteb/sts22-crosslingual-sts
          name: MTEB STS22 (zh-en)
          config: zh-en
          split: test
          revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80
        metrics:
          - type: cos_sim_pearson
            value: 34.20930208460845
          - type: cos_sim_spearman
            value: 33.879011104224524
          - type: euclidean_pearson
            value: 35.08526425284862
          - type: euclidean_spearman
            value: 33.879011104224524
          - type: manhattan_pearson
            value: 35.509419089701275
          - type: manhattan_spearman
            value: 33.30035487147621
      - task:
          type: STS
        dataset:
          type: C-MTEB/STSB
          name: MTEB STSB
          config: default
          split: test
          revision: None
        metrics:
          - type: cos_sim_pearson
            value: 82.30068282185835
          - type: cos_sim_spearman
            value: 82.16763221361724
          - type: euclidean_pearson
            value: 80.52772752433374
          - type: euclidean_spearman
            value: 82.16797037220333
          - type: manhattan_pearson
            value: 80.51093859500105
          - type: manhattan_spearman
            value: 82.17643310049654
      - task:
          type: Reranking
        dataset:
          type: C-MTEB/T2Reranking
          name: MTEB T2Reranking
          config: default
          split: dev
          revision: None
        metrics:
          - type: map
            value: 65.14113035189213
          - type: mrr
            value: 74.9589270937443
      - task:
          type: Retrieval
        dataset:
          type: C-MTEB/T2Retrieval
          name: MTEB T2Retrieval
          config: default
          split: dev
          revision: None
        metrics:
          - type: map_at_1
            value: 12.013
          - type: map_at_10
            value: 30.885
          - type: map_at_100
            value: 34.643
          - type: map_at_1000
            value: 34.927
          - type: map_at_3
            value: 21.901
          - type: map_at_5
            value: 26.467000000000002
          - type: mrr_at_1
            value: 49.623
          - type: mrr_at_10
            value: 58.05200000000001
          - type: mrr_at_100
            value: 58.61300000000001
          - type: mrr_at_1000
            value: 58.643
          - type: mrr_at_3
            value: 55.947
          - type: mrr_at_5
            value: 57.229
          - type: ndcg_at_1
            value: 49.623
          - type: ndcg_at_10
            value: 41.802
          - type: ndcg_at_100
            value: 49.975
          - type: ndcg_at_1000
            value: 53.504
          - type: ndcg_at_3
            value: 43.515
          - type: ndcg_at_5
            value: 41.576
          - type: precision_at_1
            value: 49.623
          - type: precision_at_10
            value: 22.052
          - type: precision_at_100
            value: 3.6450000000000005
          - type: precision_at_1000
            value: 0.45399999999999996
          - type: precision_at_3
            value: 38.616
          - type: precision_at_5
            value: 31.966
          - type: recall_at_1
            value: 12.013
          - type: recall_at_10
            value: 41.891
          - type: recall_at_100
            value: 67.096
          - type: recall_at_1000
            value: 84.756
          - type: recall_at_3
            value: 24.695
          - type: recall_at_5
            value: 32.09
      - task:
          type: Classification
        dataset:
          type: C-MTEB/TNews-classification
          name: MTEB TNews
          config: default
          split: validation
          revision: None
        metrics:
          - type: accuracy
            value: 39.800999999999995
          - type: f1
            value: 38.5345899934575
      - task:
          type: Clustering
        dataset:
          type: C-MTEB/ThuNewsClusteringP2P
          name: MTEB ThuNewsClusteringP2P
          config: default
          split: test
          revision: None
        metrics:
          - type: v_measure
            value: 40.16574242797479
      - task:
          type: Clustering
        dataset:
          type: C-MTEB/ThuNewsClusteringS2S
          name: MTEB ThuNewsClusteringS2S
          config: default
          split: test
          revision: None
        metrics:
          - type: v_measure
            value: 24.232617974671754
      - task:
          type: Retrieval
        dataset:
          type: C-MTEB/VideoRetrieval
          name: MTEB VideoRetrieval
          config: default
          split: dev
          revision: None
        metrics:
          - type: map_at_1
            value: 24.6
          - type: map_at_10
            value: 31.328
          - type: map_at_100
            value: 32.088
          - type: map_at_1000
            value: 32.164
          - type: map_at_3
            value: 29.133
          - type: map_at_5
            value: 30.358
          - type: mrr_at_1
            value: 24.6
          - type: mrr_at_10
            value: 31.328
          - type: mrr_at_100
            value: 32.088
          - type: mrr_at_1000
            value: 32.164
          - type: mrr_at_3
            value: 29.133
          - type: mrr_at_5
            value: 30.358
          - type: ndcg_at_1
            value: 24.6
          - type: ndcg_at_10
            value: 35.150999999999996
          - type: ndcg_at_100
            value: 39.024
          - type: ndcg_at_1000
            value: 41.157
          - type: ndcg_at_3
            value: 30.637999999999998
          - type: ndcg_at_5
            value: 32.833
          - type: precision_at_1
            value: 24.6
          - type: precision_at_10
            value: 4.74
          - type: precision_at_100
            value: 0.66
          - type: precision_at_1000
            value: 0.083
          - type: precision_at_3
            value: 11.667
          - type: precision_at_5
            value: 8.06
          - type: recall_at_1
            value: 24.6
          - type: recall_at_10
            value: 47.4
          - type: recall_at_100
            value: 66
          - type: recall_at_1000
            value: 83
          - type: recall_at_3
            value: 35
          - type: recall_at_5
            value: 40.300000000000004
      - task:
          type: Classification
        dataset:
          type: C-MTEB/waimai-classification
          name: MTEB Waimai
          config: default
          split: test
          revision: None
        metrics:
          - type: accuracy
            value: 83.96000000000001
          - type: ap
            value: 65.11027167433211
          - type: f1
            value: 82.03549710974653
license: apache-2.0
language:
  - zh

DMetaSoul/sbert-chinese-general-v1

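This model is based on the bert-base-chinese version of BERT and was trained on semantic-similarity datasets including NLI, PAWS-X, PKU-Paraphrase-Bank, and STS. It is intended for general-purpose semantic matching, for example text feature extraction, text embedding clustering, and semantic text search. Note that it performs well on the Chinese-STS task but is not the best on other tasks, so there is some risk of overfitting.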

Note: a lightweight version of this model has also been open-sourced!

Usage

1. Sentence-Transformers

To use the model through the sentence-transformers framework, first install it:

pip install -U sentence-transformers

Then use the following code to load the model and extract text embedding vectors:

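from sentence_transformers import SentenceTransformer

# Two paraphrased Chinese sentences to embed
sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]

# Load the model from the HuggingFace Hub and encode the sentences
model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v1')
embeddings = model.encode(sentences)  # one embedding vector per sentence
print(embeddings)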
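
For semantic matching, embeddings like these are usually compared with cosine similarity. The following is a minimal, dependency-free sketch rather than part of the model's own API: the helper names `cosine_similarity` and `rank_by_similarity` are illustrative assumptions, and the toy 3-dimensional vectors stand in for rows of `embeddings` returned by `model.encode`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real sentence embeddings (assumed data)
query = [1.0, 0.0, 1.0]
corpus = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

order = rank_by_similarity(query, corpus)
print(order)  # indices of corpus entries, best match first
```

With real embeddings, the same helper can be called as `cosine_similarity(embeddings[0], embeddings[1])`. For larger collections a vectorized routine such as `sentence_transformers.util.cos_sim` is likely a better fit.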

2. HuggingFace Transformers

If you prefer not to use sentence-transformers, you can also load the model with HuggingFace Transformers and extract text embeddings:

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: average the token embeddings, using the attention mask so padding is ignored
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-general-v1')
model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-general-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
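
To make the role of the attention mask in `mean_pooling` concrete, here is a small pure-Python sketch of the same idea for a single sentence. The function name `masked_mean` and the toy token vectors are assumptions made for illustration; the real code operates on batched PyTorch tensors.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    # Guard against division by zero, mirroring the clamp in mean_pooling
    count = max(count, 1e-9)
    return [t / count for t in total]


# One sentence with two real tokens and one padding token (toy values)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # the padding vector does not influence the average
```

Without the mask, the padding vector would be averaged in and sentences padded to different lengths would get distorted embeddings.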

Evaluation

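The model was evaluated on several public semantic-matching datasets by computing the correlation coefficient between embedding similarity and the gold labels: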

| | csts_dev | csts_test | afqmc | lcqmc | bqcorpus | pawsx | xiaobu |
| --- | --- | --- | --- | --- | --- | --- | --- |
| spearman | 84.54% | 82.17% | 23.80% | 65.94% | 45.52% | 11.52% | 48.51% |
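
The scores above are Spearman correlations between predicted similarities and gold labels. The exact evaluation script is not given here, so the following is only a sketch of how such a score could be computed, as a Pearson correlation over ranks with ties sharing their mean rank. The function names and the toy `predicted` and `gold` values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal values that starts at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical cosine similarities and gold similarity labels for four pairs
predicted = [0.91, 0.15, 0.62, 0.40]
gold = [5, 0, 3, 4]

rho = spearman(predicted, gold)
print(rho)  # about 0.8 for this toy data
```

Because Spearman depends only on rank order, it rewards a model for ordering sentence pairs correctly even when its raw similarity values are on a different scale from the labels.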

Citing & Authors

E-mail: xiaowenbin@dmetasoul.com