--- pipeline_tag: sentence-similarity tags: - sentence-transformers - feature-extraction - sentence-similarity - transformers - mteb model-index: - name: mmlw-roberta-base results: - task: type: Clustering dataset: type: PL-MTEB/8tags-clustering name: MTEB 8TagsClustering config: default split: test revision: None metrics: - type: v_measure value: 33.08463724780795 - task: type: Classification dataset: type: PL-MTEB/allegro-reviews name: MTEB AllegroReviews config: default split: test revision: None metrics: - type: accuracy value: 40.25844930417495 - type: f1 value: 35.59685265418916 - task: type: Retrieval dataset: type: arguana-pl name: MTEB ArguAna-PL config: default split: test revision: None metrics: - type: map_at_1 value: 33.073 - type: map_at_10 value: 50.223 - type: map_at_100 value: 50.942 - type: map_at_1000 value: 50.94499999999999 - type: map_at_3 value: 45.721000000000004 - type: map_at_5 value: 48.413000000000004 - type: mrr_at_1 value: 34.424 - type: mrr_at_10 value: 50.68899999999999 - type: mrr_at_100 value: 51.437999999999995 - type: mrr_at_1000 value: 51.441 - type: mrr_at_3 value: 46.219 - type: mrr_at_5 value: 48.921 - type: ndcg_at_1 value: 33.073 - type: ndcg_at_10 value: 59.021 - type: ndcg_at_100 value: 61.902 - type: ndcg_at_1000 value: 61.983999999999995 - type: ndcg_at_3 value: 49.818 - type: ndcg_at_5 value: 54.644999999999996 - type: precision_at_1 value: 33.073 - type: precision_at_10 value: 8.684 - type: precision_at_100 value: 0.9900000000000001 - type: precision_at_1000 value: 0.1 - type: precision_at_3 value: 20.555 - type: precision_at_5 value: 14.666 - type: recall_at_1 value: 33.073 - type: recall_at_10 value: 86.842 - type: recall_at_100 value: 99.004 - type: recall_at_1000 value: 99.644 - type: recall_at_3 value: 61.663999999999994 - type: recall_at_5 value: 73.329 - task: type: Classification dataset: type: PL-MTEB/cbd name: MTEB CBD config: default split: test revision: None metrics: - type: accuracy value: 68.11 - type: 
ap value: 20.916633959031266 - type: f1 value: 56.85804802205465 - task: type: PairClassification dataset: type: PL-MTEB/cdsce-pairclassification name: MTEB CDSC-E config: default split: test revision: None metrics: - type: cos_sim_accuracy value: 89.2 - type: cos_sim_ap value: 79.1041156765933 - type: cos_sim_f1 value: 70.0 - type: cos_sim_precision value: 74.11764705882354 - type: cos_sim_recall value: 66.3157894736842 - type: dot_accuracy value: 88.2 - type: dot_ap value: 72.57183688228149 - type: dot_f1 value: 67.16417910447761 - type: dot_precision value: 63.67924528301887 - type: dot_recall value: 71.05263157894737 - type: euclidean_accuracy value: 89.3 - type: euclidean_ap value: 79.01345533432428 - type: euclidean_f1 value: 70.19498607242339 - type: euclidean_precision value: 74.55621301775149 - type: euclidean_recall value: 66.3157894736842 - type: manhattan_accuracy value: 89.3 - type: manhattan_ap value: 79.01671381791259 - type: manhattan_f1 value: 70.0280112044818 - type: manhattan_precision value: 74.8502994011976 - type: manhattan_recall value: 65.78947368421053 - type: max_accuracy value: 89.3 - type: max_ap value: 79.1041156765933 - type: max_f1 value: 70.19498607242339 - task: type: STS dataset: type: PL-MTEB/cdscr-sts name: MTEB CDSC-R config: default split: test revision: None metrics: - type: cos_sim_pearson value: 91.79559442663039 - type: cos_sim_spearman value: 92.5438168962641 - type: euclidean_pearson value: 92.02981265332856 - type: euclidean_spearman value: 92.5548245733484 - type: manhattan_pearson value: 91.95296287979178 - type: manhattan_spearman value: 92.50279516120241 - task: type: Retrieval dataset: type: dbpedia-pl name: MTEB DBPedia-PL config: default split: test revision: None metrics: - type: map_at_1 value: 7.829999999999999 - type: map_at_10 value: 16.616 - type: map_at_100 value: 23.629 - type: map_at_1000 value: 25.235999999999997 - type: map_at_3 value: 12.485 - type: map_at_5 value: 14.077 - type: mrr_at_1 value: 
61.75000000000001 - type: mrr_at_10 value: 69.852 - type: mrr_at_100 value: 70.279 - type: mrr_at_1000 value: 70.294 - type: mrr_at_3 value: 68.375 - type: mrr_at_5 value: 69.187 - type: ndcg_at_1 value: 49.75 - type: ndcg_at_10 value: 36.217 - type: ndcg_at_100 value: 41.235 - type: ndcg_at_1000 value: 48.952 - type: ndcg_at_3 value: 41.669 - type: ndcg_at_5 value: 38.285000000000004 - type: precision_at_1 value: 61.5 - type: precision_at_10 value: 28.499999999999996 - type: precision_at_100 value: 9.572 - type: precision_at_1000 value: 2.025 - type: precision_at_3 value: 44.083 - type: precision_at_5 value: 36.3 - type: recall_at_1 value: 7.829999999999999 - type: recall_at_10 value: 21.462999999999997 - type: recall_at_100 value: 47.095 - type: recall_at_1000 value: 71.883 - type: recall_at_3 value: 13.891 - type: recall_at_5 value: 16.326999999999998 - task: type: Retrieval dataset: type: fiqa-pl name: MTEB FiQA-PL config: default split: test revision: None metrics: - type: map_at_1 value: 16.950000000000003 - type: map_at_10 value: 27.422 - type: map_at_100 value: 29.146 - type: map_at_1000 value: 29.328 - type: map_at_3 value: 23.735999999999997 - type: map_at_5 value: 25.671 - type: mrr_at_1 value: 33.796 - type: mrr_at_10 value: 42.689 - type: mrr_at_100 value: 43.522 - type: mrr_at_1000 value: 43.563 - type: mrr_at_3 value: 40.226 - type: mrr_at_5 value: 41.685 - type: ndcg_at_1 value: 33.642 - type: ndcg_at_10 value: 35.008 - type: ndcg_at_100 value: 41.839 - type: ndcg_at_1000 value: 45.035 - type: ndcg_at_3 value: 31.358999999999998 - type: ndcg_at_5 value: 32.377 - type: precision_at_1 value: 33.642 - type: precision_at_10 value: 9.937999999999999 - type: precision_at_100 value: 1.685 - type: precision_at_1000 value: 0.22699999999999998 - type: precision_at_3 value: 21.142 - type: precision_at_5 value: 15.586 - type: recall_at_1 value: 16.950000000000003 - type: recall_at_10 value: 42.286 - type: recall_at_100 value: 68.51899999999999 - type: 
recall_at_1000 value: 87.471 - type: recall_at_3 value: 28.834 - type: recall_at_5 value: 34.274 - task: type: Retrieval dataset: type: hotpotqa-pl name: MTEB HotpotQA-PL config: default split: test revision: None metrics: - type: map_at_1 value: 37.711 - type: map_at_10 value: 57.867999999999995 - type: map_at_100 value: 58.77 - type: map_at_1000 value: 58.836999999999996 - type: map_at_3 value: 54.400999999999996 - type: map_at_5 value: 56.564 - type: mrr_at_1 value: 75.449 - type: mrr_at_10 value: 81.575 - type: mrr_at_100 value: 81.783 - type: mrr_at_1000 value: 81.792 - type: mrr_at_3 value: 80.50399999999999 - type: mrr_at_5 value: 81.172 - type: ndcg_at_1 value: 75.422 - type: ndcg_at_10 value: 66.635 - type: ndcg_at_100 value: 69.85 - type: ndcg_at_1000 value: 71.179 - type: ndcg_at_3 value: 61.648 - type: ndcg_at_5 value: 64.412 - type: precision_at_1 value: 75.422 - type: precision_at_10 value: 13.962 - type: precision_at_100 value: 1.649 - type: precision_at_1000 value: 0.183 - type: precision_at_3 value: 39.172000000000004 - type: precision_at_5 value: 25.691000000000003 - type: recall_at_1 value: 37.711 - type: recall_at_10 value: 69.811 - type: recall_at_100 value: 82.471 - type: recall_at_1000 value: 91.29 - type: recall_at_3 value: 58.757999999999996 - type: recall_at_5 value: 64.227 - task: type: Retrieval dataset: type: msmarco-pl name: MTEB MSMARCO-PL config: default split: validation revision: None metrics: - type: map_at_1 value: 17.033 - type: map_at_10 value: 27.242 - type: map_at_100 value: 28.451999999999998 - type: map_at_1000 value: 28.515 - type: map_at_3 value: 24.046 - type: map_at_5 value: 25.840999999999998 - type: mrr_at_1 value: 17.493 - type: mrr_at_10 value: 27.67 - type: mrr_at_100 value: 28.823999999999998 - type: mrr_at_1000 value: 28.881 - type: mrr_at_3 value: 24.529999999999998 - type: mrr_at_5 value: 26.27 - type: ndcg_at_1 value: 17.479 - type: ndcg_at_10 value: 33.048 - type: ndcg_at_100 value: 39.071 - type: 
ndcg_at_1000 value: 40.739999999999995 - type: ndcg_at_3 value: 26.493 - type: ndcg_at_5 value: 29.701 - type: precision_at_1 value: 17.479 - type: precision_at_10 value: 5.324 - type: precision_at_100 value: 0.8380000000000001 - type: precision_at_1000 value: 0.098 - type: precision_at_3 value: 11.408999999999999 - type: precision_at_5 value: 8.469999999999999 - type: recall_at_1 value: 17.033 - type: recall_at_10 value: 50.929 - type: recall_at_100 value: 79.262 - type: recall_at_1000 value: 92.239 - type: recall_at_3 value: 33.06 - type: recall_at_5 value: 40.747 - task: type: Classification dataset: type: mteb/amazon_massive_intent name: MTEB MassiveIntentClassification (pl) config: pl split: test revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7 metrics: - type: accuracy value: 72.31002017484867 - type: f1 value: 69.61603671063031 - task: type: Classification dataset: type: mteb/amazon_massive_scenario name: MTEB MassiveScenarioClassification (pl) config: pl split: test revision: 7d571f92784cd94a019292a1f45445077d0ef634 metrics: - type: accuracy value: 75.52790854068594 - type: f1 value: 75.4053872472259 - task: type: Retrieval dataset: type: nfcorpus-pl name: MTEB NFCorpus-PL config: default split: test revision: None metrics: - type: map_at_1 value: 5.877000000000001 - type: map_at_10 value: 12.817 - type: map_at_100 value: 16.247 - type: map_at_1000 value: 17.683 - type: map_at_3 value: 9.334000000000001 - type: map_at_5 value: 10.886999999999999 - type: mrr_at_1 value: 45.201 - type: mrr_at_10 value: 52.7 - type: mrr_at_100 value: 53.425999999999995 - type: mrr_at_1000 value: 53.461000000000006 - type: mrr_at_3 value: 50.464 - type: mrr_at_5 value: 51.827 - type: ndcg_at_1 value: 41.949999999999996 - type: ndcg_at_10 value: 34.144999999999996 - type: ndcg_at_100 value: 31.556 - type: ndcg_at_1000 value: 40.265 - type: ndcg_at_3 value: 38.07 - type: ndcg_at_5 value: 36.571 - type: precision_at_1 value: 44.272 - type: precision_at_10 value: 25.697 - type: 
precision_at_100 value: 8.077 - type: precision_at_1000 value: 2.084 - type: precision_at_3 value: 36.016999999999996 - type: precision_at_5 value: 31.703 - type: recall_at_1 value: 5.877000000000001 - type: recall_at_10 value: 16.986 - type: recall_at_100 value: 32.719 - type: recall_at_1000 value: 63.763000000000005 - type: recall_at_3 value: 10.292 - type: recall_at_5 value: 12.886000000000001 - task: type: Retrieval dataset: type: nq-pl name: MTEB NQ-PL config: default split: test revision: None metrics: - type: map_at_1 value: 25.476 - type: map_at_10 value: 38.67 - type: map_at_100 value: 39.784000000000006 - type: map_at_1000 value: 39.831 - type: map_at_3 value: 34.829 - type: map_at_5 value: 37.025000000000006 - type: mrr_at_1 value: 28.621000000000002 - type: mrr_at_10 value: 41.13 - type: mrr_at_100 value: 42.028 - type: mrr_at_1000 value: 42.059999999999995 - type: mrr_at_3 value: 37.877 - type: mrr_at_5 value: 39.763999999999996 - type: ndcg_at_1 value: 28.563 - type: ndcg_at_10 value: 45.654 - type: ndcg_at_100 value: 50.695 - type: ndcg_at_1000 value: 51.873999999999995 - type: ndcg_at_3 value: 38.359 - type: ndcg_at_5 value: 42.045 - type: precision_at_1 value: 28.563 - type: precision_at_10 value: 7.6450000000000005 - type: precision_at_100 value: 1.052 - type: precision_at_1000 value: 0.117 - type: precision_at_3 value: 17.458000000000002 - type: precision_at_5 value: 12.613 - type: recall_at_1 value: 25.476 - type: recall_at_10 value: 64.484 - type: recall_at_100 value: 86.96199999999999 - type: recall_at_1000 value: 95.872 - type: recall_at_3 value: 45.527 - type: recall_at_5 value: 54.029 - task: type: Classification dataset: type: laugustyniak/abusive-clauses-pl name: MTEB PAC config: default split: test revision: None metrics: - type: accuracy value: 65.87315377932232 - type: ap value: 76.41966964416534 - type: f1 value: 63.64417488639012 - task: type: PairClassification dataset: type: PL-MTEB/ppc-pairclassification name: MTEB PPC config: 
default split: test revision: None metrics: - type: cos_sim_accuracy value: 87.7 - type: cos_sim_ap value: 92.81319372631636 - type: cos_sim_f1 value: 90.04048582995952 - type: cos_sim_precision value: 88.11410459587957 - type: cos_sim_recall value: 92.05298013245033 - type: dot_accuracy value: 75.0 - type: dot_ap value: 83.63089957943261 - type: dot_f1 value: 80.76923076923077 - type: dot_precision value: 75.43103448275862 - type: dot_recall value: 86.9205298013245 - type: euclidean_accuracy value: 87.7 - type: euclidean_ap value: 92.94772245932825 - type: euclidean_f1 value: 90.10458567980692 - type: euclidean_precision value: 87.63693270735524 - type: euclidean_recall value: 92.71523178807946 - type: manhattan_accuracy value: 87.8 - type: manhattan_ap value: 92.95330512127123 - type: manhattan_f1 value: 90.08130081300813 - type: manhattan_precision value: 88.49840255591054 - type: manhattan_recall value: 91.72185430463577 - type: max_accuracy value: 87.8 - type: max_ap value: 92.95330512127123 - type: max_f1 value: 90.10458567980692 - task: type: PairClassification dataset: type: PL-MTEB/psc-pairclassification name: MTEB PSC config: default split: test revision: None metrics: - type: cos_sim_accuracy value: 96.19666048237477 - type: cos_sim_ap value: 98.61237969571302 - type: cos_sim_f1 value: 93.77845220030349 - type: cos_sim_precision value: 93.35347432024169 - type: cos_sim_recall value: 94.20731707317073 - type: dot_accuracy value: 94.89795918367348 - type: dot_ap value: 97.02853491357943 - type: dot_f1 value: 91.85185185185186 - type: dot_precision value: 89.33717579250721 - type: dot_recall value: 94.51219512195121 - type: euclidean_accuracy value: 96.38218923933209 - type: euclidean_ap value: 98.58145584134218 - type: euclidean_f1 value: 94.04580152671755 - type: euclidean_precision value: 94.18960244648318 - type: euclidean_recall value: 93.90243902439023 - type: manhattan_accuracy value: 96.47495361781077 - type: manhattan_ap value: 98.6108221024781 - 
type: manhattan_f1 value: 94.18960244648318 - type: manhattan_precision value: 94.47852760736197 - type: manhattan_recall value: 93.90243902439023 - type: max_accuracy value: 96.47495361781077 - type: max_ap value: 98.61237969571302 - type: max_f1 value: 94.18960244648318 - task: type: Classification dataset: type: PL-MTEB/polemo2_in name: MTEB PolEmo2.0-IN config: default split: test revision: None metrics: - type: accuracy value: 71.73130193905818 - type: f1 value: 71.17731918813324 - task: type: Classification dataset: type: PL-MTEB/polemo2_out name: MTEB PolEmo2.0-OUT config: default split: test revision: None metrics: - type: accuracy value: 46.59919028340081 - type: f1 value: 37.216392949948954 - task: type: Retrieval dataset: type: quora-pl name: MTEB Quora-PL config: default split: test revision: None metrics: - type: map_at_1 value: 66.134 - type: map_at_10 value: 80.19 - type: map_at_100 value: 80.937 - type: map_at_1000 value: 80.95599999999999 - type: map_at_3 value: 77.074 - type: map_at_5 value: 79.054 - type: mrr_at_1 value: 75.88000000000001 - type: mrr_at_10 value: 83.226 - type: mrr_at_100 value: 83.403 - type: mrr_at_1000 value: 83.406 - type: mrr_at_3 value: 82.03200000000001 - type: mrr_at_5 value: 82.843 - type: ndcg_at_1 value: 75.94 - type: ndcg_at_10 value: 84.437 - type: ndcg_at_100 value: 86.13 - type: ndcg_at_1000 value: 86.29299999999999 - type: ndcg_at_3 value: 81.07799999999999 - type: ndcg_at_5 value: 83.0 - type: precision_at_1 value: 75.94 - type: precision_at_10 value: 12.953999999999999 - type: precision_at_100 value: 1.514 - type: precision_at_1000 value: 0.156 - type: precision_at_3 value: 35.61 - type: precision_at_5 value: 23.652 - type: recall_at_1 value: 66.134 - type: recall_at_10 value: 92.991 - type: recall_at_100 value: 99.003 - type: recall_at_1000 value: 99.86 - type: recall_at_3 value: 83.643 - type: recall_at_5 value: 88.81099999999999 - task: type: Retrieval dataset: type: scidocs-pl name: MTEB SCIDOCS-PL config: 
default split: test revision: None metrics: - type: map_at_1 value: 4.183 - type: map_at_10 value: 10.626 - type: map_at_100 value: 12.485 - type: map_at_1000 value: 12.793 - type: map_at_3 value: 7.531000000000001 - type: map_at_5 value: 9.037 - type: mrr_at_1 value: 20.5 - type: mrr_at_10 value: 30.175 - type: mrr_at_100 value: 31.356 - type: mrr_at_1000 value: 31.421 - type: mrr_at_3 value: 26.900000000000002 - type: mrr_at_5 value: 28.689999999999998 - type: ndcg_at_1 value: 20.599999999999998 - type: ndcg_at_10 value: 17.84 - type: ndcg_at_100 value: 25.518 - type: ndcg_at_1000 value: 31.137999999999998 - type: ndcg_at_3 value: 16.677 - type: ndcg_at_5 value: 14.641000000000002 - type: precision_at_1 value: 20.599999999999998 - type: precision_at_10 value: 9.3 - type: precision_at_100 value: 2.048 - type: precision_at_1000 value: 0.33999999999999997 - type: precision_at_3 value: 15.533 - type: precision_at_5 value: 12.839999999999998 - type: recall_at_1 value: 4.183 - type: recall_at_10 value: 18.862000000000002 - type: recall_at_100 value: 41.592 - type: recall_at_1000 value: 69.037 - type: recall_at_3 value: 9.443 - type: recall_at_5 value: 13.028 - task: type: PairClassification dataset: type: PL-MTEB/sicke-pl-pairclassification name: MTEB SICK-E-PL config: default split: test revision: None metrics: - type: cos_sim_accuracy value: 86.32286995515696 - type: cos_sim_ap value: 82.04302619416443 - type: cos_sim_f1 value: 74.95572086432874 - type: cos_sim_precision value: 74.55954897815363 - type: cos_sim_recall value: 75.35612535612536 - type: dot_accuracy value: 83.9176518548716 - type: dot_ap value: 76.8608733580272 - type: dot_f1 value: 72.31936654569449 - type: dot_precision value: 67.36324523663184 - type: dot_recall value: 78.06267806267806 - type: euclidean_accuracy value: 86.32286995515696 - type: euclidean_ap value: 81.9648986659308 - type: euclidean_f1 value: 74.93796526054591 - type: euclidean_precision value: 74.59421312632321 - type: 
euclidean_recall value: 75.28490028490027 - type: manhattan_accuracy value: 86.30248675091724 - type: manhattan_ap value: 81.92853980116878 - type: manhattan_f1 value: 74.80968858131489 - type: manhattan_precision value: 72.74562584118439 - type: manhattan_recall value: 76.99430199430199 - type: max_accuracy value: 86.32286995515696 - type: max_ap value: 82.04302619416443 - type: max_f1 value: 74.95572086432874 - task: type: STS dataset: type: PL-MTEB/sickr-pl-sts name: MTEB SICK-R-PL config: default split: test revision: None metrics: - type: cos_sim_pearson value: 83.07566183637853 - type: cos_sim_spearman value: 79.20198022242548 - type: euclidean_pearson value: 81.27875473517936 - type: euclidean_spearman value: 79.21560102311153 - type: manhattan_pearson value: 81.21559474880459 - type: manhattan_spearman value: 79.1537846814979 - task: type: STS dataset: type: mteb/sts22-crosslingual-sts name: MTEB STS22 (pl) config: pl split: test revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80 metrics: - type: cos_sim_pearson value: 36.39657573900194 - type: cos_sim_spearman value: 40.36403461037013 - type: euclidean_pearson value: 29.143416004776316 - type: euclidean_spearman value: 40.43197841306375 - type: manhattan_pearson value: 29.18632337290767 - type: manhattan_spearman value: 40.50563343395481 - task: type: Retrieval dataset: type: scifact-pl name: MTEB SciFact-PL config: default split: test revision: None metrics: - type: map_at_1 value: 49.428 - type: map_at_10 value: 60.423 - type: map_at_100 value: 61.037 - type: map_at_1000 value: 61.065999999999995 - type: map_at_3 value: 56.989000000000004 - type: map_at_5 value: 59.041999999999994 - type: mrr_at_1 value: 52.666999999999994 - type: mrr_at_10 value: 61.746 - type: mrr_at_100 value: 62.273 - type: mrr_at_1000 value: 62.300999999999995 - type: mrr_at_3 value: 59.278 - type: mrr_at_5 value: 60.611000000000004 - type: ndcg_at_1 value: 52.333 - type: ndcg_at_10 value: 65.75 - type: ndcg_at_100 value: 68.566 - 
type: ndcg_at_1000 value: 69.314 - type: ndcg_at_3 value: 59.768 - type: ndcg_at_5 value: 62.808 - type: precision_at_1 value: 52.333 - type: precision_at_10 value: 9.167 - type: precision_at_100 value: 1.0630000000000002 - type: precision_at_1000 value: 0.11299999999999999 - type: precision_at_3 value: 23.778 - type: precision_at_5 value: 16.2 - type: recall_at_1 value: 49.428 - type: recall_at_10 value: 81.07799999999999 - type: recall_at_100 value: 93.93299999999999 - type: recall_at_1000 value: 99.667 - type: recall_at_3 value: 65.061 - type: recall_at_5 value: 72.667 - task: type: Retrieval dataset: type: trec-covid-pl name: MTEB TRECCOVID-PL config: default split: test revision: None metrics: - type: map_at_1 value: 0.22100000000000003 - type: map_at_10 value: 1.788 - type: map_at_100 value: 9.937 - type: map_at_1000 value: 24.762999999999998 - type: map_at_3 value: 0.579 - type: map_at_5 value: 0.947 - type: mrr_at_1 value: 78.0 - type: mrr_at_10 value: 88.067 - type: mrr_at_100 value: 88.067 - type: mrr_at_1000 value: 88.067 - type: mrr_at_3 value: 87.667 - type: mrr_at_5 value: 88.067 - type: ndcg_at_1 value: 76.0 - type: ndcg_at_10 value: 71.332 - type: ndcg_at_100 value: 54.80500000000001 - type: ndcg_at_1000 value: 49.504999999999995 - type: ndcg_at_3 value: 73.693 - type: ndcg_at_5 value: 73.733 - type: precision_at_1 value: 82.0 - type: precision_at_10 value: 76.8 - type: precision_at_100 value: 56.68 - type: precision_at_1000 value: 22.236 - type: precision_at_3 value: 78.667 - type: precision_at_5 value: 79.2 - type: recall_at_1 value: 0.22100000000000003 - type: recall_at_10 value: 2.033 - type: recall_at_100 value: 13.431999999999999 - type: recall_at_1000 value: 46.913 - type: recall_at_3 value: 0.625 - type: recall_at_5 value: 1.052 language: pl license: apache-2.0 widget: - source_sentence: "zapytanie: Jak dożyć 100 lat?" sentences: - "Trzeba zdrowo się odżywiać i uprawiać sport." - "Trzeba pić alkohol, imprezować i jeździć szybkimi autami." 
- "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu." ---

# MMLW-roberta-base

MMLW (muszę mieć lepszą wiadomość, Polish for "I must have a better message") models are neural text encoders for Polish.
This is a distilled model that can be used to generate embeddings for many tasks, such as semantic similarity, clustering, and information retrieval. The model can also serve as a base for further fine-tuning.
It transforms texts into 768-dimensional vectors.
The model was initialized from a Polish RoBERTa checkpoint and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We used [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-base-en) as teacher models for distillation.

## Usage (Sentence-Transformers)

⚠️ Our embedding models require the use of specific prefixes and suffixes when encoding texts. For this model, each query should be preceded by the prefix **"zapytanie: "** (Polish for "query: ") ⚠️

You can use the model like this with [sentence-transformers](https://www.SBERT.net):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Queries need the "zapytanie: " prefix; answers are encoded without a prefix.
query_prefix = "zapytanie: "
answer_prefix = ""
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

# Encode queries and answers into embedding tensors.
model = SentenceTransformer("sdadas/mmlw-roberta-base")
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

# Pick the answer with the highest cosine similarity to the query.
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# Trzeba zdrowo się odżywiać i uprawiać sport.
```

## Evaluation Results

- The model achieves an **Average Score** of **61.05** on the Polish Massive Text Embedding Benchmark (MTEB).
See [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for detailed results.
- The model achieves **NDCG@10** of **53.60** on the Polish Information Retrieval Benchmark. See [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.

## Acknowledgements

This model was trained with the A100 GPU cluster support delivered by the Gdansk University of Technology within the TASK center initiative.

## Citation

```bibtex
@article{dadas2024pirb,
  title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
  author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
  year={2024},
  eprint={2402.13350},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
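
## Appendix: ranking by cosine similarity

The usage example relies on two ideas: queries carry the "zapytanie: " prefix, and the best answer is the one whose embedding has the highest cosine similarity to the query embedding. The following is a minimal, dependency-free sketch of that ranking step, not the library's implementation. The helper names `with_query_prefix`, `cosine_similarity`, and `best_match` are hypothetical, and the toy 3-dimensional vectors are an assumption standing in for the real 768-dimensional embeddings produced by the model.

```python
import math

QUERY_PREFIX = "zapytanie: "  # query prefix described in the usage section


def with_query_prefix(text: str) -> str:
    """Prepend the query prefix unless the text already starts with it (hypothetical helper)."""
    return text if text.startswith(QUERY_PREFIX) else QUERY_PREFIX + text


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query_vec, answer_vecs):
    """Return the index of the answer vector most similar to the query vector."""
    scores = [cosine_similarity(query_vec, v) for v in answer_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vectors (assumption): stand-ins for real model embeddings.
toy_query = [1.0, 0.0, 1.0]
toy_answers = [
    [2.0, 0.0, 2.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

print(with_query_prefix("Jak dożyć 100 lat?"))
print(best_match(toy_query, toy_answers))  # prints 0
```

Because cosine similarity ignores vector length, the first toy answer wins even though its norm differs from the query's. In practice the prefixed query string would be passed to `model.encode` and the resulting embeddings compared in the same way.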