---

language:
- ru

pipeline_tag: sentence-similarity

tags:
- russian
- pretraining
- embeddings
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers

datasets:
- IlyaGusev/gazeta
- zloelias/lenta-ru

license: mit
base_model: cointegrated/LaBSE-en-ru

---


A BERT model for computing sentence embeddings in Russian. The model is based on [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and has the same context size (512 tokens), embedding dimension (768), and inference speed.


## Usage
```python
from sentence_transformers import SentenceTransformer, util

# Load the model from the Hugging Face Hub
model = SentenceTransformer('sergeyzh/LaBSE-ru-turbo')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]

# Encode the sentences into 768-dimensional embeddings
embeddings = model.encode(sentences)

# Pairwise similarity scores (dot product)
print(util.dot_score(embeddings, embeddings))
```
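
For retrieval-style use, embeddings can be compared with cosine similarity. The sketch below is a minimal example reusing the same `sentence_transformers` API as above; the query and documents are only illustrative, and the size checks simply confirm at runtime the context window and embedding dimension stated earlier.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/LaBSE-ru-turbo')

# Runtime check of the context window and embedding size (expected: 512 and 768)
print(model.get_max_seq_length(), model.get_sentence_embedding_dimension())

# Rank candidate documents against a query by cosine similarity
query = "столица России"
docs = [
    "Москва - столица России.",
    "Париж - столица Франции.",
    "Сегодня хорошая погода.",
]

query_emb = model.encode(query, normalize_embeddings=True)
doc_embs = model.encode(docs, normalize_embeddings=True)

scores = util.cos_sim(query_emb, doc_embs)[0]
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```

With normalized embeddings, `util.dot_score` and `util.cos_sim` produce the same ranking, so either can be used for the search step.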

## Metrics
Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:

| Model                              | CPU (ms)  | GPU (ms) | Size (MB) |   Mean S  | Mean S+W   |   dim  |
|:-----------------------------------|----------:|---------:|---------:|----------:|-----------:|-------:|
| **sergeyzh/LaBSE-ru-turbo**        |   120.40  |   8.05   |  490     |    0.789  |   0.702    |   768  |
| BAAI/bge-m3                        |   523.40  |  22.50   | 2166     |    0.787  |   0.696    |  1024  |
| intfloat/multilingual-e5-large     |   506.80  |  30.80   | 2136     |    0.780  |   0.686    |  1024  |
| intfloat/multilingual-e5-base      |   130.61  |  14.39   | 1061     |    0.761  |   0.669    |   768  |
| [sergeyzh/rubert-tiny-turbo](https://huggingface.co/sergeyzh/rubert-tiny-turbo)         |     5.51  |   3.25   |  111     |    0.749  |   0.667    |   312  |
| intfloat/multilingual-e5-small     |    40.86  |  12.09   |  449     |    0.742  |   0.645    |   384  |
| cointegrated/LaBSE-en-ru           |   120.40  |   8.05   |  490     |    0.739  |   0.667    |   768  |

| Model                              | STS      | PI       | NLI      | SA       | TI       | IA       | IC       | ICX      | NE1      | NE2      |
|:-----------------------------------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|
| **sergeyzh/LaBSE-ru-turbo**        |   0.864  |  0.748   |  0.490   |  0.814   |  0.974   |  0.806   |  0.815   |  0.801   |  0.305   |  0.404   |
| BAAI/bge-m3                        |  0.864   |  0.749   |  0.510   |  0.819   |  0.973   |  0.792   |  0.809   |  0.783   |  0.240   |  0.422   |
| intfloat/multilingual-e5-large     |  0.862   |  0.727   |  0.473   |  0.810   |  0.979   |  0.798   |  0.819   |  0.773   |  0.224   |  0.374   |
| intfloat/multilingual-e5-base      |  0.835   |  0.704   |  0.459   |  0.796   |  0.964   |  0.783   |  0.802   |  0.738   |  0.235   |  0.376   |
| [sergeyzh/rubert-tiny-turbo](https://huggingface.co/sergeyzh/rubert-tiny-turbo)         |  0.828   |  0.722   |  0.476   |  0.787   |  0.955   |  0.757   |  0.780   |  0.685   |  0.305   |  0.373   |
| intfloat/multilingual-e5-small     |  0.822   |  0.714   |  0.457   |  0.758   |  0.957   |  0.761   |  0.779   |  0.691   |  0.234   |  0.275   |
| cointegrated/LaBSE-en-ru           |  0.794   |  0.659   |  0.431   |  0.761   |  0.946   |  0.766   |  0.789   |  0.769   |  0.340   |  0.414   |


Model scores on the [ruMTEB](https://habr.com/ru/companies/sberdevices/articles/831150/) benchmark:

|Model Name                         | Metric              | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | [LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|:----------------------------------|:--------------------|----------------------:|-------------------:|-------------------------------------------------------------:|---------------:|----------------------:|---------------------:|----------------------:|
|CEDRClassification                 | Accuracy            |         0.368          |         0.358       |      0.418      |        0.451      |        0.401          |        0.423         |       **0.448**       |
|GeoreviewClassification            | Accuracy            |         0.397          |         0.400       |      0.406      |        0.438      |        0.447          |        0.461         |       **0.497**       |
|GeoreviewClusteringP2P             | V-measure           |         0.584          |         0.590       |      0.626      |      **0.644**    |        0.586          |        0.545         |         0.605         |
|HeadlineClassification             | Accuracy            |         0.772          |       **0.793**     |      0.633      |        0.688      |        0.732          |        0.757         |         0.758         |
|InappropriatenessClassification    | Accuracy            |       **0.646**        |         0.625       |      0.599      |        0.615      |        0.592          |        0.588         |         0.616         |
|KinopoiskClassification            | Accuracy            |         0.503          |         0.495       |      0.496      |        0.521      |        0.500          |        0.509         |       **0.566**       |
|RiaNewsRetrieval                   | NDCG@10             |         0.214          |         0.111       |      0.651      |        0.694      |        0.700          |        0.702         |       **0.807**       |
|RuBQReranking                      | MAP@10              |         0.561          |         0.468       |      0.688      |        0.687      |        0.715          |        0.720         |       **0.756**       |
|RuBQRetrieval                      | NDCG@10             |         0.298          |         0.124       |      0.622      |        0.657      |        0.685          |        0.696         |       **0.741**       |
|RuReviewsClassification            | Accuracy            |         0.589          |         0.583       |      0.599      |        0.632      |        0.612          |        0.630         |       **0.653**       |
|RuSTSBenchmarkSTS                  | Pearson correlation |         0.712          |         0.588       |      0.788      |        0.822      |        0.781          |        0.796         |       **0.831**       |
|RuSciBenchGRNTIClassification      | Accuracy            |         0.542          |         0.539       |      0.529      |        0.569      |        0.550          |        0.563         |       **0.582**       |
|RuSciBenchGRNTIClusteringP2P       | V-measure           |       **0.522**        |         0.504       |      0.486      |        0.517      |        0.511          |        0.516         |         0.520         |
|RuSciBenchOECDClassification       | Accuracy            |         0.438          |         0.430       |      0.406      |        0.440      |        0.427          |        0.423         |       **0.445**       |
|RuSciBenchOECDClusteringP2P        | V-measure           |       **0.473**        |         0.464       |      0.426      |        0.452      |        0.443          |        0.448         |         0.450         |
|SensitiveTopicsClassification      | Accuracy            |       **0.285**        |         0.280       |      0.262      |        0.272      |        0.228          |        0.234         |         0.257         |
|TERRaClassification                | Average Precision   |         0.520          |         0.502       |    **0.587**    |        0.585      |        0.551          |        0.550         |         0.584         |



Averages by task type:

|Model Name                         | Metric              | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | [LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|:----------------------------------|:--------------------|----------------------:|-------------------:|-------------------------------------------------------------:|---------------:|----------------------:|---------------------:|----------------------:|
|Classification                     | Accuracy            |         0.554          |        0.552        |      0.524      |        0.558      |        0.551          |        0.561          |      **0.588**       |
|Clustering                         | V-measure           |         0.526          |        0.519        |      0.513      |      **0.538**    |        0.513          |        0.503          |        0.525         |
|MultiLabelClassification           | Accuracy            |         0.326          |        0.319        |      0.340      |      **0.361**    |        0.314          |        0.329          |        0.353         |
|PairClassification                 | Average Precision   |         0.520          |        0.502        |    **0.587**    |        0.585      |        0.551          |        0.550          |        0.584         |
|Reranking                          | MAP@10              |         0.561          |        0.468        |      0.688      |        0.687      |        0.715          |        0.720          |      **0.756**       |
|Retrieval                          | NDCG@10             |         0.256          |        0.118        |      0.637      |        0.675      |        0.697          |        0.699          |      **0.774**       |
|STS                                | Pearson correlation |         0.712          |        0.588        |      0.788      |        0.822      |        0.781          |        0.796          |      **0.831**       |
|Average                            | Average             |         0.494          |        0.438        |      0.582      |        0.604      |        0.588          |        0.594          |      **0.630**       |