---
language:
- ru

pipeline_tag: sentence-similarity

tags:
- russian
- pretraining
- embeddings
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers

datasets:
- IlyaGusev/gazeta
- zloelias/lenta-ru

license: mit
base_model: cointegrated/LaBSE-en-ru

---

A BERT model for computing sentence embeddings in Russian. It is based on [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and has the same context length (512), embedding dimension (768), and inference speed.

## Usage

```Python
from sentence_transformers import SentenceTransformer, util

# Load the model from the Hugging Face Hub
model = SentenceTransformer('sergeyzh/LaBSE-ru-turbo')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
# Encode each sentence into a 768-dimensional embedding
embeddings = model.encode(sentences)
# Print the pairwise dot-product similarity matrix
print(util.dot_score(embeddings, embeddings))
```

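To make the meaning of the similarity matrix concrete, here is a minimal pure-Python sketch of a pairwise dot-product computation. It uses toy 3-dimensional vectors instead of real model output, and the helper names `l2_normalize` and `dot_score_matrix` are illustrative, not part of any library. It also assumes unit-length embeddings, in which case the dot product equals cosine similarity; whether this model's output is already normalized is worth checking before relying on that.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot_score_matrix(embeddings):
    """Pairwise dot products: entry [i][j] compares embedding i with embedding j."""
    return [[sum(a * b for a, b in zip(u, v)) for v in embeddings]
            for u in embeddings]

# Toy vectors standing in for real embeddings (hypothetical values).
toy = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 2.0]]
scores = dot_score_matrix([l2_normalize(v) for v in toy])
for row in scores:
    print([round(s, 3) for s in row])
```

Each diagonal entry compares a vector with itself, so after normalization it is 1; off-diagonal entries are higher for vectors that point in similar directions.
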
## Metrics

Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:

| Model                              |  CPU      |  GPU     |  size    |   Mean S  |  Mean S+W  |   dim  |
|:-----------------------------------|----------:|---------:|---------:|----------:|-----------:|-------:|
| **sergeyzh/LaBSE-ru-turbo**        | **133.40**|**15.30** |**490**   | **0.789**| **0.703**  | **768**|
| BAAI/bge-m3                        |   523.40  |    22.50 |     2166 |    0.787  |     0.696  |   1024 |
| intfloat/multilingual-e5-large     |   506.80  |    30.80 |     2136 |    0.780  |     0.686  |   1024 |
| intfloat/multilingual-e5-base      |   130.61  |    14.39 |     1061 |    0.761  |     0.669  |    768 |
| sergeyzh/rubert-tiny-turbo         |     5.51  |     3.25 |      111 |    0.749  |     0.667  |    312 |
| intfloat/multilingual-e5-small     |    40.86  |    12.09 |      449 |    0.742  |     0.645  |    384 |
| cointegrated/LaBSE-en-ru           |   133.40  |    15.30 |      490 |    0.739  |     0.667  |    768 |

| Model                              | STS      | PI       | NLI      | SA       | TI       | IA       | IC       | ICX      | NE1      | NE2      |
|:-----------------------------------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|
| **sergeyzh/LaBSE-ru-turbo**        |**0.864** |**0.748** |**0.490** |**0.814** |**0.974** |**0.806** |**0.815** |**0.802** |**0.320** |**0.401** |
| BAAI/bge-m3                        | 0.864    | 0.749    | 0.510    | 0.819    | 0.973    | 0.792    | 0.809    | 0.783    | 0.240    | 0.422    |
| intfloat/multilingual-e5-large     | 0.862    | 0.727    | 0.473    | 0.810    | 0.979    | 0.798    | 0.819    | 0.773    | 0.224    | 0.374    |
| intfloat/multilingual-e5-base      | 0.835    | 0.704    | 0.459    | 0.796    | 0.964    | 0.783    | 0.802    | 0.738    | 0.235    | 0.376    |
| sergeyzh/rubert-tiny-turbo         | 0.828    | 0.722    | 0.476    | 0.787    | 0.955    | 0.757    | 0.780    | 0.685    | 0.305    | 0.373    |
| intfloat/multilingual-e5-small     | 0.822    | 0.714    | 0.457    | 0.758    | 0.957    | 0.761    | 0.779    | 0.691    | 0.234    | 0.275    |
| cointegrated/LaBSE-en-ru           | 0.794    | 0.659    | 0.431    | 0.761    | 0.946    | 0.766    | 0.789    | 0.769    | 0.340    | 0.414    |
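
The aggregate columns appear to be plain averages of the per-task scores: `Mean S` over the eight sentence-level tasks (STS through ICX) and `Mean S+W` over all ten, including NE1 and NE2. This reading is an assumption inferred from the numbers in the tables, not something this card states. The sketch below recomputes both values for two rows as a sanity check; the names `task_scores`, `mean_s` and `mean_s_w` are illustrative.

```python
# Per-task scores copied from the per-task table above, in column order:
# STS, PI, NLI, SA, TI, IA, IC, ICX, NE1, NE2
task_scores = {
    "sergeyzh/LaBSE-ru-turbo": [0.864, 0.748, 0.490, 0.814, 0.974,
                                0.806, 0.815, 0.802, 0.320, 0.401],
    "cointegrated/LaBSE-en-ru": [0.794, 0.659, 0.431, 0.761, 0.946,
                                 0.766, 0.789, 0.769, 0.340, 0.414],
}

def mean_s(scores):
    """Average of the first eight tasks (assumed definition of Mean S)."""
    return sum(scores[:8]) / 8

def mean_s_w(scores):
    """Average of all ten tasks (assumed definition of Mean S+W)."""
    return sum(scores) / len(scores)

for name, scores in task_scores.items():
    print(f"{name}: Mean S = {mean_s(scores):.3f}, "
          f"Mean S+W = {mean_s_w(scores):.3f}")
```

Under this assumption both rows reproduce the summary table: 0.789 and 0.703 for sergeyzh/LaBSE-ru-turbo, 0.739 and 0.667 for cointegrated/LaBSE-en-ru.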