--- language: - ru pipeline_tag: sentence-similarity tags: - russian - pretraining - embeddings - feature-extraction - sentence-similarity - sentence-transformers - transformers datasets: - IlyaGusev/gazeta - zloelias/lenta-ru license: mit base_model: cointegrated/LaBSE-en-ru --- Модель BERT для расчетов эмбеддингов предложений на русском языке. Модель основана на [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) - имеет аналогичные размеры контекста (512), ембеддинга (768) и быстродействие. ## Использование: ```Python from sentence_transformers import SentenceTransformer, util model = SentenceTransformer('sergeyzh/LaBSE-ru-turbo') sentences = ["привет мир", "hello world", "здравствуй вселенная"] embeddings = model.encode(sentences) print(util.dot_score(embeddings, embeddings)) ``` ## Метрики Оценки модели на бенчмарке [encodechka](https://github.com/sergeyz-zh/encodechka): | Model | CPU | GPU | size | Mean S | Mean S+W | dim | |:-----------------------------------|----------:|---------:|---------:|----------:|-----------:|-------:| | **sergeyzh/LaBSE-ru-turbo** | 120.40 | 8.05 | 490 | 0.789 | 0.702 | 768 | | BAAI/bge-m3 | 523.40 | 22.50 | 2166 | 0.787 | 0.696 | 1024 | | intfloat/multilingual-e5-large | 506.80 | 30.80 | 2136 | 0.780 | 0.686 | 1024 | | intfloat/multilingual-e5-base | 130.61 | 14.39 | 1061 | 0.761 | 0.669 | 768 | | [sergeyzh/rubert-tiny-turbo](https://huggingface.co/sergeyzh/rubert-tiny-turbo) | 5.51 | 3.25 | 111 | 0.749 | 0.667 | 312 | | intfloat/multilingual-e5-small | 40.86 | 12.09 | 449 | 0.742 | 0.645 | 384 | | cointegrated/LaBSE-en-ru | 120.40 | 8.05 | 490 | 0.739 | 0.667 | 768 | | Model | STS | PI | NLI | SA | TI | IA | IC | ICX | NE1 | NE2 | |:-----------------------------------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------| | **sergeyzh/LaBSE-ru-turbo** | 0.864 | 0.748 | 0.490 | 0.814 | 0.974 | 0.806 | 0.815 | 0.801 | 0.305 | 0.404 | | BAAI/bge-m3 | 0.864 | 0.749 | 0.510 | 0.819 | 0.973 | 0.792 | 0.809 | 0.783 | 0.240 | 0.422 | | intfloat/multilingual-e5-large | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 | 0.798 | 0.819 | 0.773 | 0.224 | 0.374 | | intfloat/multilingual-e5-base | 0.835 | 0.704 | 0.459 | 0.796 | 0.964 | 0.783 | 0.802 | 0.738 | 0.235 | 0.376 | | [sergeyzh/rubert-tiny-turbo](https://huggingface.co/sergeyzh/rubert-tiny-turbo) | 0.828 | 0.722 | 0.476 | 0.787 | 0.955 | 0.757 | 0.780 | 0.685 | 0.305 | 0.373 | | intfloat/multilingual-e5-small | 0.822 | 0.714 | 0.457 | 0.758 | 0.957 | 0.761 | 0.779 | 0.691 | 0.234 | 0.275 | | cointegrated/LaBSE-en-ru | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 | 0.766 | 0.789 | 0.769 | 0.340 | 0.414 | Оценки модели на бенчмарке [ruMTEB](https://habr.com/ru/companies/sberdevices/articles/831150/): |Model Name | Metric | sbert_large_ mt_nlu_ru | sbert_large_ nlu_ru | [LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large | |:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|---------------------:|----------------------:| |CEDRClassification | Accuracy | 0.368 | 0.358 | 0.418 | 0.451 | 0.401 | 0.423 | **0.448** | |GeoreviewClassification | Accuracy | 0.397 | 0.400 | 0.406 | 0.438 | 0.447 | 0.461 | **0.497** | |GeoreviewClusteringP2P | V-measure | 0.584 | 0.590 | 0.626 | **0.644** | 0.586 | 0.545 | 0.605 | |HeadlineClassification | Accuracy | 0.772 | **0.793** | 0.633 | 0.688 | 0.732 | 0.757 | 0.758 | |InappropriatenessClassification | Accuracy | **0.646** | 0.625 | 0.599 | 0.615 | 0.592 | 0.588 | 0.616 | |KinopoiskClassification | Accuracy | 0.503 | 0.495 | 0.496 | 0.521 | 0.500 | 0.509 | **0.566** | |RiaNewsRetrieval | NDCG@10 | 0.214 | 0.111 | 0.651 | 0.694 | 0.700 | 0.702 | **0.807** | |RuBQReranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | **0.756** | |RuBQRetrieval | NDCG@10 | 0.298 | 0.124 | 0.622 | 0.657 | 0.685 | 0.696 | **0.741** | |RuReviewsClassification | Accuracy | 0.589 | 0.583 | 0.599 | 0.632 | 0.612 | 0.630 | **0.653** | |RuSTSBenchmarkSTS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | **0.831** | |RuSciBenchGRNTIClassification | Accuracy | 0.542 | 0.539 | 0.529 | 0.569 | 0.550 | 0.563 | **0.582** | |RuSciBenchGRNTIClusteringP2P | V-measure | **0.522** | 0.504 | 0.486 | 0.517 | 0.511 | 0.516 | 0.520 | |RuSciBenchOECDClassification | Accuracy | 0.438 | 0.430 | 0.406 | 0.440 | 0.427 | 0.423 | **0.445** | |RuSciBenchOECDClusteringP2P | V-measure | **0.473** | 0.464 | 0.426 | 0.452 | 0.443 | 0.448 | 0.450 | |SensitiveTopicsClassification | Accuracy | **0.285** | 0.280 | 0.262 | 0.272 | 0.228 | 0.234 | 0.257 | |TERRaClassification | Average Precision | 0.520 | 0.502 | **0.587** | 0.585 | 0.551 | 0.550 | 0.584 | |Model Name | Metric | sbert_large_ mt_nlu_ru | sbert_large_ nlu_ru | [LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large | |:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|----------------------:|---------------------:| |Classification | Accuracy | 0.554 | 0.552 | 0.524 | 0.558 | 0.551 | 0.561 | **0.588** | |Clustering | V-measure | 0.526 | 0.519 | 0.513 | **0.538** | 0.513 | 0.503 | 0.525 | |MultiLabelClassification | Accuracy | 0.326 | 0.319 | 0.340 | **0.361** | 0.314 | 0.329 | 0.353 | |PairClassification | Average Precision | 0.520 | 0.502 | 0.587 | **0.585** | 0.551 | 0.550 | 0.584 | |Reranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | **0.756** | |Retrieval | NDCG@10 | 0.256 | 0.118 | 0.637 | 0.675 | 0.697 | 0.699 | **0.774** | |STS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | **0.831** | |Average | Average | 0.494 | 0.438 | 0.582 | 0.604 | 0.588 | 0.594 | **0.630** |