	---
	language:
	- ru

	pipeline_tag: sentence-similarity

	tags:
	- russian
	- pretraining
	- embeddings
	- feature-extraction
	- sentence-similarity
	- sentence-transformers
	- transformers

	datasets:
	- IlyaGusev/gazeta
	- zloelias/lenta-ru

	license: mit
	base_model: cointegrated/LaBSE-en-ru

	---

	A BERT model for computing sentence embeddings for Russian text. It is based on [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and has the same context length (512), embedding dimension (768), and inference speed.


	## Usage
	```Python
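	from sentence_transformers import SentenceTransformer, util

	# Load the model from the Hugging Face Hub
	model = SentenceTransformer('sergeyzh/LaBSE-ru-turbo')

	sentences = ["привет мир", "hello world", "здравствуй вселенная"]
	# Encode each sentence into one embedding vector
	embeddings = model.encode(sentences)
	# Print the pairwise dot-product similarity matrix (3 x 3)
	print(util.dot_score(embeddings, embeddings))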
	```
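
The snippet above prints raw dot products, which equal cosine similarity only when the embeddings have unit length. This card does not say whether the model normalizes its output, so the sketch below normalizes explicitly before comparing. The `cosine_similarity_matrix` helper and the toy vectors are illustrative assumptions that stand in for the array returned by `model.encode`, which keeps the example runnable without downloading the model.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return pairwise cosine similarities between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Clip the norms so an all-zero row does not cause a division by zero.
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Toy 4-dimensional stand-ins for real sentence embeddings (hypothetical values).
toy_embeddings = np.array([
    [0.2, 0.1, 0.0, 0.4],
    [0.4, 0.2, 0.0, 0.8],   # same direction as row 0, twice the length
    [-0.3, 0.5, 0.1, 0.0],
])

similarity = cosine_similarity_matrix(toy_embeddings)
print(np.round(similarity, 3))
```

With real model output, pass `model.encode(sentences)` in place of `toy_embeddings`. The `sentence_transformers` library also provides `util.cos_sim`, which computes the same quantity.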

	## Metrics
	Model scores on the [encodechka](https://github.com/sergeyz-zh/encodechka) benchmark:

	| Model | CPU | GPU | size | Mean S | Mean S+W | dim |
	|:-----------------------------------|----------:|---------:|---------:|----------:|-----------:|-------:|
	| sergeyzh/LaBSE-ru-turbo | 120.40 | 8.05 | 490 | 0.789 | 0.702 | 768 |
	| BAAI/bge-m3 | 523.40 | 22.50 | 2166 | 0.787 | 0.696 | 1024 |
	| intfloat/multilingual-e5-large | 506.80 | 30.80 | 2136 | 0.780 | 0.686 | 1024 |
	| intfloat/multilingual-e5-base | 130.61 | 14.39 | 1061 | 0.761 | 0.669 | 768 |
	| [sergeyzh/rubert-tiny-turbo](https://huggingface.co/sergeyzh/rubert-tiny-turbo) | 5.51 | 3.25 | 111 | 0.749 | 0.667 | 312 |
	| intfloat/multilingual-e5-small | 40.86 | 12.09 | 449 | 0.742 | 0.645 | 384 |
	| cointegrated/LaBSE-en-ru | 120.40 | 8.05 | 490 | 0.739 | 0.667 | 768 |

	| Model | STS | PI | NLI | SA | TI | IA | IC | ICX | NE1 | NE2 |
	|:-----------------------------------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|
	| sergeyzh/LaBSE-ru-turbo | 0.864 | 0.748 | 0.490 | 0.814 | 0.974 | 0.806 | 0.815 | 0.801 | 0.305 | 0.404 |
	| BAAI/bge-m3 | 0.864 | 0.749 | 0.510 | 0.819 | 0.973 | 0.792 | 0.809 | 0.783 | 0.240 | 0.422 |
	| intfloat/multilingual-e5-large | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 | 0.798 | 0.819 | 0.773 | 0.224 | 0.374 |
	| intfloat/multilingual-e5-base | 0.835 | 0.704 | 0.459 | 0.796 | 0.964 | 0.783 | 0.802 | 0.738 | 0.235 | 0.376 |
	| [sergeyzh/rubert-tiny-turbo](https://huggingface.co/sergeyzh/rubert-tiny-turbo) | 0.828 | 0.722 | 0.476 | 0.787 | 0.955 | 0.757 | 0.780 | 0.685 | 0.305 | 0.373 |
	| intfloat/multilingual-e5-small | 0.822 | 0.714 | 0.457 | 0.758 | 0.957 | 0.761 | 0.779 | 0.691 | 0.234 | 0.275 |
	| cointegrated/LaBSE-en-ru | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 | 0.766 | 0.789 | 0.769 | 0.340 | 0.414 |
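
The two encodechka tables appear to be related: for the rows checked here, `Mean S` matches the average of the eight columns STS through ICX, and `Mean S+W` matches the average of all ten columns including NE1 and NE2. The sketch below recomputes both for two rows copied from the tables. The split into sentence-level ("S") and word-level ("W") tasks is an assumption inferred from the column layout, and `summarize` is a hypothetical helper.

```python
from statistics import mean

# Assumed grouping: eight sentence-level tasks, two word-level (NER) tasks.
SENTENCE_TASKS = ["STS", "PI", "NLI", "SA", "TI", "IA", "IC", "ICX"]
WORD_TASKS = ["NE1", "NE2"]
ALL_TASKS = SENTENCE_TASKS + WORD_TASKS

# Per-task scores copied from the table above.
scores = {
    "sergeyzh/LaBSE-ru-turbo": dict(zip(
        ALL_TASKS,
        [0.864, 0.748, 0.490, 0.814, 0.974, 0.806, 0.815, 0.801, 0.305, 0.404],
    )),
    "BAAI/bge-m3": dict(zip(
        ALL_TASKS,
        [0.864, 0.749, 0.510, 0.819, 0.973, 0.792, 0.809, 0.783, 0.240, 0.422],
    )),
}


def summarize(row):
    """Return (Mean S, Mean S+W) for one model, rounded to three decimals."""
    mean_s = mean(row[task] for task in SENTENCE_TASKS)
    mean_sw = mean(row[task] for task in ALL_TASKS)
    return round(mean_s, 3), round(mean_sw, 3)


summary = {model: summarize(row) for model, row in scores.items()}
for model, (mean_s, mean_sw) in summary.items():
    print(f"{model}: Mean S = {mean_s}, Mean S+W = {mean_sw}")
```

For these two rows the recomputed values agree with the `Mean S` and `Mean S+W` columns of the first table.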


	Model scores on the [ruMTEB](https://habr.com/ru/companies/sberdevices/articles/831150/) benchmark:

	|Model Name | Metric | sbert_large_ mt_nlu_ru | sbert_large_ nlu_ru | [LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
	|:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|---------------------:|----------------------:|
	|CEDRClassification | Accuracy | 0.368 | 0.358 | 0.418 | 0.451 | 0.401 | 0.423 | 0.448 |
	|GeoreviewClassification | Accuracy | 0.397 | 0.400 | 0.406 | 0.438 | 0.447 | 0.461 | 0.497 |
	|GeoreviewClusteringP2P | V-measure | 0.584 | 0.590 | 0.626 | 0.644 | 0.586 | 0.545 | 0.605 |
	|HeadlineClassification | Accuracy | 0.772 | 0.793 | 0.633 | 0.688 | 0.732 | 0.757 | 0.758 |
	|InappropriatenessClassification | Accuracy | 0.646 | 0.625 | 0.599 | 0.615 | 0.592 | 0.588 | 0.616 |
	|KinopoiskClassification | Accuracy | 0.503 | 0.495 | 0.496 | 0.521 | 0.500 | 0.509 | 0.566 |
	|RiaNewsRetrieval | NDCG@10 | 0.214 | 0.111 | 0.651 | 0.694 | 0.700 | 0.702 | 0.807 |
	|RuBQReranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | 0.756 |
	|RuBQRetrieval | NDCG@10 | 0.298 | 0.124 | 0.622 | 0.657 | 0.685 | 0.696 | 0.741 |
	|RuReviewsClassification | Accuracy | 0.589 | 0.583 | 0.599 | 0.632 | 0.612 | 0.630 | 0.653 |
	|RuSTSBenchmarkSTS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | 0.831 |
	|RuSciBenchGRNTIClassification | Accuracy | 0.542 | 0.539 | 0.529 | 0.569 | 0.550 | 0.563 | 0.582 |
	|RuSciBenchGRNTIClusteringP2P | V-measure | 0.522 | 0.504 | 0.486 | 0.517 | 0.511 | 0.516 | 0.520 |
	|RuSciBenchOECDClassification | Accuracy | 0.438 | 0.430 | 0.406 | 0.440 | 0.427 | 0.423 | 0.445 |
	|RuSciBenchOECDClusteringP2P | V-measure | 0.473 | 0.464 | 0.426 | 0.452 | 0.443 | 0.448 | 0.450 |
	|SensitiveTopicsClassification | Accuracy | 0.285 | 0.280 | 0.262 | 0.272 | 0.228 | 0.234 | 0.257 |
	|TERRaClassification | Average Precision | 0.520 | 0.502 | 0.587 | 0.585 | 0.551 | 0.550 | 0.584 |

	|Model Name | Metric | sbert_large_ mt_nlu_ru | sbert_large_ nlu_ru | [LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
	|:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|----------------------:|---------------------:|
	|Classification | Accuracy | 0.554 | 0.552 | 0.524 | 0.558 | 0.551 | 0.561 | 0.588 |
	|Clustering | V-measure | 0.526 | 0.519 | 0.513 | 0.538 | 0.513 | 0.503 | 0.525 |
	|MultiLabelClassification | Accuracy | 0.326 | 0.319 | 0.340 | 0.361 | 0.314 | 0.329 | 0.353 |
	|PairClassification | Average Precision | 0.520 | 0.502 | 0.587 | 0.585 | 0.551 | 0.550 | 0.584 |
	|Reranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | 0.756 |
	|Retrieval | NDCG@10 | 0.256 | 0.118 | 0.637 | 0.675 | 0.697 | 0.699 | 0.774 |
	|STS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | 0.831 |
	|Average | Average | 0.494 | 0.438 | 0.582 | 0.604 | 0.588 | 0.594 | 0.630 |
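
The category rows in the second ruMTEB table look like per-category means of the task rows in the first one, and the `Average` row looks like the mean of the seven category scores. The sketch below checks this for the LaBSE-ru-turbo column. The task-to-category grouping is an assumption inferred from the task names and metrics, not something this card states, and the inputs are already rounded to three decimals, so the recomputed values are compared with a tolerance of 0.001.

```python
from statistics import mean

# LaBSE-ru-turbo per-task scores copied from the first ruMTEB table.
# Assumed grouping of tasks into categories (inferred from task names).
task_scores = {
    "Classification": [0.438, 0.688, 0.615, 0.521, 0.632, 0.569, 0.440],
    "Clustering": [0.644, 0.517, 0.452],
    "MultiLabelClassification": [0.451, 0.272],   # CEDR, SensitiveTopics
    "PairClassification": [0.585],                # TERRa
    "Reranking": [0.687],                         # RuBQReranking
    "Retrieval": [0.694, 0.657],                  # RiaNews, RuBQ
    "STS": [0.822],                               # RuSTSBenchmarkSTS
}

# LaBSE-ru-turbo category scores copied from the second ruMTEB table.
reported = {
    "Classification": 0.558,
    "Clustering": 0.538,
    "MultiLabelClassification": 0.361,
    "PairClassification": 0.585,
    "Reranking": 0.687,
    "Retrieval": 0.675,
    "STS": 0.822,
}
reported_average = 0.604

category_means = {name: mean(values) for name, values in task_scores.items()}
overall = mean(list(category_means.values()))

for name, value in category_means.items():
    print(f"{name:26s} computed={value:.4f} reported={reported[name]:.3f}")
print(f"{'Average':26s} computed={overall:.4f} reported={reported_average:.3f}")
```

Because `Average` weights each category equally, the seven classification tasks together count as much as the single STS task.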
