---
language:
  - ru
pipeline_tag: sentence-similarity
tags:
  - russian
  - pretraining
  - embeddings
  - feature-extraction
  - sentence-similarity
  - sentence-transformers
  - transformers
datasets:
  - IlyaGusev/gazeta
  - zloelias/lenta-ru
license: mit
base_model: cointegrated/LaBSE-en-ru
---

A BERT model for computing sentence embeddings in Russian. It is based on cointegrated/LaBSE-en-ru and has the same context length (512), embedding dimensionality (768), and inference speed.
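
Both figures can be read directly off the loaded model; a minimal sanity check using standard sentence-transformers attributes (illustrative, not part of the original card):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sergeyzh/LaBSE-ru-turbo')

# Maximum input length in tokens and the embedding dimensionality
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 768
```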

## Usage

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/LaBSE-ru-turbo')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)

# Pairwise similarity matrix (dot products of the sentence embeddings)
print(util.dot_score(embeddings, embeddings))
```
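
The same embeddings also work for simple semantic search over a small corpus. A minimal sketch using sentence-transformers' `util.semantic_search`; the example sentences and `top_k` value are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/LaBSE-ru-turbo')

corpus = ["кошка сидит на окне", "собака бежит по парку", "hello world"]
# With normalized embeddings, dot product equals cosine similarity
corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

query_embedding = model.encode("кот на подоконнике", normalize_embeddings=True)

# Retrieve the two corpus sentences closest to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```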

## Metrics

Model scores on the encodechka benchmark:

| Model                          |    CPU |   GPU | size | Mean S | Mean S+W |  dim |
|:-------------------------------|-------:|------:|-----:|-------:|---------:|-----:|
| sergeyzh/LaBSE-ru-turbo        | 120.40 |  8.05 |  490 |  0.789 |    0.702 |  768 |
| BAAI/bge-m3                    | 523.40 | 22.50 | 2166 |  0.787 |    0.696 | 1024 |
| intfloat/multilingual-e5-large | 506.80 | 30.80 | 2136 |  0.780 |    0.686 | 1024 |
| intfloat/multilingual-e5-base  | 130.61 | 14.39 | 1061 |  0.761 |    0.669 |  768 |
| sergeyzh/rubert-tiny-turbo     |   5.51 |  3.25 |  111 |  0.749 |    0.667 |  312 |
| intfloat/multilingual-e5-small |  40.86 | 12.09 |  449 |  0.742 |    0.645 |  384 |
| cointegrated/LaBSE-en-ru       | 120.40 |  8.05 |  490 |  0.739 |    0.667 |  768 |

| Model                          |   STS |    PI |   NLI |    SA |    TI |    IA |    IC |   ICX |   NE1 |   NE2 |
|:-------------------------------|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| sergeyzh/LaBSE-ru-turbo        | 0.864 | 0.748 | 0.490 | 0.814 | 0.974 | 0.806 | 0.815 | 0.801 | 0.305 | 0.404 |
| BAAI/bge-m3                    | 0.864 | 0.749 | 0.510 | 0.819 | 0.973 | 0.792 | 0.809 | 0.783 | 0.240 | 0.422 |
| intfloat/multilingual-e5-large | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 | 0.798 | 0.819 | 0.773 | 0.224 | 0.374 |
| intfloat/multilingual-e5-base  | 0.835 | 0.704 | 0.459 | 0.796 | 0.964 | 0.783 | 0.802 | 0.738 | 0.235 | 0.376 |
| sergeyzh/rubert-tiny-turbo     | 0.828 | 0.722 | 0.476 | 0.787 | 0.955 | 0.757 | 0.780 | 0.685 | 0.305 | 0.373 |
| intfloat/multilingual-e5-small | 0.822 | 0.714 | 0.457 | 0.758 | 0.957 | 0.761 | 0.779 | 0.691 | 0.234 | 0.275 |
| cointegrated/LaBSE-en-ru       | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 | 0.766 | 0.789 | 0.769 | 0.340 | 0.414 |