|
--- |
|
language: |
|
- ru |
|
library_name: sentence-transformers |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
widget: [] |
|
pipeline_tag: sentence-similarity |
|
license: apache-2.0 |
|
datasets: |
|
- deepvk/ru-HNP |
|
- deepvk/ru-WANLI |
|
- Shitao/bge-m3-data |
|
- RussianNLP/russian_super_glue |
|
- reciTAL/mlsum |
|
- Milana/russian_keywords |
|
- IlyaGusev/gazeta |
|
- d0rj/gsm8k-ru |
|
- bragovo/dsum_ru |
|
- CarlBrendt/Summ_Dialog_News |
|
--- |
|
|
|
# USER-bge-m3 |
|
|
|
|
|
**U**niversal **S**entence **E**ncoder for **R**ussian (USER) is a [sentence-transformers](https://www.SBERT.net) model for extracting embeddings, designed exclusively for the Russian language.
|
It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks like clustering or semantic search.
|
|
|
|
|
This model is initialized from [`TatonkaHF/bge-m3_en_ru`](https://huggingface.co/TatonkaHF/bge-m3_en_ru), a shrunken version of the [`baai/bge-m3`](https://huggingface.co/BAAI/bge-m3) model, and is trained to work mainly with the Russian language. Its quality on other languages was not evaluated.
|
|
|
|
|
## Usage |
|
|
|
|
|
Using this model becomes easy when you have [`sentence-transformers`](https://www.SBERT.net) installed: |
|
|
|
|
|
``` |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
|
|
Then you can use the model like this: |
|
|
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
|
|
input_texts = [ |
|
"Когда был спущен на воду первый миноносец «Спокойный»?", |
|
"Есть ли нефть в Удмуртии?", |
|
"Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.", |
|
"Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году." |
|
] |
|
|
|
|
|
model = SentenceTransformer("deepvk/USER-bge-m3") |
|
embeddings = model.encode(input_texts, normalize_embeddings=True) |
|
``` |
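Since `normalize_embeddings=True` returns unit-length vectors, query-passage relevance can be scored directly with a dot product; here is a minimal follow-up to the snippet above:

```python
# Rows 0-1 of `embeddings` are the queries, rows 2-3 are the passages.
scores = embeddings[:2] @ embeddings[2:].T
print(scores)  # higher values indicate higher relevance
```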
|
|
|
|
|
Alternatively, you can use the model directly with [`transformers`](https://huggingface.co/docs/transformers/en/index):
|
|
|
|
|
```python |
|
import torch
import torch.nn.functional as F
|
from transformers import AutoTokenizer, AutoModel |
|
|
|
|
|
input_texts = [ |
|
"Когда был спущен на воду первый миноносец «Спокойный»?", |
|
"Есть ли нефть в Удмуртии?", |
|
"Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.", |
|
"Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году." |
|
] |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained("deepvk/USER-bge-m3") |
|
model = AutoModel.from_pretrained("deepvk/USER-bge-m3") |
|
model.eval() |
|
|
|
|
|
encoded_input = tokenizer(input_texts, padding=True, truncation=True, return_tensors='pt')
|
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
|
|
|
# normalize embeddings |
|
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
|
|
|
# [[0.5567, 0.3014], |
|
# [0.1701, 0.7122]] |
|
scores = (sentence_embeddings[:2] @ sentence_embeddings[2:].T) |
|
``` |
|
|
|
You can also use the native [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding) library for evaluation. Usage is described in the [`bge-m3` model card](https://huggingface.co/BAAI/bge-m3).
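For reference, a minimal sketch of dense scoring through FlagEmbedding (argument values here are illustrative, not the settings we used for evaluation):

```python
from FlagEmbedding import BGEM3FlagModel

# Load the model; fp16 speeds up GPU inference at a small precision cost.
model = BGEM3FlagModel("deepvk/USER-bge-m3", use_fp16=True)

queries = ["Есть ли нефть в Удмуртии?"]
passages = ["Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день."]

# `encode` returns a dict; dense embeddings live under "dense_vecs".
# The library can also return sparse and multi-vector outputs,
# which we did not evaluate thoroughly (see Limitations).
q_emb = model.encode(queries)["dense_vecs"]
p_emb = model.encode(passages)["dense_vecs"]

print(q_emb @ p_emb.T)
```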
|
|
|
|
|
# Training Details |
|
|
|
|
|
We follow the [`USER-base`](https://huggingface.co/deepvk/USER-base) model training algorithm, with several changes since we use a different backbone.
|
|
|
|
|
**Initialization:** [`TatonkaHF/bge-m3_en_ru`](https://huggingface.co/TatonkaHF/bge-m3_en_ru) – a shrunken version of [`baai/bge-m3`](https://huggingface.co/BAAI/bge-m3) that supports only Russian and English tokens.
|
|
|
|
|
**Fine-tuning:** We perform supervised fine-tuning of two separate models, split by data symmetry, and then merge them via [`LM-Cocktail`](https://arxiv.org/abs/2311.13534):
|
|
|
|
|
1. Since we split the data, we could additionally apply the [AnglE loss](https://arxiv.org/abs/2309.12871) to the symmetric model, which enhances performance on symmetric tasks. |
|
|
|
|
|
2. Finally, we added the original `bge-m3` model to the two obtained models to prevent catastrophic forgetting, tuning the weights for the merger using `LM-Cocktail` to produce the final model, **USER-bge-m3**. |
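For illustration, a minimal sketch of such a merge with the `LM_Cocktail` package; the checkpoint paths and weights below are placeholders, not the actual tuned values behind **USER-bge-m3**:

```python
from LM_Cocktail import mix_models

# Hypothetical checkpoint paths; the weights are placeholders and would be
# tuned on held-out data rather than fixed by hand.
merged_model = mix_models(
    model_names_or_paths=[
        "path/to/symmetric-finetuned-checkpoint",
        "path/to/asymmetric-finetuned-checkpoint",
        "BAAI/bge-m3",
    ],
    model_type="encoder",
    weights=[0.4, 0.4, 0.2],
    output_path="./merged-encoder",
)
```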
|
|
|
|
|
### Dataset |
|
|
|
|
|
During model development, we additionally collected 2 datasets:
|
[`deepvk/ru-HNP`](https://huggingface.co/datasets/deepvk/ru-HNP) and |
|
[`deepvk/ru-WANLI`](https://huggingface.co/datasets/deepvk/ru-WANLI). |
|
|
|
|
|
| Symmetric Dataset | Size | Asymmetric Dataset | Size | |
|
|-------------------|-------|--------------------|------| |
|
| **AllNLI** | 282 644 | [**MIRACL**](https://huggingface.co/datasets/Shitao/bge-m3-data/tree/main) | 10 000 | |
|
| [MedNLI](https://github.com/jgc128/mednli) | 3 699 | [MLDR](https://huggingface.co/datasets/Shitao/bge-m3-data/tree/main) | 1 864 | |
|
| [RCB](https://huggingface.co/datasets/RussianNLP/russian_super_glue) | 392 | [Lenta](https://github.com/yutkin/Lenta.Ru-News-Dataset) | 185 972 | |
|
| [Terra](https://huggingface.co/datasets/RussianNLP/russian_super_glue) | 1 359 | [Mlsum](https://huggingface.co/datasets/reciTAL/mlsum) | 51 112 | |
|
| [Tapaco](https://huggingface.co/datasets/tapaco) | 91 240 | [Mr-TyDi](https://huggingface.co/datasets/Shitao/bge-m3-data/tree/main) | 536 600 | |
|
| [**deepvk/ru-WANLI**](https://huggingface.co/datasets/deepvk/ru-WANLI) | 35 455 | [Panorama](https://huggingface.co/datasets/its5Q/panorama) | 11 024 | |
|
| [**deepvk/ru-HNP**](https://huggingface.co/datasets/deepvk/ru-HNP) | 500 000 | [PravoIsrael](https://huggingface.co/datasets/TarasHu/pravoIsrael) | 26 364 | |
|
| | | [Xlsum](https://huggingface.co/datasets/csebuetnlp/xlsum) | 124 486 | |
|
| | | [Fialka-v1](https://huggingface.co/datasets/0x7o/fialka-v1) | 130 000 | |
|
| | | [RussianKeywords](https://huggingface.co/datasets/Milana/russian_keywords) | 16 461 | |
|
| | | [Gazeta](https://huggingface.co/datasets/IlyaGusev/gazeta) | 121 928 | |
|
| | | [Gsm8k-ru](https://huggingface.co/datasets/d0rj/gsm8k-ru) | 7 470 | |
|
| | | [DSumRu](https://huggingface.co/datasets/bragovo/dsum_ru) | 27 191 | |
|
| | | [SummDialogNews](https://huggingface.co/datasets/CarlBrendt/Summ_Dialog_News) | 75 700 | |
|
|
|
|
|
|
|
|
|
**Total positive pairs:** 2,240,961 |
|
**Total negative pairs:** 792,644 (negative pairs from AllNLI, MIRACL, deepvk/ru-WANLI, deepvk/ru-HNP)
|
|
|
|
|
For all labeled datasets, we use only their training sets for fine-tuning.
|
For the Gazeta, Mlsum, and Xlsum datasets, (title/text) and (title/summary) pairs are combined and used as asymmetric data.
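As an illustration, a hedged sketch of how such pairs could be assembled from Gazeta (field names follow the dataset card; this is not our exact preprocessing code, and recent `datasets` versions may require `trust_remote_code=True`):

```python
from datasets import load_dataset

# Gazeta exposes "title", "text", and "summary" fields; pairing the title with
# both the body and the summary yields the asymmetric (query, document) examples.
gazeta = load_dataset("IlyaGusev/gazeta", split="train")

asymmetric_pairs = []
for row in gazeta:
    asymmetric_pairs.append((row["title"], row["text"]))
    asymmetric_pairs.append((row["title"], row["summary"]))
```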
|
|
|
|
|
`AllNLI` is a Russian translation of the combined SNLI, MNLI, and ANLI datasets.
|
|
|
|
|
## Experiments |
|
|
|
|
|
We compare our model with the base [`baai/bge-m3`](https://huggingface.co/BAAI/bge-m3) on the [`encodechka`](https://github.com/avidale/encodechka) benchmark.
|
In addition, we evaluate the model on the Russian subset of [`MTEB`](https://github.com/embeddings-benchmark/mteb) across Classification, Reranking, Multilabel Classification, STS, Retrieval, and PairClassification tasks.
|
We use validation scripts from the official repositories for each of the tasks. |
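As a rough reproduction recipe (assuming a recent `mteb` release; the exact task selection and versions we used may differ), the Russian subset can be run roughly like this:

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER-bge-m3")

# "rus" is the ISO 639-3 language code used by mteb for Russian.
tasks = mteb.get_tasks(languages=["rus"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="mteb_results")
```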
|
|
|
|
|
Results on encodechka: |
|
| Model | Mean S | Mean S+W | STS | PI | NLI | SA | TI | IA | IC | ICX | NE1 | NE2 | |
|
|-------------|--------|----------|------|------|------|------|------|------|------|------|------|------| |
|
| [`baai/bge-m3`](https://huggingface.co/BAAI/bge-m3) | 0.787 | 0.696 | 0.86 | 0.75 | 0.51 | 0.82 | 0.97 | 0.79 | 0.81 | 0.78 | 0.24 | 0.42 | |
|
| `USER-bge-m3` | **0.799** | **0.709** | **0.87** | **0.76** | **0.58** | 0.82 | 0.97 | 0.79 | 0.81 | 0.78 | **0.28** | **0.43** | |
|
|
|
|
|
Results on MTEB: |
|
|
|
|
|
| Type | [`baai/bge-m3`](https://huggingface.co/BAAI/bge-m3) | `USER-bge-m3` | |
|
|---------------------------|--------|-------------| |
|
| Average (30 datasets) | 0.689 | **0.706** | |
|
| Classification Average (12 datasets) | 0.571 | **0.594** | |
|
| Reranking Average (2 datasets) | **0.698** | 0.688 | |
|
| MultilabelClassification (2 datasets) | 0.343 | **0.359** | |
|
| STS Average (4 datasets) | 0.735 | **0.753** | |
|
| Retrieval Average (6 datasets) | **0.945** | 0.934 | |
|
| PairClassification Average (4 datasets) | 0.784 | **0.833** | |
|
|
|
|
|
## Limitations |
|
|
|
|
|
We did not thoroughly evaluate the model's sparse and multi-vector encoding abilities.
|
|
|
|
|
## Citations |
|
|
|
|
|
``` |
|
@misc{deepvk2024user,
  title={USER: Universal Sentence Encoder for Russian},
  author={Malashenko, Boris and Zemerov, Anton and Spirin, Egor},
  url={https://huggingface.co/datasets/deepvk/USER-base},
  publisher={Hugging Face},
  year={2024},
}
|
``` |
|
|