metadata

library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
license: apache-2.0
datasets:
  - deepvk/ru-HNP
  - deepvk/ru-WANLI
  - Shitao/bge-m3-data
  - RussianNLP/russian_super_glue
  - reciTAL/mlsum
  - Helsinki-NLP/opus-100
  - Helsinki-NLP/bible_para
  - d0rj/rudetoxifier_data_detox
  - s-nlp/ru_paradetox
  - Milana/russian_keywords
  - IlyaGusev/gazeta
  - d0rj/gsm8k-ru
  - bragovo/dsum_ru
  - CarlBrendt/Summ_Dialog_News
language:
  - ru

USER-base

Universal Sentence Encoder for Russian (USER) is a sentence-transformer model for extracting embeddings exclusively for the Russian language. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

This model is initialized from deepvk/deberta-v1-base and trained to work exclusively with the Russian language. Its quality on other languages was not evaluated.

Usage

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = [
  "query: Когда был спущен на воду первый миноносец «Спокойный»?",
  "query: Есть ли нефть в Удмуртии?",
  "passage: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
  "passage: Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году."
]

model = SentenceTransformer("deepvk/USER-base")
embeddings = model.encode(input_texts, normalize_embeddings=True)

Alternatively, you can use the model directly with transformers:

import torch.nn.functional as F
from torch import Tensor, inference_mode
from transformers import AutoTokenizer, AutoModel

def average_pool(
    last_hidden_states: Tensor,
    attention_mask: Tensor
) -> Tensor:
    # Zero out padding positions, then average over the real tokens only.
    last_hidden = last_hidden_states.masked_fill(
        ~attention_mask[..., None].bool(), 0.0
    )
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = [
  "query: Когда был спущен на воду первый миноносец «Спокойный»?",
  "query: Есть ли нефть в Удмуртии?",
  "passage: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
  "passage: Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году."
]

tokenizer = AutoTokenizer.from_pretrained("deepvk/USER-base")
model = AutoModel.from_pretrained("deepvk/USER-base")

batch_dict = tokenizer(
  input_texts, padding=True, truncation=True, return_tensors="pt"
)
with inference_mode():
  outputs = model(**batch_dict)
  embeddings = average_pool(
    outputs.last_hidden_state, batch_dict["attention_mask"]
  )
  embeddings = F.normalize(embeddings, p=2, dim=1)

# Scores for query-passage
scores = (embeddings[:2] @ embeddings[2:].T) * 100
# [[55.86, 30.95],
#  [22.82, 59.46]]
print(scores.round(decimals=2))
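
To make the pooling step concrete, here is a small dependency-free sketch of what `average_pool` followed by `F.normalize` computes, applied to toy numbers. The function names `masked_mean` and `l2_normalize` and the toy values are illustrative assumptions, not part of the model's code.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average token vectors over positions where the mask is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / count for value in total]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

# Toy "hidden states" for one text: three tokens, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 2.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)   # the padding token does not contribute
embedding = l2_normalize(pooled)     # unit-length sentence embedding
```

Because the embeddings are unit length, the dot product between two of them equals their cosine similarity; multiplying by 100 in the example above only rescales the scores.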

⚠️ Attention ⚠️

Each input text should start with "query: " or "passage: ". For tasks other than retrieval, you can simply use the "query: " prefix.

Training Details

We aimed to follow the bge-base-en model training algorithm, but we made several improvements along the way.

Initialization: deepvk/deberta-v1-base

First-stage: Contrastive pre-training with weak supervision on the Russian part of mMarco corpus.

Second-stage: Supervised fine-tuning of two different models, split by data symmetry, which are then merged via LM-Cocktail:

We modified the instruction design by simplifying the multilingual approach to facilitate easier inference. For symmetric data (S1, S2), we used the instructions: "query: S1" and "query: S2", and for asymmetric data, we used "query: S1" with "passage: S2".
Since we split the data, we could additionally apply the AnglE loss to the symmetric model, which enhances performance on symmetric tasks.
Finally, we combined the two models, tuning the weights for the merger using LM-Cocktail to produce the final model, USER.
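
The merging step can be pictured as a weighted average of the two models' parameters. The sketch below is a simplified illustration under that assumption, not the LM-Cocktail API: `merge_weights`, the toy parameter dictionaries, and the 0.5/0.5 weights are hypothetical, and the weights actually used for USER are not stated here.

```python
def merge_weights(state_dicts, weights):
    """Return the weighted average of several parameter dictionaries.

    Each dictionary maps a parameter name to a flat list of floats.
    All dictionaries must share the same names and shapes.
    """
    if len(state_dicts) != len(weights):
        raise ValueError("need one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    merged = {}
    for name in state_dicts[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [
            sum(w * v for w, v in zip(weights, values)) for values in columns
        ]
    return merged

# Toy stand-ins for the symmetric and asymmetric fine-tuned models.
symmetric_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
asymmetric_model = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = merge_weights([symmetric_model, asymmetric_model], [0.5, 0.5])
```

In practice the merge weights would be tuned on held-out data rather than fixed at equal values.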

Dataset

During model development, we additionally collected two datasets: deepvk/ru-HNP and deepvk/ru-WANLI.

Symmetric Dataset	Size	Asymmetric Dataset	Size
AllNLI	282 644	MIRACL	10 000
MedNLI	3 699	MLDR	1 864
RCB	392	Lenta	185 972
Terra	1 359	Mlsum	51 112
Tapaco	91 240	Mr-TyDi	536 600
Opus100	1 000 000	Panorama	11 024
BiblePar	62 195	PravoIsrael	26 364
RudetoxifierDataDetox	31 407	Xlsum	124 486
RuParadetox	11 090	Fialka-v1	130 000
deepvk/ru-WANLI	35 455	RussianKeywords	16 461
deepvk/ru-HNP	500 000	Gazeta	121 928
		Gsm8k-ru	7 470
		DSumRu	27 191
		SummDialogNews	75 700

Total positive pairs: 3,352,653
Total negative pairs: 792,644 (negative pairs from AllNLI, MIRACL, deepvk/ru-WANLI, deepvk/ru-HNP)

For all labeled datasets, we use only their training sets for fine-tuning. For the Gazeta, Mlsum, and Xlsum datasets, the (title/text) and (title/summary) pairs are combined and used as asymmetric data.
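
As an illustration of how such pairs might be assembled, the sketch below turns one summarization record into two asymmetric training pairs with the "query: " and "passage: " prefixes. The record fields (`title`, `text`, `summary`), the choice of the title as the query side, and the helper `build_asymmetric_pairs` are assumptions made for the example; the actual preprocessing code is not shown in this card.

```python
def build_asymmetric_pairs(record):
    """Build (query, passage) pairs from one news record.

    Assumes the title plays the role of the query and the longer
    text or summary plays the role of the passage.
    """
    query = "query: " + record["title"]
    return [
        (query, "passage: " + record["text"]),
        (query, "passage: " + record["summary"]),
    ]

# Hypothetical record; real datasets may use different field names.
record = {
    "title": "Нефть в Удмуртии",
    "text": "Нефтепоисковые работы в Удмуртии были начаты в 1945 году.",
    "summary": "Добыча нефти началась в 1967 году.",
}

pairs = build_asymmetric_pairs(record)
```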

AllNLI is a Russian translation of the combination of SNLI, MNLI, and ANLI.

Experiments

As baselines, we chose the current top models from the encodechka leaderboard. In addition, we evaluated the models on the Russian subset of MTEB, which includes 10 tasks. Unfortunately, we could not validate bge-m3 on some MTEB tasks, specifically clustering, because of the excessive computational resources required. Besides these two benchmarks, we also evaluated the models on MIRACL. All experiments were conducted on an NVIDIA Tesla A100 40 GB GPU. We used the validation scripts from the official repositories for each task.

Model	Size (w/o Embeddings)	Encodechka (Mean S)	MTEB (Mean Ru)	Miracl (Recall@100)
`bge-m3`	303	0.786	0.694	0.959
`multilingual-e5-large`	303	0.78	0.665	0.927
`USER` (this model)	85	0.772	0.666	0.763
`paraphrase-multilingual-mpnet-base-v2`	85	0.76	0.625	0.149
`multilingual-e5-base`	85	0.756	0.645	0.915
`LaBSE-en-ru`	85	0.74	0.599	0.327
`sn-xlm-roberta-base-snli-mnli-anli-xnli`	85	0.74	0.593	0.08

Model sizes are shown in millions of parameters, with larger models visually distinct from the others. Absolute leaders in each metric are highlighted in bold, and the leaders among models of our size are underlined.

In this way, our solution outperforms all other models of the same size on both Encodechka and MTEB. Given that the model is slightly underperforming in retrieval tasks relative to existing solutions, we aim to address this in our future research.
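
For reference, Recall@k as reported in the MIRACL column is commonly computed per query as the share of relevant documents that appear among the top k retrieved results, and then averaged over queries. The sketch below follows that common definition; the function name and the toy document IDs are illustrative, and the official evaluation scripts may differ in details.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top-k ranked results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)

# Toy example: two of the three relevant documents are in the top 3.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d9", "d1"}

score = recall_at_k(ranked, relevant, k=3)
```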

FAQ

Do I need to add the prefixes "query: " and "passage: " to input texts?

Yes. This is how the model was trained; without the prefixes, you will see a performance degradation. Here are some rules of thumb:

Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA, ad-hoc information retrieval.
Use "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, paraphrase retrieval.
Use "query: " prefix if you want to use embeddings as features, such as linear probing classification, clustering.
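
These rules can be wrapped in a small helper so the prefix is never forgotten. The function `add_prefix` and its `role` argument are hypothetical conveniences, not part of sentence-transformers or this model's API.

```python
def add_prefix(texts, role="query"):
    """Prepend the prefix the model was trained with.

    Use role="passage" only for the document side of asymmetric retrieval;
    everything else (similarity, clustering, classification) uses "query".
    """
    if role not in ("query", "passage"):
        raise ValueError('role must be "query" or "passage"')
    prefix = role + ": "
    # Avoid double-prefixing texts that are already prepared.
    return [t if t.startswith(prefix) else prefix + t for t in texts]

queries = add_prefix(["Есть ли нефть в Удмуртии?"])
passages = add_prefix(["Добыча нефти началась в 1967 году."], role="passage")
```

The resulting lists can be passed to `model.encode(...)` as in the Usage section.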

Citations

@misc{deepvk2024user,
    title={USER: Universal Sentence Encoder for Russian},
    author={Malashenko, Boris and Zemerov, Anton and Spirin, Egor},
    url={https://huggingface.co/deepvk/USER-base},
    publisher={Hugging Face},
    year={2024},
}