ai-forever committed • Commit 3248018 • Parent(s): bbe14ec
Update README.md

README.md CHANGED
@@ -4,55 +4,111 @@ language:

Previous version:

- ru
- en
tags:
-
-
---

# …

Russian …

For …

```python
from transformers import AutoTokenizer, AutoModel
import torch

# You can create embeddings in two ways: CLS token embedding or mean pooling.
# Choose the pooling that gives the best quality on your downstream task.

# Mean pooling example - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

# Sentences we want sentence embeddings for
sentences = ['Привет! Как твои дела?',
             'А правда, что 42 твое любимое число?']

# Load the tokenizer and model from the Hugging Face model repository
tokenizer = AutoTokenizer.from_pretrained("ai-forever/ru-en-RoSBERTa")
model = AutoModel.from_pretrained("ai-forever/ru-en-RoSBERTa")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Mean pooling
sentence_mean_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# CLS pooling
last_hidden_states = model_output[0]
sentence_cls_embeddings = last_hidden_states[:,0]
```

Updated version:

- ru
- en
tags:
- transformers
- sentence-transformers
---

# Model Card for ru-en-RoSBERTa

ru-en-RoSBERTa is a general-purpose text embedding model for Russian. The model is based on [ruRoBERTa](https://huggingface.co/ai-forever/ruRoberta-large) and fine-tuned on ~4M pairs of supervised, synthetic and unsupervised data in Russian and English. The tokenizer supports some English tokens taken from the [RoBERTa](https://huggingface.co/FacebookAI/roberta-large) tokenizer.

For more model details please refer to our [article](arxiv).

## Usage

The model can be used as is with prefixes. It is recommended to use CLS pooling. The choice of prefix and pooling depends on the task.

We use the following basic rules to choose a prefix (a small selection sketch follows the list):
- `"search_query: "` and `"search_document: "` prefixes are for answer or relevant paragraph retrieval
- `"classification: "` prefix is for symmetric paraphrasing related tasks (STS, NLI, Bitext Mining)
- `"clustering: "` prefix is for any tasks that rely on thematic features (topic classification, title-body retrieval)
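
As a rough illustration of these rules, the sketch below maps a task type to its prefix and prepends it before encoding; the task names and the `add_prefix` helper are hypothetical and not part of the model's API.

```python
# Hypothetical helper: map a task type to the prefix expected by ru-en-RoSBERTa.
TASK_PREFIXES = {
    "retrieval_query": "search_query: ",
    "retrieval_document": "search_document: ",
    "paraphrase": "classification: ",  # STS, NLI, Bitext Mining
    "thematic": "clustering: ",        # topic classification, title-body retrieval
}

def add_prefix(text: str, task: str) -> str:
    """Prepend the task prefix before encoding the text."""
    return TASK_PREFIXES[task] + text

print(add_prefix("Сколько программистов нужно, чтобы вкрутить лампочку?", "retrieval_query"))
# search_query: Сколько программистов нужно, чтобы вкрутить лампочку?
```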

To better tailor the model to your needs, you can fine-tune it with relevant high-quality Russian and English datasets.
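
The card does not prescribe a fine-tuning recipe. As one possibility, a contrastive setup with the classic `sentence-transformers` training API might look roughly like the sketch below; the example pairs, loss and hyperparameters are illustrative assumptions, not the authors' actual training setup.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("ai-forever/ru-en-RoSBERTa")

# Illustrative prefixed pairs; replace with your own in-domain data.
train_examples = [
    InputExample(texts=["search_query: как оформить загранпаспорт",
                        "search_document: Загранпаспорт можно оформить через Госуслуги или МФЦ."]),
    InputExample(texts=["classification: Сегодня отличная погода",
                        "classification: Погода сегодня прекрасная"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```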

Below are examples of encoding texts with the Transformers and SentenceTransformers libraries.

### Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def pool(hidden_state, mask, pooling_method="cls"):
    if pooling_method == "mean":
        s = torch.sum(hidden_state * mask.unsqueeze(-1).float(), dim=1)
        d = mask.sum(axis=1, keepdim=True).float()
        return s / d
    elif pooling_method == "cls":
        return hidden_state[:, 0]

inputs = [
    #
    "classification: Он нам и <unk> не нужон ваш Интернет!",
    "clustering: В Ярославской области разрешили работу бань, но без посетителей",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",

    #
    "classification: What a time to be alive!",
    "clustering: Ярославским баням разрешили работать без посетителей",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.",
]

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ru-en-RoSBERTa")
model = AutoModel.from_pretrained("ai-forever/ru-en-RoSBERTa")

tokenized_inputs = tokenizer(inputs, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokenized_inputs)

embeddings = pool(
    outputs.last_hidden_state,
    tokenized_inputs["attention_mask"],
    pooling_method="cls"  # or try "mean"
)

embeddings = F.normalize(embeddings, p=2, dim=1)

sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
# [0.4796873927116394, 0.9409002065658569, 0.7761015892028809]
```

### SentenceTransformers

```python
from sentence_transformers import SentenceTransformer


inputs = [
    #
    "classification: Он нам и <unk> не нужон ваш Интернет!",
    "clustering: В Ярославской области разрешили работу бань, но без посетителей",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",

    #
    "classification: What a time to be alive!",
    "clustering: Ярославским баням разрешили работать без посетителей",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.",
]

# loads the model with CLS pooling
model = SentenceTransformer("ai-forever/ru-en-RoSBERTa")

# embeddings are normalized by default
embeddings = model.encode(inputs, convert_to_tensor=True)

sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
# [0.47968706488609314, 0.940900444984436, 0.7761018872261047]
```

## Citation

TODO

## Limitations

The model is designed to process texts in Russian; quality on English texts is unknown. The maximum input length is limited to 512 tokens.
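
As a quick check of the length limit (an assumed usage pattern, not something the card specifies), the tokenizer simply truncates longer inputs when called with `truncation=True` and `max_length=512`, as in the examples above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ru-en-RoSBERTa")

long_text = "clustering: " + "очень длинный текст " * 1000
ids = tokenizer(long_text, truncation=True, max_length=512)["input_ids"]
print(len(ids))  # 512: tokens beyond the limit are discarded
```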