README.md · cointegrated/rubert-tiny2 at 6b38fe0c3c128908a1f86cc74e9cedda1514ac60

language:
  - ru
tags:
  - russian
  - fill-mask
  - pretraining
  - embeddings
  - masked-lm
  - tiny
  - feature-extraction
  - sentence-similarity
license: mit
widget:
  - text: Миниатюрная модель для [MASK] разных задач.

This is an updated version of cointegrated/rubert-tiny: a small Russian BERT-based encoder with high-quality sentence embeddings. This post in Russian gives more details.

The differences from the previous version include:

a larger vocabulary: 83828 tokens instead of 29564;
longer supported sequences: 2048 tokens instead of 512;
sentence embeddings that approximate LaBSE more closely than before;
meaningful segment embeddings (tuned on the NLI task);
a focus on Russian only.

The model should be used as is to produce sentence embeddings (e.g. for KNN classification of short texts) or fine-tuned for a downstream task.

Sentence embeddings can be produced as follows:

# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
# model.cuda()  # uncomment it if you have a GPU

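def embed_bert_cls(text, model, tokenizer):
    """Return the L2-normalized [CLS] embedding of `text` as a NumPy array."""
    # Tokenize the input and build PyTorch tensors
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    # Inference only, so gradient tracking is disabled
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Use the hidden state of the first ([CLS]) token as the sentence embedding
    embeddings = model_output.last_hidden_state[:, 0, :]
    # Scale each embedding to unit length
    embeddings = torch.nn.functional.normalize(embeddings)
    # Return the embedding of the first (or only) input text
    return embeddings[0].cpu().numpy()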

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
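
The KNN classification of short texts mentioned above could look roughly like the sketch below. It is an illustration only, not part of the model's API: the function name `knn_predict`, the labels, and the 2-dimensional placeholder vectors are all hypothetical and stand in for real 312-dimensional embeddings. Because the embeddings are L2-normalized, a dot product between two of them equals their cosine similarity.

```python
import numpy as np
from collections import Counter


def knn_predict(query, train_embeddings, train_labels, k=3):
    """Return the majority label among the k training vectors most similar to `query`."""
    # All vectors are assumed to be L2-normalized, so a dot product is a cosine similarity.
    similarities = train_embeddings @ query
    # Indices of the k largest similarities, most similar first.
    top_k = np.argsort(-similarities)[:k]
    votes = Counter(train_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]


# Hypothetical unit-length placeholder vectors and labels (not real model output).
train_embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-0.6, 0.8]])
train_labels = ["greeting", "greeting", "farewell", "farewell"]
query = np.array([0.6, 0.8])

predicted = knn_predict(query, train_embeddings, train_labels, k=3)
print(predicted)
```

With the real model, `train_embeddings` would presumably be built by stacking the outputs of `embed_bert_cls` for each labeled text (for example with `np.stack`), and `query` would be the embedding of the text to classify.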