---
license: apache-2.0
language:
- ru
- en
library_name: transformers
pipeline_tag: feature-extraction
---
# RoBERTa-base
A pretrained bidirectional encoder for the Russian language.
The model was trained with the standard masked language modeling (MLM) objective on large text corpora, including open social data.
See the `Training Details` section for more information.
⚠️ This model contains only the encoder, without any pretrained head.
- **Developed by:** [deepvk](https://vk.com/deepvk)
- **Model type:** RoBERTa
- **Languages:** Mostly Russian, with a small fraction of other languages
- **License:** Apache 2.0
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the bare encoder (no task-specific head).
tokenizer = AutoTokenizer.from_pretrained("deepvk/roberta-base")
model = AutoModel.from_pretrained("deepvk/roberta-base")

text = "Привет, мир!"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state: [1, seq_len, 768]
```
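Since the model is published as a feature extractor, a common way to obtain a single sentence embedding is to mean-pool the token representations. Below is a minimal sketch building on the snippet above; the pooling strategy is an assumption of this example, not a recommendation from the card.

```python
import torch

# Mean-pool the token embeddings into a single sentence vector,
# ignoring padding positions via the attention mask.
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state         # [1, seq_len, 768]
mask = inputs["attention_mask"].unsqueeze(-1).float()   # [1, seq_len, 1]
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(embedding.shape)  # torch.Size([1, 768])
```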
## Training Details
### Training Data
500 GB of raw text in total.
A mix of the following sources: Wikipedia, books, Twitter comments, Pikabu, Proza.ru, film subtitles, news websites, and a social corpus.
### Training Hyperparameters
| Argument | Value |
|--------------------|----------------------|
| Training regime | fp16 mixed precision |
| Training framework | Fairseq |
| Optimizer | Adam |
| Adam betas | 0.9,0.98 |
| Adam eps | 1e-6 |
| Num training steps | 500k |
The model was trained on a machine with 8×A100 GPUs for approximately 22 days.
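For reference, the Adam settings above roughly correspond to the following PyTorch optimizer configuration. This is only a sketch: the learning rate is an illustrative assumption, since the card does not list it, and the original training used Fairseq rather than this exact call.

```python
import torch

# Illustrative optimizer setup matching the betas/eps from the table above.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,            # assumed value, not specified in this card
    betas=(0.9, 0.98),  # Adam betas from the table
    eps=1e-6,           # Adam eps from the table
)
```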
### Architecture details
| Argument | Value |
|-------------------------|----------------|
|Encoder layers | 12 |
|Encoder attention heads | 12 |
|Encoder embed dim | 768 |
|Encoder ffn embed dim | 3,072 |
|Activation function | GeLU |
|Attention dropout | 0.1 |
|Dropout | 0.1 |
|Max positions | 512 |
|Vocab size               | 50,266         |
|Tokenizer type | Byte-level BPE |
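These values can be cross-checked against the published checkpoint's Hugging Face config (attribute names follow the standard `RobertaConfig`); the printed numbers should match the table above, assuming the exported config mirrors it.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepvk/roberta-base")
print(config.num_hidden_layers)    # expected 12, per the table
print(config.num_attention_heads)  # expected 12
print(config.hidden_size)          # expected 768
print(config.intermediate_size)    # expected 3072
print(config.vocab_size)           # expected 50266
```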
## Evaluation
We evaluated the model on the [Russian SuperGLUE](https://russiansuperglue.com/) dev set.
The best result for each task is marked in bold.
All models are of the same size, except for the distilled version of DeBERTa.
| Model | RCB | PARus | MuSeRC | TERRa | RUSSE | RWSD | DaNetQA | Score |
|------------------------------------------------------------------------|-----------|--------|---------|-------|---------|---------|---------|-----------|
| [vk-deberta-distill](https://huggingface.co/deepvk/deberta-v1-distill) | 0.433 | 0.56 | 0.625 | 0.59 | 0.943 | 0.569 | 0.726 | 0.635 |
| [vk-roberta-base](https://huggingface.co/deepvk/roberta-base) | 0.46 | 0.56 | 0.679 | 0.769 | 0.960 | 0.569 | 0.658 | 0.665 |
| [vk-deberta-base](https://huggingface.co/deepvk/deberta-v1-base) | 0.450 |**0.61**|**0.722**| 0.704 | 0.948 | 0.578 |**0.76** |**0.682** |
| [vk-bert-base](https://huggingface.co/deepvk/bert-base-uncased) | 0.467 | 0.57 | 0.587 | 0.704 | 0.953 |**0.583**| 0.737 | 0.657 |
| [sber-bert-base](https://huggingface.co/ai-forever/ruBert-base) | **0.491** |**0.61**| 0.663 | 0.769 |**0.962**| 0.574 | 0.678 | 0.678 | |