DeBERTa-distill
Pretrained bidirectional encoder for the Russian language.
The model was trained using the standard MLM objective on large text corpora, including open social data.
See the Training Details section for more information.
⚠️ This model contains only the encoder part without any pretrained head.
- Developed by: deepvk
- Model type: DeBERTa
- Languages: Mostly Russian, with a small fraction of other languages
- License: Apache 2.0
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("deepvk/deberta-v1-distill")
model = AutoModel.from_pretrained("deepvk/deberta-v1-distill")
text = "Привет, мир!"
inputs = tokenizer(text, return_tensors='pt')
predictions = model(**inputs)
Training Details
Training Data
400 GB of filtered and deduplicated texts in total. A mix of the following data: Wikipedia, Books, Twitter comments, Pikabu, Proza.ru, Film subtitles, News websites, and Social corpus.
Deduplication procedure
- Compute shingles of size 5
- Compute MinHash with 100 seeds → every sample (text) gets a signature of 100 hash values
- Split every signature into 10 buckets → each bucket, which contains (100 / 10) = 10 numbers, gets hashed into 1 hash → we have 10 hashes for every sample
- For each bucket, find duplicates: find samples that share the same hash → calculate pairwise Jaccard similarity → if the similarity is > 0.7, then it's a duplicate
- Gather duplicates from all the buckets and filter
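The steps above can be sketched in plain Python. This is a minimal illustration and not the pipeline that was actually used: character-level shingles, the BLAKE2b-based seeded hash, and all function names are assumptions, since the card does not specify them. It returns duplicate pairs and leaves the final filtering policy (which copy to keep) open, as the card does.

```python
import hashlib
from itertools import combinations

SHINGLE_SIZE = 5    # from the card
NUM_SEEDS = 100     # from the card
NUM_BUCKETS = 10    # from the card
THRESHOLD = 0.7     # from the card


def shingles(text, size=SHINGLE_SIZE):
    # Assumption: shingles are taken over characters, not tokens.
    if len(text) <= size:
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def seeded_hash(value, seed):
    # Assumption: any seeded 64-bit hash works; BLAKE2b with a salt is used here.
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(shingle_set, num_seeds=NUM_SEEDS):
    # One minimum per seed gives a signature of length num_seeds.
    return [min(seeded_hash(s, seed) for s in shingle_set) for seed in range(num_seeds)]


def bucket_hashes(signature, num_buckets=NUM_BUCKETS):
    # Split the signature into equal buckets and hash each bucket into one value.
    rows = len(signature) // num_buckets
    return [hash(tuple(signature[i * rows:(i + 1) * rows])) for i in range(num_buckets)]


def jaccard(a, b):
    return len(a & b) / len(a | b)


def find_duplicates(texts):
    sets = [shingles(t) for t in texts]
    buckets = [bucket_hashes(minhash(s)) for s in sets]
    duplicates = set()
    for b in range(NUM_BUCKETS):
        # Group samples whose b-th bucket hash collides.
        groups = {}
        for idx, sample_buckets in enumerate(buckets):
            groups.setdefault(sample_buckets[b], []).append(idx)
        # Confirm candidates with the exact Jaccard similarity.
        for members in groups.values():
            for i, j in combinations(members, 2):
                if jaccard(sets[i], sets[j]) > THRESHOLD:
                    duplicates.add((i, j))
    return duplicates


if __name__ == "__main__":
    corpus = ["Привет, мир! Это тест.", "Привет, мир! Это тест.", "Совсем другой текст"]
    print(find_duplicates(corpus))
```

Bucketing keeps the expensive pairwise Jaccard check limited to samples that already collide in at least one bucket, which is what makes the approach practical on a corpus of this size.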
Training Hyperparameters
Argument | Value |
---|---|
Training regime | fp16 mixed precision |
Optimizer | AdamW |
Adam betas | 0.9, 0.98 |
Adam eps | 1e-6 |
Weight decay | 1e-2 |
Batch size | 3840 |
Num training steps | 100k |
Num warm-up steps | 5k |
LR scheduler | Cosine |
LR | 5e-4 |
Gradient norm | 1.0 |
The model was trained on a machine with 8×A100 GPUs for approximately 15 days.
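To make the learning-rate settings in the table concrete, here is a small sketch of a cosine schedule with warm-up. The linear shape of the warm-up and the decay to zero at the last step are assumptions; the card only gives the scheduler type, the peak LR, and the step counts. The function name `learning_rate` is hypothetical.

```python
import math

PEAK_LR = 5e-4          # "LR" in the table
WARMUP_STEPS = 5_000    # "Num warm-up steps"
TOTAL_STEPS = 100_000   # "Num training steps"


def learning_rate(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS,
                  total_steps=TOTAL_STEPS, min_lr=0.0):
    # Assumption: linear warm-up from 0 to the peak learning rate.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Assumption: cosine decay from the peak down to min_lr at total_steps.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 2_500, 5_000, 52_500, 100_000):
        print(step, learning_rate(step))
```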
Architecture details
Argument | Value |
---|---|
Encoder layers | 6 |
Encoder attention heads | 12 |
Encoder embed dim | 768 |
Encoder ffn embed dim | 3072 |
Activation function | GELU |
Attention dropout | 0.1 |
Dropout | 0.1 |
Max positions | 512 |
Vocab size | 50266 |
Tokenizer type | Byte-level BPE |
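As a rough sanity check on model size, the table values can be plugged into a back-of-the-envelope parameter count for a generic Transformer encoder. This is only an approximation: it ignores DeBERTa-specific relative-position parameters and any embedding normalization, and the helper names are hypothetical.

```python
VOCAB_SIZE = 50266   # from the table
EMBED_DIM = 768      # from the table
FFN_DIM = 3072       # from the table
NUM_LAYERS = 6       # from the table


def linear_params(n_in, n_out):
    # Weight matrix plus bias vector.
    return n_in * n_out + n_out


def estimate_parameters(vocab=VOCAB_SIZE, dim=EMBED_DIM, ffn=FFN_DIM, layers=NUM_LAYERS):
    embeddings = vocab * dim
    # Assumption: query, key, value, and output projections, each with a bias.
    attention = 4 * linear_params(dim, dim)
    feed_forward = linear_params(dim, ffn) + linear_params(ffn, dim)
    # Two layer norms per layer, each with a scale and a shift vector.
    layer_norms = 2 * 2 * dim
    per_layer = attention + feed_forward + layer_norms
    return {
        "embeddings": embeddings,
        "per_layer": per_layer,
        "total": embeddings + layers * per_layer,
    }


if __name__ == "__main__":
    for name, count in estimate_parameters().items():
        print(f"{name}: {count:,}")
```

With a vocabulary of this size, the token embedding matrix accounts for a large share of the estimate, so halving the number of layers does not halve the total.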
Distillation
In our distillation procedure, we follow Sanh et al. The student is initialized from the teacher by taking only every second layer. We combine the MLM loss and the CE loss, each weighted with a coefficient of 0.5.
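The two ingredients described here, layer selection and the combined loss, can be sketched for a single masked position in plain Python. Several details are assumptions: that the teacher has 12 layers (implied by the 6-layer student), that selection starts at layer index 0, that the CE term is the cross-entropy between the teacher's and the student's output distributions as in Sanh et al., and that the temperature is 1. All function names are hypothetical.

```python
import math


def select_student_layers(num_teacher_layers, stride=2, offset=0):
    # Assumption: "every second layer" starts at index 0; the card does not say.
    return list(range(offset, num_teacher_layers, stride))


def log_softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_logits, teacher_logits, target_id,
                      mlm_weight=0.5, ce_weight=0.5, temperature=1.0):
    # MLM term: negative log-likelihood of the true token under the student.
    mlm_loss = -log_softmax(student_logits)[target_id]
    # CE term (assumption): cross-entropy of the student against the
    # teacher's soft distribution over the vocabulary.
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    ce_loss = -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))
    return mlm_weight * mlm_loss + ce_weight * ce_loss


if __name__ == "__main__":
    print(select_student_layers(12))
    print(distillation_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.5], target_id=0))
```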
Evaluation
We evaluated the model on the Russian SuperGLUE dev set. The best result in each task is marked in bold. All models have the same size except the distilled version of DeBERTa.
Model | RCB | PARus | MuSeRC | TERRa | RUSSE | RWSD | DaNetQA | Score |
---|---|---|---|---|---|---|---|---|
vk-deberta-distill | 0.433 | 0.56 | 0.625 | 0.59 | 0.943 | 0.569 | 0.726 | 0.635 |
vk-roberta-base | 0.46 | 0.56 | 0.679 | **0.769** | 0.960 | 0.569 | 0.658 | 0.665 |
vk-deberta-base | 0.450 | **0.61** | **0.722** | 0.704 | 0.948 | 0.578 | **0.76** | **0.682** |
vk-bert-base | 0.467 | 0.57 | 0.587 | 0.704 | 0.953 | **0.583** | 0.737 | 0.657 |
sber-bert-base | **0.491** | **0.61** | 0.663 | **0.769** | **0.962** | 0.574 | 0.678 | 0.678 |
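The card does not state how the Score column is computed. The reported values are consistent with an unweighted mean over the seven tasks, which the following short check illustrates; treat that reading as an assumption rather than the official definition. The dictionary layout and the `mean_score` helper are hypothetical.

```python
# Per-task scores in table order (RCB, PARus, MuSeRC, TERRa, RUSSE, RWSD, DaNetQA)
# paired with the reported Score, copied from the table above.
RESULTS = {
    "vk-deberta-distill": ([0.433, 0.56, 0.625, 0.59, 0.943, 0.569, 0.726], 0.635),
    "vk-roberta-base": ([0.46, 0.56, 0.679, 0.769, 0.960, 0.569, 0.658], 0.665),
    "vk-deberta-base": ([0.450, 0.61, 0.722, 0.704, 0.948, 0.578, 0.76], 0.682),
    "vk-bert-base": ([0.467, 0.57, 0.587, 0.704, 0.953, 0.583, 0.737], 0.657),
    "sber-bert-base": ([0.491, 0.61, 0.663, 0.769, 0.962, 0.574, 0.678], 0.678),
}


def mean_score(task_scores):
    # Assumption: Score is the unweighted mean of the per-task scores.
    return sum(task_scores) / len(task_scores)


if __name__ == "__main__":
    for model, (tasks, reported) in RESULTS.items():
        print(f"{model}: mean={mean_score(tasks):.3f} reported={reported:.3f}")
```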