metadata

language:
  - ru

distilrubert-tiny-cased-conversational

Conversational DistilRuBERT-tiny (Russian, cased, 2‑layer, 768‑hidden, 12‑heads, 107M parameters) was trained on OpenSubtitles[1], Dirty, Pikabu, and a Social Media segment of the Taiga corpus[2], the same data as Conversational RuBERT. It can be considered a tiny copy of our Conversational DistilRuBERT-base.

Our DistilRuBERT-tiny was strongly inspired by [3] and [4]. Specifically, we used the following losses:

KL loss (between teacher and student output logits)
MLM loss (between token labels and student output logits)
Cosine embedding loss (between the mean of six consecutive hidden states from the teacher's encoder and one hidden state of the student)
MSE loss (between six consecutive attention maps from the teacher's encoder and one attention map of the student)
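
To make the four terms concrete, here is a minimal pure-Python sketch on toy data. It is not the training code. It assumes a 12-layer teacher (inferred from two student layers, each matched to six teacher layers), equal weights for the four terms, a temperature of 1, and that the six teacher attention maps are averaged before the MSE is taken. All function names and numbers are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over a list of logits.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_loss(teacher_logits, student_logits, temperature=1.0):
    # KL(teacher || student) over the vocabulary for one token position.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mlm_loss(student_logits, label):
    # Cross-entropy between the student's prediction and the true token id.
    return -math.log(softmax(student_logits)[label])

def mean_vectors(vectors):
    # Element-wise mean of equally sized vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine_loss(a, b):
    # 1 - cosine similarity, so that aligned vectors give a loss of 0.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def mse_loss(a, b):
    # Mean squared error between two equally sized vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def group_teacher_layers(layers, group_size=6):
    # Split teacher layers into consecutive groups, one group per student layer.
    return [layers[i:i + group_size] for i in range(0, len(layers), group_size)]

# Toy inputs (made up): one token position, vocabulary of 3, hidden size 3,
# attention maps flattened to 4 values, 12 teacher layers, 2 student layers.
teacher_logits, student_logits, label = [2.0, 0.5, -1.0], [1.5, 0.7, -0.5], 0
teacher_hidden = [[0.1 * (layer + 1), 0.2, -0.1 * layer] for layer in range(12)]
student_hidden = [[0.3, 0.2, -0.2], [0.9, 0.2, -0.8]]
teacher_attn = [[0.5 + 0.01 * layer, 0.5 - 0.01 * layer, 0.4, 0.6] for layer in range(12)]
student_attn = [[0.5, 0.5, 0.4, 0.6], [0.6, 0.4, 0.4, 0.6]]

hidden_term = sum(
    cosine_loss(mean_vectors(group), student)
    for group, student in zip(group_teacher_layers(teacher_hidden), student_hidden)
) / len(student_hidden)
attn_term = sum(
    mse_loss(mean_vectors(group), student)  # assumption: average the six maps
    for group, student in zip(group_teacher_layers(teacher_attn), student_attn)
) / len(student_attn)

# Assumption: the four terms are summed with equal weights.
total = kl_loss(teacher_logits, student_logits) + mlm_loss(student_logits, label) + hidden_term + attn_term
print(f"total distillation loss: {total:.4f}")
```

In a real setup these terms would be computed on batched tensors over all masked positions, and their relative weights would be tuned as hyperparameters.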

[1]: P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

[2]: Shavrina T., Shapovalova O. (2017) To the Methodology of Corpus Construction for Machine Learning: «Taiga» Syntax Tree Corpus and Parser. In Proceedings of the “CORPORA2017” International Conference, Saint Petersburg, 2017.

[3]: Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

[4]: https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation