metadata

language:
  - ru
  - uk
  - be
  - kk
  - az
  - hy
  - ka
  - he
  - en
  - de
tags:
  - language classification
datasets:
  - open_subtitles
  - tatoeba
  - oscar

RoBERTa for Single Language Classification

Training

RoBERTa was fine-tuned on small subsets of the Open Subtitles, OSCAR and Tatoeba datasets (~9k samples per language).

data source	language
open_subtitles	ka, he, en, de
oscar	be, kk, az, hy
tatoeba	ru, uk
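
The card does not say how the ~9k samples per language were drawn. As a minimal sketch, assuming the raw data is available as `(text, lang)` pairs, a balanced subset could be built as shown below. The function name `sample_per_language`, its parameters, and the fixed seed are illustrative assumptions, not the author's actual preprocessing.

```python
import random
from collections import defaultdict


def sample_per_language(rows, n_per_lang, seed=0):
    """Return up to n_per_lang (text, lang) pairs for each language.

    Hypothetical helper: `rows` is any iterable of (text, lang) pairs.
    """
    # Group texts by their language label.
    by_lang = defaultdict(list)
    for text, lang in rows:
        by_lang[lang].append(text)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sampled = []
    for lang in sorted(by_lang):
        texts = by_lang[lang]
        # A language with fewer rows than requested contributes all of them.
        k = min(n_per_lang, len(texts))
        sampled.extend((text, lang) for text in rng.sample(texts, k))
    return sampled
```

With `n_per_lang=9000` this would give roughly the training size described above, and a second, disjoint draw of about 1000 per language would correspond to the validation split.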

Validation

The metrics below were obtained on a separate, held-out part of the same datasets (~1k samples per language).

index	class	f1-score	precision	recall	support
0	az	0.998	0.997	1.0	997
1	be	0.996	0.998	0.994	1004
2	de	0.976	0.966	0.987	979
3	en	0.976	0.986	0.967	1020
4	he	1.0	1.0	0.999	1001
5	hy	0.994	0.991	0.998	993
6	ka	0.999	0.999	0.999	1000
7	kk	0.996	0.998	0.993	1005
8	uk	0.982	0.997	0.968	1030
9	ru	0.982	0.968	0.997	971
10	macro avg	0.99	0.99	0.99	10000
11	weighted avg	0.99	0.99	0.99	10000
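
The summary rows can be cross-checked from the per-class rows. The sketch below copies the rounded values from the table, recomputes F1 as the harmonic mean of precision and recall, and derives the macro and support-weighted averages. The names `ROWS` and `f1_from` are illustrative, and because the inputs are already rounded to three decimals, the recomputed F1 can differ from the reported one in the last digit.

```python
# Per-class rows copied from the validation table:
# (class, f1, precision, recall, support)
ROWS = [
    ("az", 0.998, 0.997, 1.0, 997),
    ("be", 0.996, 0.998, 0.994, 1004),
    ("de", 0.976, 0.966, 0.987, 979),
    ("en", 0.976, 0.986, 0.967, 1020),
    ("he", 1.0, 1.0, 0.999, 1001),
    ("hy", 0.994, 0.991, 0.998, 993),
    ("ka", 0.999, 0.999, 0.999, 1000),
    ("kk", 0.996, 0.998, 0.993, 1005),
    ("uk", 0.982, 0.997, 0.968, 1030),
    ("ru", 0.982, 0.968, 0.997, 971),
]


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_support = sum(row[4] for row in ROWS)

# Macro average: every class counts equally.
macro_f1 = sum(row[1] for row in ROWS) / len(ROWS)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(row[1] * row[4] for row in ROWS) / total_support

# Largest gap between the reported F1 and the one recomputed from P and R.
max_gap = max(abs(f1_from(p, r) - f1) for _, f1, p, r, _ in ROWS)

print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}, "
      f"support = {total_support}, max F1 gap = {max_gap:.4f}")
```

Because the classes are almost perfectly balanced (about 1000 samples each), the macro and weighted averages are nearly identical, which is consistent with both summary rows showing 0.99.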