---
language:
- ru
- uk
- be
- kk
- az
- hy
- ka
- he
- en
- de
- multilingual
tags:
- language classification
datasets:
- open_subtitles
- tatoeba
- oscar
---

# RoBERTa for Single Language Classification

## Training

RoBERTa was fine-tuned on small subsets of the Open Subtitles, OSCAR, and Tatoeba datasets (~9k samples per language).

| data source     | languages      |
|-----------------|----------------|
| open_subtitles  | ka, he, en, de |
| oscar           | be, kk, az, hy |
| tatoeba         | ru, uk         |

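
The per-language sampling described above can be sketched as follows. This is a minimal illustration, not the author's preprocessing code: the function name `sample_per_language`, the `(text, lang)` pair format, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_per_language(examples, n_per_language, seed=0):
    """Return at most n_per_language (text, lang) pairs for each language.

    `examples` is an iterable of (text, lang) pairs. This input format is
    an assumption for the sketch, not something the model card specifies.
    """
    rng = random.Random(seed)

    # Group texts by their language label.
    by_lang = defaultdict(list)
    for text, lang in examples:
        by_lang[lang].append(text)

    # Draw a random subset per language, capped at the available amount.
    sampled = []
    for lang in sorted(by_lang):
        texts = by_lang[lang]
        k = min(n_per_language, len(texts))
        sampled.extend((text, lang) for text in rng.sample(texts, k))

    # Shuffle so that languages are interleaved.
    rng.shuffle(sampled)
    return sampled


# Toy usage: 5 Russian and 3 Ukrainian placeholder sentences, capped at 4 each.
toy = [(f"text {i}", "ru") for i in range(5)] + [(f"tekst {i}", "uk") for i in range(3)]
subset = sample_per_language(toy, n_per_language=4)
```

With real data, `n_per_language` would be set to roughly 9000 to match the sample count stated above.
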
## Validation

The metrics below were obtained by validating on a separate, held-out part of the datasets (~1k samples per language).

|class|f1-score|precision|recall|support|
|---|---|---|---|---|
|az|0.998|0.997|1.0|997|
|be|0.996|0.998|0.994|1004|
|de|0.976|0.966|0.987|979|
|en|0.976|0.986|0.967|1020|
|he|1.0|1.0|0.999|1001|
|hy|0.994|0.991|0.998|993|
|ka|0.999|0.999|0.999|1000|
|kk|0.996|0.998|0.993|1005|
|uk|0.982|0.997|0.968|1030|
|ru|0.982|0.968|0.997|971|
|macro avg|0.99|0.99|0.99|10000|
|weighted avg|0.99|0.99|0.99|10000|
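
As a reading aid, the F1 column is the harmonic mean of precision and recall, and the macro average is the unweighted mean over classes. The sketch below recomputes both from the rounded values in the table. Because the inputs are already rounded to three decimals, the recomputed F1 can differ from the reported one in the last digit. The names `REPORTED` and `f1_score` are chosen for this example only.

```python
# (precision, recall, reported F1) per class, copied from the table above.
REPORTED = {
    "az": (0.997, 1.0, 0.998),
    "be": (0.998, 0.994, 0.996),
    "de": (0.966, 0.987, 0.976),
    "en": (0.986, 0.967, 0.976),
    "he": (1.0, 0.999, 1.0),
    "hy": (0.991, 0.998, 0.994),
    "ka": (0.999, 0.999, 0.999),
    "kk": (0.998, 0.993, 0.996),
    "uk": (0.997, 0.968, 0.982),
    "ru": (0.968, 0.997, 0.982),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 recomputed from the rounded precision and recall of each class.
recomputed = {lang: f1_score(p, r) for lang, (p, r, _) in REPORTED.items()}

# Macro average: unweighted mean of the reported per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1 in REPORTED.values()) / len(REPORTED)

for lang, (_, _, reported_f1) in REPORTED.items():
    print(f"{lang}: reported={reported_f1:.3f} recomputed={recomputed[lang]:.3f}")
print(f"macro F1: {macro_f1:.2f}")
```
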