ai-forever commited on
Commit
88eae7a
·
1 Parent(s): 915b59d

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +26 -0
README.md ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ - ru
6
+ tags:
7
+ - gpt3
8
+ - transformers
9
+ ---
10
+ # mGPT 13B
11
+
12
+ Multilingual language model. This model was trained on the **61** languages from **25** language families (see the list below).
13
+
14
+ ## Dataset
15
+
16
+ Model was pretrained on a 600Gb of texts, mostly from MC4 and Wikipedia. Here is the table with number of tokens for each language in the pretraining corpus on a logarithmic scale:
17
+
18
+ <img src="https://i.imgur.com/KSMfVX1.png" width="1024px">
19
+
20
+ ## Languages
21
+
22
+ Afrikaans (af), Arabic (ar), Armenian (hy), Azerbaijani (az), Basque (eu), Bashkir (ba), Belarusian (be), Bengali (bn), Bulgarian (bg), Burmese (my), Buryat (bxr), Chuvash (cv), Danish (da), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Javanese (jv), Kalmyk (xal), Kazakh (kk), Korean (ko), Kyrgyz (ky), Latvian (lv), Lithuanian (lt), Malay (ms), Malayalam (ml), Marathi (mr), Mongolian (mn), Ossetian (os), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Spanish (es), Swedish (sv), Swahili (sw), Tatar (tt), Telugu (te), Thai (th), Turkish (tr), Turkmen (tk), Tuvan (tyv), Ukrainian (uk), Uzbek (uz), Vietnamese (vi), Yakut (sax), Yoruba (yo)
23
+
24
+ ### By language family
25
+
26
+ <table><thead><tr><th>Language Family</th><th>Languages</th></tr></thead><tbody><tr><td>Afro-Asiatic</td><td>Arabic (ar), Hebrew (he)</td></tr><tr><td>Austro-Asiatic</td><td>Vietnamese (vi)</td></tr><tr><td>Austronesian</td><td>Indonesian (id), Javanese (jv), Malay (ms), Tagalog (tl)</td></tr><tr><td>Baltic</td><td>Latvian (lv), Lithuanian (lt)</td></tr><tr><td>Basque</td><td>Basque (eu)</td></tr><tr><td>Dravidian</td><td>Malayalam (ml), Tamil (ta), Telugu (te)</td></tr><tr><td>Indo-European (Armenian)</td><td>Armenian (hy)</td></tr><tr><td>Indo-European (Indo-Aryan)</td><td>Bengali (bn), Marathi (mr), Hindi (hi), Urdu (ur)</td></tr><tr><td>Indo-European (Germanic)</td><td>Afrikaans (af), Danish (da), English (en), German (de), Swedish (sv)</td></tr><tr><td>Indo-European (Romance)</td><td>French (fr), Italian (it), Portuguese (pt), Romanian (ro), Spanish (es)</td></tr><tr><td>Indo-European (Greek)</td><td>Greek (el)</td></tr><tr><td>Indo-European (Iranian)</td><td>Ossetian (os), Tajik (tg), Persian (fa)</td></tr><tr><td>Japonic</td><td>Japanese (ja)</td></tr><tr><td>Kartvelian</td><td>Georgian (ka)</td></tr><tr><td>Koreanic</td><td>Korean (ko)</td></tr><tr><td>Kra-Dai</td><td>Thai (th)</td></tr><tr><td>Mongolic</td><td>Buryat (bxr), Kalmyk (xal), Mongolian (mn)</td></tr><tr><td>Niger-Congo</td><td>Swahili (sw), Yoruba (yo)</td></tr><tr><td>Slavic</td><td>Belarusian (be), Bulgarian (bg), Russian (ru), Ukrainian (uk), Polish (pl)</td></tr><tr><td>Sino-Tibetan</td><td>Burmese (my)</td></tr><tr><td>Turkic (Karluk)</td><td>Uzbek (uz)</td></tr><tr><td>Turkic (Kipchak)</td><td>Bashkir (ba), Kazakh (kk), Kyrgyz (ky), Tatar (tt)</td></tr><tr><td>Turkic (Oghuz)</td><td>Azerbaijani (az), Chuvash (cv), Turkish (tr), Turkmen (tk)</td></tr><tr><td>Turkic (Siberian)</td><td>Tuvan (tyv), Yakut (sax)</td></tr><tr><td>Uralic</td><td>Estonian (et), Finnish (fi), Hungarian (hu)</td></tr></tbody></table>