ai-forever
commited on
Commit
·
88eae7a
1
Parent(s):
915b59d
Create README.md
Browse files
README.md
ADDED
@@ -0,0 +1,26 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
license: mit
|
3 |
+
language:
|
4 |
+
- en
|
5 |
+
- ru
|
6 |
+
tags:
|
7 |
+
- gpt3
|
8 |
+
- transformers
|
9 |
+
---
|
10 |
+
# mGPT 13B
|
11 |
+
|
12 |
+
Multilingual language model. This model was trained on the **61** languages from **25** language families (see the list below).
|
13 |
+
|
14 |
+
## Dataset
|
15 |
+
|
16 |
+
Model was pretrained on a 600Gb of texts, mostly from MC4 and Wikipedia. Here is the table with number of tokens for each language in the pretraining corpus on a logarithmic scale:
|
17 |
+
|
18 |
+
<img src="https://i.imgur.com/KSMfVX1.png" width="1024px">
|
19 |
+
|
20 |
+
## Languages
|
21 |
+
|
22 |
+
Afrikaans (af), Arabic (ar), Armenian (hy), Azerbaijani (az), Basque (eu), Bashkir (ba), Belarusian (be), Bengali (bn), Bulgarian (bg), Burmese (my), Buryat (bxr), Chuvash (cv), Danish (da), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Javanese (jv), Kalmyk (xal), Kazakh (kk), Korean (ko), Kyrgyz (ky), Latvian (lv), Lithuanian (lt), Malay (ms), Malayalam (ml), Marathi (mr), Mongolian (mn), Ossetian (os), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Spanish (es), Swedish (sv), Swahili (sw), Tatar (tt), Telugu (te), Thai (th), Turkish (tr), Turkmen (tk), Tuvan (tyv), Ukrainian (uk), Uzbek (uz), Vietnamese (vi), Yakut (sax), Yoruba (yo)
|
23 |
+
|
24 |
+
### By language family
|
25 |
+
|
26 |
+
<table><thead><tr><th>Language Family</th><th>Languages</th></tr></thead><tbody><tr><td>Afro-Asiatic</td><td>Arabic (ar), Hebrew (he)</td></tr><tr><td>Austro-Asiatic</td><td>Vietnamese (vi)</td></tr><tr><td>Austronesian</td><td>Indonesian (id), Javanese (jv), Malay (ms), Tagalog (tl)</td></tr><tr><td>Baltic</td><td>Latvian (lv), Lithuanian (lt)</td></tr><tr><td>Basque</td><td>Basque (eu)</td></tr><tr><td>Dravidian</td><td>Malayalam (ml), Tamil (ta), Telugu (te)</td></tr><tr><td>Indo-European (Armenian)</td><td>Armenian (hy)</td></tr><tr><td>Indo-European (Indo-Aryan)</td><td>Bengali (bn), Marathi (mr), Hindi (hi), Urdu (ur)</td></tr><tr><td>Indo-European (Germanic)</td><td>Afrikaans (af), Danish (da), English (en), German (de), Swedish (sv)</td></tr><tr><td>Indo-European (Romance)</td><td>French (fr), Italian (it), Portuguese (pt), Romanian (ro), Spanish (es)</td></tr><tr><td>Indo-European (Greek)</td><td>Greek (el)</td></tr><tr><td>Indo-European (Iranian)</td><td>Ossetian (os), Tajik (tg), Persian (fa)</td></tr><tr><td>Japonic</td><td>Japanese (ja)</td></tr><tr><td>Kartvelian</td><td>Georgian (ka)</td></tr><tr><td>Koreanic</td><td>Korean (ko)</td></tr><tr><td>Kra-Dai</td><td>Thai (th)</td></tr><tr><td>Mongolic</td><td>Buryat (bxr), Kalmyk (xal), Mongolian (mn)</td></tr><tr><td>Niger-Congo</td><td>Swahili (sw), Yoruba (yo)</td></tr><tr><td>Slavic</td><td>Belarusian (be), Bulgarian (bg), Russian (ru), Ukrainian (uk), Polish (pl)</td></tr><tr><td>Sino-Tibetan</td><td>Burmese (my)</td></tr><tr><td>Turkic (Karluk)</td><td>Uzbek (uz)</td></tr><tr><td>Turkic (Kipchak)</td><td>Bashkir (ba), Kazakh (kk), Kyrgyz (ky), Tatar (tt)</td></tr><tr><td>Turkic (Oghuz)</td><td>Azerbaijani (az), Chuvash (cv), Turkish (tr), Turkmen (tk)</td></tr><tr><td>Turkic (Siberian)</td><td>Tuvan (tyv), Yakut (sax)</td></tr><tr><td>Uralic</td><td>Estonian (et), Finnish (fi), Hungarian (hu)</td></tr></tbody></table>
|