license: mit
language:
- en
- ru
tags:
- gpt3
- transformers
mGPT 13B
A multilingual language model trained on 61 languages from 25 language families (see the list below).
Dataset
The model was pretrained on 600 GB of text, mostly from mC4 and Wikipedia. The number of tokens for each language in the pretraining corpus is shown below on a logarithmic scale:

Languages
Afrikaans (af), Arabic (ar), Armenian (hy), Azerbaijani (az), Basque (eu), Bashkir (ba), Belarusian (be), Bengali (bn), Bulgarian (bg), Burmese (my), Buryat (bxr), Chuvash (cv), Danish (da), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Javanese (jv), Kalmyk (xal), Kazakh (kk), Korean (ko), Kyrgyz (ky), Latvian (lv), Lithuanian (lt), Malay (ms), Malayalam (ml), Marathi (mr), Mongolian (mn), Ossetian (os), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Spanish (es), Swedish (sv), Swahili (sw), Tagalog (tl), Tajik (tg), Tamil (ta), Tatar (tt), Telugu (te), Thai (th), Turkish (tr), Turkmen (tk), Tuvan (tyv), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Yakut (sax), Yoruba (yo)
By language family
Language Family | Languages |
---|---|
Afro-Asiatic | Arabic (ar), Hebrew (he) |
Austro-Asiatic | Vietnamese (vi) |
Austronesian | Indonesian (id), Javanese (jv), Malay (ms), Tagalog (tl) |
Baltic | Latvian (lv), Lithuanian (lt) |
Basque | Basque (eu) |
Dravidian | Malayalam (ml), Tamil (ta), Telugu (te) |
Indo-European (Armenian) | Armenian (hy) |
Indo-European (Indo-Aryan) | Bengali (bn), Marathi (mr), Hindi (hi), Urdu (ur) |
Indo-European (Germanic) | Afrikaans (af), Danish (da), English (en), German (de), Swedish (sv) |
Indo-European (Romance) | French (fr), Italian (it), Portuguese (pt), Romanian (ro), Spanish (es) |
Indo-European (Greek) | Greek (el) |
Indo-European (Iranian) | Ossetian (os), Tajik (tg), Persian (fa) |
Japonic | Japanese (ja) |
Kartvelian | Georgian (ka) |
Koreanic | Korean (ko) |
Kra-Dai | Thai (th) |
Mongolic | Buryat (bxr), Kalmyk (xal), Mongolian (mn) |
Niger-Congo | Swahili (sw), Yoruba (yo) |
Slavic | Belarusian (be), Bulgarian (bg), Russian (ru), Ukrainian (uk), Polish (pl) |
Sino-Tibetan | Burmese (my) |
Turkic (Karluk) | Uzbek (uz) |
Turkic (Kipchak) | Bashkir (ba), Kazakh (kk), Kyrgyz (ky), Tatar (tt) |
Turkic (Oghuz) | Azerbaijani (az), Chuvash (cv), Turkish (tr), Turkmen (tk) |
Turkic (Siberian) | Tuvan (tyv), Yakut (sax) |
Uralic | Estonian (et), Finnish (fi), Hungarian (hu) |
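
The family table can also be kept as a plain mapping, which makes it easy to check the stated totals (61 languages, 25 families) or to look up the family of a given code. The following is a minimal sketch, not part of the model's tooling: the names `FAMILIES`, `all_codes`, and `family_of` are hypothetical, and the codes are transcribed from the table above exactly as written there.

```python
# Language codes per family, transcribed from the table above.
# FAMILIES, all_codes and family_of are illustrative names, not a model API.
FAMILIES = {
    "Afro-Asiatic": ["ar", "he"],
    "Austro-Asiatic": ["vi"],
    "Austronesian": ["id", "jv", "ms", "tl"],
    "Baltic": ["lv", "lt"],
    "Basque": ["eu"],
    "Dravidian": ["ml", "ta", "te"],
    "Indo-European (Armenian)": ["hy"],
    "Indo-European (Indo-Aryan)": ["bn", "mr", "hi", "ur"],
    "Indo-European (Germanic)": ["af", "da", "en", "de", "sv"],
    "Indo-European (Romance)": ["fr", "it", "pt", "ro", "es"],
    "Indo-European (Greek)": ["el"],
    "Indo-European (Iranian)": ["os", "tg", "fa"],
    "Japonic": ["ja"],
    "Kartvelian": ["ka"],
    "Koreanic": ["ko"],
    "Kra-Dai": ["th"],
    "Mongolic": ["bxr", "xal", "mn"],
    "Niger-Congo": ["sw", "yo"],
    "Slavic": ["be", "bg", "ru", "uk", "pl"],
    "Sino-Tibetan": ["my"],
    "Turkic (Karluk)": ["uz"],
    "Turkic (Kipchak)": ["ba", "kk", "ky", "tt"],
    "Turkic (Oghuz)": ["az", "cv", "tr", "tk"],
    "Turkic (Siberian)": ["tyv", "sax"],
    "Uralic": ["et", "fi", "hu"],
}


def all_codes():
    """Return every language code in the table as a flat list."""
    return [code for codes in FAMILIES.values() for code in codes]


def family_of(code):
    """Return the family name for a language code, or None if it is not listed."""
    for family, codes in FAMILIES.items():
        if code in codes:
            return family
    return None


if __name__ == "__main__":
    print(len(FAMILIES), "families,", len(all_codes()), "languages")
    print("tt ->", family_of("tt"))
```

A lookup such as `family_of("tt")` returns the row label from the table, so the mapping stays a direct mirror of the list rather than an independent classification.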