metadata
license: mit
language:
  - en
  - ru
tags:
  - gpt3
  - transformers

mGPT 13B

mGPT 13B is a multilingual language model trained on 61 languages from 25 language families (see the list below).
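A minimal sketch of loading the model for text generation with the `transformers` library, which appears in the tags. The repository id `ai-forever/mGPT-13B` is inferred from the model page rather than stated in this README, so treat it as an assumption, along with the `generate` helper name and the generation settings. The 13B checkpoint needs substantial memory, so nothing is downloaded until the function is called.

```python
# Assumed Hub repository id (not stated in this README); adjust if it differs.
MODEL_ID = "ai-forever/mGPT-13B"


def generate(prompt: str, max_new_tokens: int = 30) -> str:
    """Load the tokenizer and model, then continue `prompt` (hypothetical helper)."""
    # Imported lazily so that defining this function does not require the library.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Encode the prompt as PyTorch tensors and let the model continue it.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # The decoded string contains the prompt followed by the generated text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("Some prompt in any supported language")` returns the prompt followed by its continuation.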

Dataset

The model was pretrained on 600 GB of text, mostly from mC4 and Wikipedia. The number of tokens for each language in the pretraining corpus is shown on a logarithmic scale:

[Figure: number of tokens per language in the pretraining corpus, logarithmic scale]

Languages

Afrikaans (af), Arabic (ar), Armenian (hy), Azerbaijani (az), Basque (eu), Bashkir (ba), Belarusian (be), Bengali (bn), Bulgarian (bg), Burmese (my), Buryat (bxr), Chuvash (cv), Danish (da), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Javanese (jv), Kalmyk (xal), Kazakh (kk), Korean (ko), Kyrgyz (ky), Latvian (lv), Lithuanian (lt), Malay (ms), Malayalam (ml), Marathi (mr), Mongolian (mn), Ossetian (os), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Spanish (es), Swedish (sv), Swahili (sw), Tagalog (tl), Tajik (tg), Tamil (ta), Tatar (tt), Telugu (te), Thai (th), Turkish (tr), Turkmen (tk), Tuvan (tyv), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Yakut (sax), Yoruba (yo)

By language family

| Language Family | Languages |
|---|---|
| Afro-Asiatic | Arabic (ar), Hebrew (he) |
| Austro-Asiatic | Vietnamese (vi) |
| Austronesian | Indonesian (id), Javanese (jv), Malay (ms), Tagalog (tl) |
| Baltic | Latvian (lv), Lithuanian (lt) |
| Basque | Basque (eu) |
| Dravidian | Malayalam (ml), Tamil (ta), Telugu (te) |
| Indo-European (Armenian) | Armenian (hy) |
| Indo-European (Indo-Aryan) | Bengali (bn), Marathi (mr), Hindi (hi), Urdu (ur) |
| Indo-European (Germanic) | Afrikaans (af), Danish (da), English (en), German (de), Swedish (sv) |
| Indo-European (Romance) | French (fr), Italian (it), Portuguese (pt), Romanian (ro), Spanish (es) |
| Indo-European (Greek) | Greek (el) |
| Indo-European (Iranian) | Ossetian (os), Tajik (tg), Persian (fa) |
| Japonic | Japanese (ja) |
| Kartvelian | Georgian (ka) |
| Koreanic | Korean (ko) |
| Kra-Dai | Thai (th) |
| Mongolic | Buryat (bxr), Kalmyk (xal), Mongolian (mn) |
| Niger-Congo | Swahili (sw), Yoruba (yo) |
| Slavic | Belarusian (be), Bulgarian (bg), Russian (ru), Ukrainian (uk), Polish (pl) |
| Sino-Tibetan | Burmese (my) |
| Turkic (Karluk) | Uzbek (uz) |
| Turkic (Kipchak) | Bashkir (ba), Kazakh (kk), Kyrgyz (ky), Tatar (tt) |
| Turkic (Oghuz) | Azerbaijani (az), Chuvash (cv), Turkish (tr), Turkmen (tk) |
| Turkic (Siberian) | Tuvan (tyv), Yakut (sax) |
| Uralic | Estonian (et), Finnish (fi), Hungarian (hu) |
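
The family table above can also be kept as a small lookup structure, for example to check the stated totals or to find the family of a given language code. This is only an illustrative sketch: the codes are transcribed from the table, while the names `FAMILIES`, `summarize`, and `family_of` are hypothetical and not part of the model or any library.

```python
# Language codes per family, transcribed from the table above.
FAMILIES = {
    "Afro-Asiatic": ["ar", "he"],
    "Austro-Asiatic": ["vi"],
    "Austronesian": ["id", "jv", "ms", "tl"],
    "Baltic": ["lv", "lt"],
    "Basque": ["eu"],
    "Dravidian": ["ml", "ta", "te"],
    "Indo-European (Armenian)": ["hy"],
    "Indo-European (Indo-Aryan)": ["bn", "mr", "hi", "ur"],
    "Indo-European (Germanic)": ["af", "da", "en", "de", "sv"],
    "Indo-European (Romance)": ["fr", "it", "pt", "ro", "es"],
    "Indo-European (Greek)": ["el"],
    "Indo-European (Iranian)": ["os", "tg", "fa"],
    "Japonic": ["ja"],
    "Kartvelian": ["ka"],
    "Koreanic": ["ko"],
    "Kra-Dai": ["th"],
    "Mongolic": ["bxr", "xal", "mn"],
    "Niger-Congo": ["sw", "yo"],
    "Slavic": ["be", "bg", "ru", "uk", "pl"],
    "Sino-Tibetan": ["my"],
    "Turkic (Karluk)": ["uz"],
    "Turkic (Kipchak)": ["ba", "kk", "ky", "tt"],
    "Turkic (Oghuz)": ["az", "cv", "tr", "tk"],
    "Turkic (Siberian)": ["tyv", "sax"],
    "Uralic": ["et", "fi", "hu"],
}


def summarize(families):
    """Return (number of families, number of distinct language codes)."""
    codes = {code for group in families.values() for code in group}
    return len(families), len(codes)


def family_of(code, families=FAMILIES):
    """Return the family that lists `code`, or None if the code is absent."""
    for family, codes in families.items():
        if code in codes:
            return family
    return None


print(summarize(FAMILIES))  # (25, 61)
print(family_of("tt"))      # Turkic (Kipchak)
```

The totals agree with the 61 languages and 25 families stated in the introduction.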