Training languages in the model card

#9
by fyvo - opened
BigScience Workshop org
edited Jul 4, 2022

The model card does not show the proportion of Arabic in the training data. The distribution of languages from the Niger-Congo family contains 'Kuganda', a probable misspelling of 'Luganda', spoken in Uganda. It is difficult to tell, as the corpora for Niger-Congo languages are not documented individually.

fyvo changed pull request status to open
BigScience Workshop org

Thanks for pointing this out!
I think it is worth opening a PR on the main bloom repo as well, since the model cards were copied from there.
Also cc-ing @cakiki in case I missed anything.

