Model Description
This model is distilbert/distilbert-base-multilingual-cased fine-tuned to split input text into paragraphs (or chunks).
The model was trained only on data from the English Wikipedia. Thanks to cross-lingual transfer, it can also chunk text in other languages, though accuracy varies widely by language.
F1 Score per Language
The following F1 scores were calculated from the model's predictions on 10,000 random samples or 20% of all data per language, whichever was smaller. The 104 selected languages are the same ones the base model was trained on, and the full dataset was created from the Hugging Face Wikipedia dataset.
| Language | F1 Score |
|---|---|
| English | 0.802 |
| Tatar | 0.798 |
| Urdu | 0.769 |
| Scots | 0.768 |
| Dutch | 0.766 |
| Aragonese | 0.76 |
| German | 0.757 |
| Ido | 0.757 |
| Chechen | 0.755 |
| French | 0.749 |
| Portuguese | 0.748 |
| Bosnian | 0.74 |
| Galician | 0.74 |
| Spanish | 0.736 |
| Slovak | 0.734 |
| Asturian | 0.723 |
| Minangkabau | 0.723 |
| Polish | 0.722 |
| Bavarian | 0.72 |
| Italian | 0.719 |
| Romanian | 0.719 |
| Malay | 0.717 |
| Catalan | 0.713 |
| Norwegian | 0.712 |
| Hungarian | 0.71 |
| Belarusian | 0.708 |
| Slovenian | 0.708 |
| Bulgarian | 0.705 |
| Danish | 0.703 |
| Czech | 0.697 |
| Greek | 0.697 |
| Western Frisian | 0.696 |
| Russian | 0.696 |
| Persian | 0.693 |
| Ukrainian | 0.693 |
| Indonesian | 0.692 |
| Finnish | 0.688 |
| Norwegian Nynorsk | 0.687 |
| Swedish | 0.686 |
| Occitan | 0.685 |
| Tagalog | 0.684 |
| Chuvash | 0.68 |
| Afrikaans | 0.676 |
| Macedonian | 0.675 |
| Piedmontese | 0.666 |
| Turkish | 0.665 |
| Croatian | 0.664 |
| Estonian | 0.663 |
| Sicilian | 0.663 |
| Sundanese | 0.66 |
| Luxembourgish | 0.657 |
| Lombard | 0.656 |
| Breton | 0.654 |
| Haitian Creole | 0.652 |
| Bashkir | 0.648 |
| Armenian | 0.646 |
| Vietnamese | 0.645 |
| Icelandic | 0.643 |
| Basque | 0.641 |
| Lithuanian | 0.638 |
| Serbian | 0.637 |
| Azerbaijani | 0.634 |
| Irish | 0.628 |
| Latin | 0.626 |
| Swahili | 0.625 |
| Javanese | 0.622 |
| Bangla | 0.621 |
| Uzbek | 0.621 |
| Latvian | 0.62 |
| Arabic | 0.618 |
| Hebrew | 0.618 |
| Kazakh | 0.611 |
| Albanian | 0.608 |
| Telugu | 0.607 |
| Volapük | 0.602 |
| Kyrgyz | 0.592 |
| Tamil | 0.589 |
| Mongolian | 0.584 |
| Punjabi | 0.581 |
| Low Saxon | 0.577 |
| Marathi | 0.576 |
| Georgian | 0.566 |
| Tajik | 0.565 |
| Welsh | 0.562 |
| Serbo-Croatian | 0.557 |
| Yoruba | 0.551 |
| Gujarati | 0.534 |
| Bishnupriya | 0.525 |
| Hindi | 0.518 |
| South Azerbaijani | 0.489 |
| Newari | 0.465 |
| Malagasy | 0.456 |
| Western Punjabi | 0.442 |
| Korean | 0.438 |
| Nepali | 0.404 |
| Malayalam | 0.394 |
| Kannada | 0.356 |
| Waray | 0.261 |
| Burmese | 0.22 |
| Thai | 0.157 |
| Chinese | 0.119 |
| Cebuano | 0.084 |
| Japanese | 0.054 |
| Classical Chinese | 0.05 |
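
The sampling rule and the metric described above the table can be illustrated with a minimal sketch. This is not the evaluation code used for this model: the function names `eval_sample_size` and `binary_f1` are hypothetical, and the sketch assumes F1 is computed over binary per-position labels (1 = a paragraph boundary, 0 = no boundary), which the model card does not specify.

```python
def eval_sample_size(n_available: int, cap: int = 10_000, percent: int = 20) -> int:
    """Evaluation samples for one language: the smaller of `cap` and `percent`% of the data.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    return min(cap, n_available * percent // 100)


def binary_f1(y_true: list[int], y_pred: list[int]) -> float:
    """F1 score for binary labels (assumption: 1 marks a paragraph boundary)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # boundaries correctly predicted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # spurious boundaries
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed boundaries
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# A language with 30,000 articles is evaluated on 20% of them;
# a language with 1,000,000 articles is capped at 10,000.
small = eval_sample_size(30_000)
large = eval_sample_size(1_000_000)

# Toy labels: three true boundaries, two of which the prediction finds.
score = binary_f1([1, 0, 1, 1], [1, 0, 0, 1])
```

Under this rule, languages with small Wikipedias are scored on fewer samples, so their F1 values carry more sampling noise than those of large languages.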