Model Description

A fine-tuned version of distilbert/distilbert-base-multilingual-cased that splits arbitrary input text into paragraphs (or chunks).

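This model was trained only on data from the English Wikipedia. Thanks to cross-lingual transfer, however, it can also chunk text in other languages, although quality varies widely by language: the F1 scores listed below range from 0.802 for English down to 0.05 for Classical Chinese.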
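To illustrate what "splitting into chunks" means in practice, here is a minimal sketch that cuts a text at a set of predicted boundary positions. It assumes the model's token-level predictions have already been converted into character offsets marking where each paragraph ends (for example, the `end` offsets returned by a token-classification pipeline). The helper name `split_at_boundaries` and the offset format are hypothetical and not taken from this model card.

```python
from typing import List


def split_at_boundaries(text: str, boundary_ends: List[int]) -> List[str]:
    """Cut `text` after each character offset in `boundary_ends`.

    Assumption: the offsets were derived from the model's predictions;
    this helper only performs the final slicing step.
    """
    chunks = []
    start = 0
    for end in sorted(set(boundary_ends)):
        # Skip offsets that are out of range or would yield an empty slice.
        if end <= start or end > len(text):
            continue
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    # Whatever follows the last boundary becomes the final chunk.
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks


text = "First topic. More on it. Second topic starts here."
print(split_at_boundaries(text, [24]))
# ['First topic. More on it.', 'Second topic starts here.']
```

Keeping the slicing step separate from the model call makes it easy to test without downloading any weights.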

F1 Score per Language

The following F1 scores were calculated from the model predictions for 10,000 random samples or 20% of all data per language, whichever was smaller. The 104 selected languages are the same ones that the base model was trained on, and the full dataset was created from the HF Wikipedia dataset.
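The sampling rule and the metric can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script used for this model: the card does not say at which granularity F1 was computed, so the sketch assumes binary per-token labels (1 = paragraph boundary, 0 = no boundary). The function names `sample_size` and `boundary_f1` are hypothetical.

```python
from typing import Sequence


def sample_size(n_total: int, cap: int = 10_000) -> int:
    """Return 10,000 or 20% of the available samples, whichever is smaller."""
    return min(cap, n_total // 5)  # n_total // 5 is 20%, rounded down


def boundary_f1(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Binary F1 for the positive (boundary) class.

    Assumption: both sequences hold one 0/1 label per token and are aligned.
    """
    tp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(predicted, reference) if p == 0 and r == 1)
    if tp == 0:
        return 0.0  # no correct boundary predictions
    # Equivalent to the harmonic mean of precision and recall.
    return 2 * tp / (2 * tp + fp + fn)


print(sample_size(200_000))  # 10000 (the cap applies)
print(sample_size(30_000))   # 6000 (20% of the data)
print(round(boundary_f1([1, 0, 1, 0, 1], [1, 0, 0, 1, 1]), 3))  # 0.667
```

In the last call there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.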

Language F1 Score
English 0.802
Tatar 0.798
Urdu 0.769
Scots 0.768
Dutch 0.766
Aragonese 0.76
German 0.757
Ido 0.757
Chechen 0.755
French 0.749
Portuguese 0.748
Bosnian 0.74
Galician 0.74
Spanish 0.736
Slovak 0.734
Asturian 0.723
Minangkabau 0.723
Polish 0.722
Bavarian 0.72
Italian 0.719
Romanian 0.719
Malay 0.717
Catalan 0.713
Norwegian 0.712
Hungarian 0.71
Belarusian 0.708
Slovenian 0.708
Bulgarian 0.705
Danish 0.703
Czech 0.697
Greek 0.697
Western Frisian 0.696
Russian 0.696
Persian 0.693
Ukrainian 0.693
Indonesian 0.692
Finnish 0.688
Norwegian Nynorsk 0.687
Swedish 0.686
Occitan 0.685
Tagalog 0.684
Chuvash 0.68
Afrikaans 0.676
Macedonian 0.675
Piedmontese 0.666
Turkish 0.665
Croatian 0.664
Estonian 0.663
Sicilian 0.663
Sundanese 0.66
Luxembourgish 0.657
Lombard 0.656
Breton 0.654
Haitian Creole 0.652
Bashkir 0.648
Armenian 0.646
Vietnamese 0.645
Icelandic 0.643
Basque 0.641
Lithuanian 0.638
Serbian 0.637
Azerbaijani 0.634
Irish 0.628
Latin 0.626
Swahili 0.625
Javanese 0.622
Bangla 0.621
Uzbek 0.621
Latvian 0.62
Arabic 0.618
Hebrew 0.618
Kazakh 0.611
Albanian 0.608
Telugu 0.607
Volapük 0.602
Kyrgyz 0.592
Tamil 0.589
Mongolian 0.584
Punjabi 0.581
Low Saxon 0.577
Marathi 0.576
Georgian 0.566
Tajik 0.565
Welsh 0.562
Serbo-Croatian 0.557
Yoruba 0.551
Gujarati 0.534
Bishnupriya 0.525
Hindi 0.518
South Azerbaijani 0.489
Newari 0.465
Malagasy 0.456
Western Punjabi 0.442
Korean 0.438
Nepali 0.404
Malayalam 0.394
Kannada 0.356
Waray 0.261
Burmese 0.22
Thai 0.157
Chinese 0.119
Cebuano 0.084
Japanese 0.054
Classical Chinese 0.05