---
library_name: transformers
datasets:
- oscar
- mc4
- rasyosef/amharic-sentences-corpus
language:
- am
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
  example_title: Example 1
- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
  example_title: Example 2
- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
  example_title: Example 3
- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
  example_title: Example 4
---

# bert-medium-amharic

This model has the same architecture as [bert-medium](https://huggingface.co/prajjwal1/bert-medium) and was pretrained from scratch on the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar), [mc4](https://huggingface.co/datasets/mc4), and [amharic-sentences-corpus](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) datasets, for a total of **290 million tokens**. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 28k.

It achieves the following results on the evaluation set:

- `Loss: 2.62`
- `Perplexity: 13.74`

Even though this model has only **40.5 million parameters**, its performance is comparable to that of the roughly 7x larger, `279 million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) multilingual model on the same Amharic evaluation set.

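The reported loss and perplexity are consistent with the common convention that perplexity is the exponential of the mean cross-entropy loss measured in nats. The card does not state this convention explicitly, so the short check below treats it as an assumption; the variable names are illustrative only. It also confirms the "roughly 7x" size ratio from the parameter counts quoted above.

```python
import math

# Values copied from the model card.
eval_loss = 2.62        # reported evaluation loss
model_params = 40.5e6   # bert-medium-amharic
xlmr_params = 279e6     # xlm-roberta-base

# Assumption: perplexity = exp(loss), with the loss measured in nats.
perplexity = math.exp(eval_loss)
size_ratio = xlmr_params / model_params

print(f"perplexity ~ {perplexity:.2f}")   # perplexity ~ 13.74
print(f"size ratio ~ {size_ratio:.1f}x")  # size ratio ~ 6.9x
```
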
# How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-medium-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

[{'score': 0.5135582089424133,
  'token': 9345,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
 {'score': 0.2923661470413208,
  'token': 9617,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
 {'score': 0.09527599066495895,
  'token': 9913,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
 {'score': 0.06960058212280273,
  'token': 10898,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
 {'score': 0.019061630591750145,
  'token': 28157,
  'token_str': '##ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተዓመት ተቆጥሯል ።'}]
```

# Finetuning

This model was finetuned and evaluated on the following Amharic NLP tasks:

- **Sentiment Classification**
  - Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
  - Code: https://github.com/rasyosef/amharic-sentiment-classification
- **Named Entity Recognition**
  - Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
  - Code: https://github.com/rasyosef/amharic-named-entity-recognition

### Finetuned Model Performance

The reported F1 scores are macro averages.

|Model|Size (# params)|Perplexity|Sentiment (F1)|Named Entity Recognition (F1)|
|-----|---------------|----------|--------------|-----------------------------|
|**bert-medium-amharic**|**40.5M**|**13.74**|**0.83**|**0.68**|
|bert-small-amharic|27.8M|15.96|0.83|0.68|
|bert-mini-amharic|10.7M|22.42|0.81|0.64|
|bert-tiny-amharic|4.18M|71.52|0.79|0.54|
|xlm-roberta-base|279M||0.83|0.73|
|am-roberta|443M||0.82|0.69|
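
For readers unfamiliar with macro averaging, the sketch below shows how a macro F1 score can be computed: F1 is calculated separately for each class and the per-class scores are averaged with equal weight. This is a minimal, dependency-free illustration and not the evaluation code used for the table above (see the linked repositories for that). The `macro_f1` helper and the toy labels are hypothetical, and NER evaluation in particular is often done at the entity level rather than per label as shown here.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy sentiment labels, invented for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]

score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")  # macro F1 = 0.5833
```

In this toy case the `pos` class has F1 = 0.5 and the `neg` class has F1 ≈ 0.667, so the macro average is about 0.583 regardless of how many examples each class has.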