---
language:
- am
library_name: transformers
datasets:
- oscar
- mc4
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
  example_title: Example 1
- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
  example_title: Example 2
- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
  example_title: Example 3
- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
  example_title: Example 4
---

# bert-mini-amharic

This model has the same architecture as [bert-mini](https://huggingface.co/prajjwal1/bert-mini) and was pretrained from scratch on the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets, a total of `137 Million` tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 24k.
It achieves the following results on the evaluation set:
- `Loss: 3.11`
- `Perplexity: 22.42`
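
The two metrics are linked: perplexity is the exponential of the cross-entropy loss. A minimal sketch, assuming the reported loss is the mean masked-token cross-entropy in nats (the usual convention, though not stated explicitly here):

```python
import math

eval_loss = 3.11  # evaluation loss reported above (rounded to two decimals)

# Perplexity is exp(loss) when the loss is an average cross-entropy in nats.
perplexity = math.exp(eval_loss)
print(f"{perplexity:.2f}")  # 22.42
```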

Although this model has only `10.7 Million` parameters, its performance is only slightly behind that of the 26x larger `279 Million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) model on the same Amharic evaluation set.

# How to use
You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

[{'score': 0.6525624394416809,
  'token': 9617,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
 {'score': 0.22671808302402496,
  'token': 9345,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
 {'score': 0.07071439921855927,
  'token': 10898,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
 {'score': 0.02838180586695671,
  'token': 9913,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
 {'score': 0.006343209184706211,
  'token': 22459,
  'token_str': 'ዓመታትን',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታትን ተቆጥሯል ።'}]
```
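
Each prediction returned by the pipeline is a dictionary with `score`, `token`, `token_str` and `sequence` keys, as in the output above. The sketch below filters such a list by score. `top_predictions` and its `min_score` threshold are hypothetical helpers introduced for illustration, and the sample data is copied from the output above rather than produced by calling the model.

```python
def top_predictions(results, min_score=0.05):
    """Return (token_str, score) pairs whose score is at least min_score."""
    return [
        (r["token_str"], round(r["score"], 4))
        for r in results
        if r["score"] >= min_score
    ]

# Predictions copied from the output above ('sequence' omitted for brevity).
sample = [
    {"score": 0.6525624394416809, "token": 9617, "token_str": "ዓመታት"},
    {"score": 0.22671808302402496, "token": 9345, "token_str": "ዓመት"},
    {"score": 0.07071439921855927, "token": 10898, "token_str": "አመታት"},
    {"score": 0.02838180586695671, "token": 9913, "token_str": "አመት"},
    {"score": 0.006343209184706211, "token": 22459, "token_str": "ዓመታትን"},
]

# Keep only candidates the model assigns at least 5% probability.
for token_str, score in top_predictions(sample):
    print(token_str, score)
```

With a live pipeline, the same helper could be called as `top_predictions(unmasker(text))`.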

# Finetuning

This model was finetuned and evaluated on the following Amharic NLP tasks:

- **Sentiment Classification**
  - Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
  - Code: https://github.com/rasyosef/amharic-sentiment-classification
- **Named Entity Recognition**
  - Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
  - Code: https://github.com/rasyosef/amharic-named-entity-recognition
- **News Category Classification**
  - Dataset: [amharic-news-category-classification](https://github.com/rasyosef/amharic-news-category-classification)
  - Code: https://github.com/rasyosef/amharic-news-category-classification
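
As a starting point for finetuning on tasks like these, the pretrained checkpoint can be loaded with a task-specific head. This is a minimal sketch and not the code from the repositories above: `load_for_classification` and its `num_labels=3` default are assumptions made for illustration, and the classification head is randomly initialized until it is trained.

```python
def load_for_classification(num_labels=3, model_id="rasyosef/bert-mini-amharic"):
    """Load the tokenizer and the encoder with an untrained classification head."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels  # num_labels is a placeholder value
    )
    return tokenizer, model
```

For named entity recognition, `AutoModelForTokenClassification` plays the same role.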

### Finetuned Model Performance
The reported F1 scores are macro averages.

|Model|Size (# params)| Perplexity|Sentiment (F1)| Named Entity Recognition (F1)|
|-----|---------------|-----------|--------------|------------------------------|
|bert-medium-amharic|40.5M|13.74|0.83|0.68|
|bert-small-amharic|27.8M|15.96|0.83|0.68|
|**bert-mini-amharic**|**10.7M**|**22.42**|**0.81**|**0.64**|
|bert-tiny-amharic|4.18M|71.52|0.79|0.54|
|xlm-roberta-base|279M||0.83|0.73|
|am-roberta|443M||0.82|0.69|
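
A macro-averaged F1 score gives every class equal weight: F1 is computed per class and the per-class values are then averaged. The sketch below illustrates this in plain Python. The `macro_f1` helper and the toy labels are made up for illustration and are not taken from the evaluation data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (hypothetical): per-class F1 is 0.8 (neg), 0.667 (neu), 0.5 (pos).
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.656
```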

### Amharic News Category Classification

|Model|Size (# params)|Accuracy|Precision|Recall|F1|
|-----|--------------|--------|---------|------|--|
|bert-small-amharic|25.7M|0.89|0.86|0.87|0.86|
|**bert-mini-amharic**|9.67M|0.87|0.83|0.83|0.83|
|xlm-roberta-base|279M|0.90|0.88|0.88|0.88|