---
language:
- am
library_name: transformers
datasets:
- oscar
- mc4
- rasyosef/amharic-sentences-corpus
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
  example_title: Example 1
- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
  example_title: Example 2
- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
  example_title: Example 3
- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
  example_title: Example 4
---

# bert-small-amharic

This model has the same architecture as [bert-small](https://huggingface.co/prajjwal1/bert-small) and was pretrained from scratch on the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar), [mc4](https://huggingface.co/datasets/mc4), and [amharic-sentences-corpus](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) datasets, for a total of `290 Million` tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 28k.

It achieves the following results on the evaluation set:

- `Loss: 2.77`
- `Perplexity: 15.96`
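
The two numbers above are related: perplexity is the exponential of the mean cross-entropy loss. The following is a minimal sketch of that arithmetic, assuming the reported loss is measured in nats (natural logarithm). The variable names are illustrative only.

```python
import math

# Evaluation loss reported above (assumed to be mean cross-entropy in nats).
eval_loss = 2.77

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(eval_loss)

print(f"perplexity ≈ {perplexity:.2f}")
```

Because the reported loss is itself rounded to two decimals, recomputing perplexity this way can differ slightly from the value logged during evaluation.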

Even though this model has only `27.8 Million` parameters, its performance is comparable to that of the 10x larger `279 Million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) multilingual model on the same Amharic evaluation set.

# How to use
You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-small-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

[{'score': 0.5164287686347961,
  'token': 9345,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
 {'score': 0.2229153960943222,
  'token': 9913,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
 {'score': 0.1806153655052185,
  'token': 9617,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
 {'score': 0.05486353859305382,
  'token': 10898,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
 {'score': 0.014157092198729515,
  'token': 28157,
  'token_str': '##ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተዓመት ተቆጥሯል ።'}]
```
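
The last prediction in the output above (`##ዓመት`) is a WordPiece continuation piece, which attaches to the preceding word rather than standing alone. If only whole-word suggestions are wanted, the pipeline output can be post-processed. The sketch below is one possible approach, not part of the model or the `transformers` API: `keep_whole_words` is a hypothetical helper, and the sample list is an abbreviated copy of the output above with rounded scores.

```python
def keep_whole_words(predictions):
    """Drop fill-mask predictions whose token is a WordPiece continuation ('##...')."""
    return [p for p in predictions if not p["token_str"].startswith("##")]


# Abbreviated copy of the pipeline output shown above (scores rounded for brevity).
predictions = [
    {"score": 0.5164, "token": 9345, "token_str": "ዓመት"},
    {"score": 0.2229, "token": 9913, "token_str": "አመት"},
    {"score": 0.1806, "token": 9617, "token_str": "ዓመታት"},
    {"score": 0.0549, "token": 10898, "token_str": "አመታት"},
    {"score": 0.0142, "token": 28157, "token_str": "##ዓመት"},
]

whole_words = keep_whole_words(predictions)
print([p["token_str"] for p in whole_words])
```

With a live pipeline, the same helper could be applied directly to the result of `unmasker(...)` for a single-mask input, since that result is a list of dictionaries with the same keys.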

# Finetuning

This model was finetuned and evaluated on the following Amharic NLP tasks:

- **Sentiment Classification**
  - Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
  - Code: https://github.com/rasyosef/amharic-sentiment-classification
- **Named Entity Recognition**
  - Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
  - Code: https://github.com/rasyosef/amharic-named-entity-recognition
- **News Category Classification**
  - Dataset: [amharic-news-category-classification](https://github.com/rasyosef/amharic-news-category-classification)
  - Code: https://github.com/rasyosef/amharic-news-category-classification

### Finetuned Model Performance
The reported F1 scores are macro averages.

|Model|Size (# params)|Perplexity|Sentiment (F1)|Named Entity Recognition (F1)|
|-----|---------------|----------|--------------|-----------------------------|
|bert-medium-amharic|40.5M|13.74|0.83|0.68|
|**bert-small-amharic**|**27.8M**|**15.96**|**0.83**|**0.68**|
|bert-mini-amharic|10.7M|22.42|0.81|0.64|
|bert-tiny-amharic|4.18M|71.52|0.79|0.54|
|xlm-roberta-base|279M||0.83|0.73|
|am-roberta|443M||0.82|0.69|
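
For readers unfamiliar with macro averaging, the sketch below shows how a macro F1 score can be computed for a classification task: F1 is calculated separately for each class and the per-class values are averaged with equal weight. This is an illustration under stated assumptions, not the evaluation code from the linked repositories. The function `macro_f1` and the toy labels are hypothetical, and NER scores are usually computed at the entity level, which this sketch does not cover.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, invented purely for illustration.
y_true = ["pos", "pos", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "neg", "pos"]

score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Because every class contributes equally, macro F1 penalizes poor performance on rare classes more than accuracy or a support-weighted average would.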

### Amharic News Category Classification

|Model|Accuracy|Precision|Recall|F1|
|-----|--------|---------|------|--|
|**bert-small-amharic**|0.89|0.86|0.87|0.86|
|bert-mini-amharic|0.87|0.83|0.83|0.83|
|xlm-roberta-base|0.9|0.88|0.88|0.88|