---
language:
- am
library_name: transformers
datasets:
- oscar
- mc4
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
example_title: Example 1
- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
example_title: Example 2
- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
example_title: Example 3
- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
example_title: Example 4
---
# bert-mini-amharic
This model has the same architecture as [bert-mini](https://huggingface.co/prajjwal1/bert-mini) and was pretrained from scratch on the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets, totaling `137 Million` tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 24k.
It achieves the following results on the evaluation set:
- `Loss: 3.11`
- `Perplexity: 22.42`
Even though this model has only `10.7 Million` parameters, its performance is only slightly behind that of the 26x larger, `279 Million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) model on the same Amharic evaluation set.
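The two evaluation numbers above are consistent with each other: perplexity is just the exponential of the mean cross-entropy loss on the evaluation set.

```python
import math

# Perplexity of a language model is the exponential of its
# mean cross-entropy loss on the evaluation set.
eval_loss = 3.11
perplexity = math.exp(eval_loss)

print(round(perplexity, 2))  # ≈ 22.42, matching the reported value
```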
# How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")
[{'score': 0.6525624394416809,
'token': 9617,
'token_str': 'ዓመታት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
{'score': 0.22671808302402496,
'token': 9345,
'token_str': 'ዓመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
{'score': 0.07071439921855927,
'token': 10898,
'token_str': 'አመታት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
{'score': 0.02838180586695671,
'token': 9913,
'token_str': 'አመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
{'score': 0.006343209184706211,
'token': 22459,
'token_str': 'ዓመታትን',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታትን ተቆጥሯል ።'}]
```
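Under the hood, the pipeline ranks vocabulary tokens at the `[MASK]` position by their softmax probability. A minimal, self-contained sketch of that ranking step (the logits below are made up for illustration and are not the model's actual scores):

```python
import math

def top_k_predictions(logits, tokens, k=5):
    """Rank candidate tokens for a [MASK] position by softmax probability."""
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    scored = sorted(zip(tokens, (e / total for e in exps)),
                    key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical logits for a handful of candidate tokens
tokens = ['ዓመታት', 'ዓመት', 'አመታት', 'አመት', 'ዓመታትን']
logits = [5.2, 4.1, 2.9, 2.0, 0.5]
for token, score in top_k_predictions(logits, tokens, k=3):
    print(token, round(score, 3))
```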
# Finetuning
This model was finetuned and evaluated on the following Amharic NLP tasks:
- **Sentiment Classification**
- Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
- Code: https://github.com/rasyosef/amharic-sentiment-classification
- **Named Entity Recognition**
- Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
- Code: https://github.com/rasyosef/amharic-named-entity-recognition
- **News Category Classification**
- Dataset: [amharic-news-category-classification](https://github.com/rasyosef/amharic-news-category-classification)
- Code: https://github.com/rasyosef/amharic-news-category-classification
### Finetuned Model Performance
The reported F1 scores are macro averages.
|Model|Size (# params)| Perplexity|Sentiment (F1)| Named Entity Recognition (F1)|
|-----|---------------|-----------|--------------|------------------------------|
|bert-medium-amharic|40.5M|13.74|0.83|0.68|
|bert-small-amharic|27.8M|15.96|0.83|0.68|
|**bert-mini-amharic**|**10.7M**|**22.42**|**0.81**|**0.64**|
|bert-tiny-amharic|4.18M|71.52|0.79|0.54|
|xlm-roberta-base|279M|-|0.83|0.73|
|am-roberta|443M|-|0.82|0.69|
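Macro averaging, as used for the F1 scores above, gives every class equal weight regardless of how many examples it has. A minimal sketch of the computation (the labels below are toy values, not the actual evaluation data):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    classes = set(y_true) | set(y_pred)
    f1_scores = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy example with two imbalanced classes
print(round(macro_f1([0, 0, 0, 1], [0, 0, 1, 1]), 3))  # → 0.733
```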
### Amharic News Category Classification
|Model|Size(# params)|Accuracy|Precision|Recall|F1|
|-----|--------------|--------|---------|------|--|
|bert-small-amharic|25.7M|0.89|0.86|0.87|0.86|
|**bert-mini-amharic**|9.67M|0.87|0.83|0.83|0.83|
|xlm-roberta-base|279M|0.90|0.88|0.88|0.88|