---
language:
- am
library_name: transformers
datasets:
- oscar
- mc4
- rasyosef/amharic-sentences-corpus
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
example_title: Example 1
- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
example_title: Example 2
- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
example_title: Example 3
- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
example_title: Example 4
---
# bert-small-amharic
This model has the same architecture as [bert-small](https://huggingface.co/prajjwal1/bert-small) and was pretrained from scratch using the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar), [mc4](https://huggingface.co/datasets/mc4), and [amharic-sentences-corpus](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) datasets, on a total of `290 Million` tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 28k.
It achieves the following results on the evaluation set:
- `Loss: 2.77`
- `Perplexity: 15.96`
Even though this model only has `27.8 Million` parameters, its performance is comparable to the 10x larger `279 Million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) multilingual model on the same Amharic evaluation set.
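The reported perplexity is simply the exponential of the evaluation cross-entropy loss, which can be verified directly:

```python
import math

eval_loss = 2.77
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # 15.96
```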
# How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-small-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")
[{'score': 0.5164287686347961,
'token': 9345,
'token_str': 'ዓመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
{'score': 0.2229153960943222,
'token': 9913,
'token_str': 'አመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
{'score': 0.1806153655052185,
'token': 9617,
'token_str': 'ዓመታት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
{'score': 0.05486353859305382,
'token': 10898,
'token_str': 'አመታት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
{'score': 0.014157092198729515,
'token': 28157,
'token_str': '##ዓመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተዓመት ተቆጥሯል ።'}]
```
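If you need more control than the pipeline offers (e.g. for feature extraction or custom decoding), you can also load the tokenizer and model directly. The snippet below is a minimal sketch that reproduces the pipeline's top prediction by hand:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-small-amharic")
model = AutoModelForMaskedLM.from_pretrained("rasyosef/bert-small-amharic")

text = "ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# locate the [MASK] position and take the highest-scoring token there
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_id = logits[0, mask_index].argmax(dim=-1)
top_token = tokenizer.decode(top_id)
print(top_token)  # 'ዓመት', matching the pipeline output above
```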
# Finetuning
This model was finetuned and evaluated on the following Amharic NLP tasks:
- **Sentiment Classification**
- Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
- Code: https://github.com/rasyosef/amharic-sentiment-classification
- **Named Entity Recognition**
- Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
- Code: https://github.com/rasyosef/amharic-named-entity-recognition
- **News Category Classification**
- Dataset: [amharic-news-category-classification](https://github.com/rasyosef/amharic-news-category-classification)
- Code: https://github.com/rasyosef/amharic-news-category-classification
### Finetuned Model Performance
The reported F1 scores are macro averages.
|Model|Size (# params)| Perplexity|Sentiment (F1)| Named Entity Recognition (F1)|
|-----|---------------|-----------|--------------|------------------------------|
|bert-medium-amharic|40.5M|13.74|0.83|0.68|
|**bert-small-amharic**|**27.8M**|**15.96**|**0.83**|**0.68**|
|bert-mini-amharic|10.7M|22.42|0.81|0.64|
|bert-tiny-amharic|4.18M|71.52|0.79|0.54|
|xlm-roberta-base|279M|-|0.83|0.73|
|am-roberta|443M|-|0.82|0.69|
### Amharic News Category Classification
|Model|Accuracy|Precision|Recall|F1|
|-----|--------|---------|------|--|
|**bert-small-amharic**|0.89|0.86|0.87|0.86|
|bert-mini-amharic|0.87|0.83|0.83|0.83|
|xlm-roberta-base|0.90|0.88|0.88|0.88|