	---
	language:
	- am
	library_name: transformers
	datasets:
	- oscar
	- mc4
	metrics:
	- perplexity
	pipeline_tag: fill-mask
	widget:
	- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
	  example_title: Example 1
	- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
	  example_title: Example 2
	- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
	  example_title: Example 3
	- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
	  example_title: Example 4
	---

	# bert-mini-amharic

	This model has the same architecture as [bert-mini](https://huggingface.co/prajjwal1/bert-mini) and was pretrained from scratch on the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets, a total of `137 Million` tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 24k.
	It achieves the following results on the evaluation set:
	- `Loss: 3.11`
	- `Perplexity: 22.42`
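
The two numbers above are consistent with each other if the loss is the mean per-token cross-entropy in nats, in which case perplexity is simply its exponential. That convention is an assumption here (the card does not state it), and the snippet below is only a quick sanity check on the rounded values:

```python
import math

# Evaluation loss as reported above (rounded to two decimals).
eval_loss = 3.11

# Assumption: the loss is mean cross-entropy in nats, so perplexity = exp(loss).
perplexity = math.exp(eval_loss)
print(f"{perplexity:.2f}")  # prints 22.42
```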

	Although this model has just `10.7 Million` parameters, it performs only slightly behind the 26x larger `279 Million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) model on the same Amharic evaluation set.

	# How to use
	You can use this model directly with a pipeline for masked language modeling:

	```python
	>>> from transformers import pipeline
	>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
	>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

	[{'score': 0.6525624394416809,
	  'token': 9617,
	  'token_str': 'ዓመታት',
	  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
	 {'score': 0.22671808302402496,
	  'token': 9345,
	  'token_str': 'ዓመት',
	  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
	 {'score': 0.07071439921855927,
	  'token': 10898,
	  'token_str': 'አመታት',
	  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
	 {'score': 0.02838180586695671,
	  'token': 9913,
	  'token_str': 'አመት',
	  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
	 {'score': 0.006343209184706211,
	  'token': 22459,
	  'token_str': 'ዓመታትን',
	  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታትን ተቆጥሯል ።'}]
	```
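
The pipeline returns a list of dictionaries, so the predictions are easy to post-process. As a minimal sketch, the helper below keeps only candidates above a score threshold. The name `confident_tokens` and the `min_score` default of 0.05 are illustrative choices, not part of `transformers`, and the sample data is copied from the output shown above so the snippet runs without downloading the model:

```python
def confident_tokens(predictions, min_score=0.05):
    """Return (token_str, rounded score) pairs with score >= min_score."""
    return [
        (p["token_str"], round(p["score"], 4))
        for p in predictions
        if p["score"] >= min_score
    ]

# Predictions copied from the example output above (only the fields used here).
predictions = [
    {"score": 0.6525624394416809, "token_str": "ዓመታት"},
    {"score": 0.22671808302402496, "token_str": "ዓመት"},
    {"score": 0.07071439921855927, "token_str": "አመታት"},
    {"score": 0.02838180586695671, "token_str": "አመት"},
    {"score": 0.006343209184706211, "token_str": "ዓመታትን"},
]

# Hypothetical threshold: keep candidates scoring at least 0.05.
for token, score in confident_tokens(predictions):
    print(token, score)
```

With a live pipeline, the same helper could be applied to the return value of `unmasker(...)` directly.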

	# Finetuning

	This model was finetuned and evaluated on the following Amharic NLP tasks:

	- Sentiment Classification
	  - Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
	  - Code: https://github.com/rasyosef/amharic-sentiment-classification
	- Named Entity Recognition
	  - Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
	  - Code: https://github.com/rasyosef/amharic-named-entity-recognition
	- News Category Classification
	  - Dataset: [amharic-news-category-classification](https://github.com/rasyosef/amharic-news-category-classification)
	  - Code: https://github.com/rasyosef/amharic-news-category-classification

	### Finetuned Model Performance
	The reported F1 scores are macro averages.

	|Model|Size (# params)|Perplexity|Sentiment (F1)|Named Entity Recognition (F1)|
	|-----|---------------|----------|--------------|-----------------------------|
	|bert-medium-amharic|40.5M|13.74|0.83|0.68|
	|bert-small-amharic|27.8M|15.96|0.83|0.68|
	|bert-mini-amharic|10.7M|22.42|0.81|0.64|
	|bert-tiny-amharic|4.18M|71.52|0.79|0.54|
	|xlm-roberta-base|279M| |0.83|0.73|
	|am-roberta|443M| |0.82|0.69|
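
For readers unfamiliar with the term, a macro-averaged F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The sketch below illustrates this for a single-label classification task using only the standard library. It is not the evaluation code behind the table (that lives in the linked repositories and may differ, for example NER is often scored at the entity level), and the function name `macro_f1` and the toy labels are made up for illustration:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy labels, made up for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "neu"]
print(round(macro_f1(y_true, y_pred), 4))  # prints 0.6667
```

In this toy case every class happens to have an F1 of 2/3, so the macro average is also 2/3.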

	### Amharic News Category Classification

	|Model|Size (# params)|Accuracy|Precision|Recall|F1|
	|-----|---------------|--------|---------|------|--|
	|bert-small-amharic|25.7M|0.89|0.86|0.87|0.86|
	|bert-mini-amharic|9.67M|0.87|0.83|0.83|0.83|
	|xlm-roberta-base|279M|0.90|0.88|0.88|0.88|