|
--- |
|
language: |
|
- fa |
|
library_name: transformers |
|
widget: |
|
- text: "ز سوزناکی گفتار من [MASK] بگریست" |
|
example_title: "Poetry 1" |
|
- text: "نظر از تو برنگیرم همه [MASK] تا بمیرم که تو در دلم نشستی و سر مقام داری" |
|
example_title: "Poetry 2" |
|
- text: "هر ساعتم اندرون بجوشد [MASK] را وآگاهی نیست مردم بیرون را" |
|
example_title: "Poetry 3" |
|
- text: "غلام همت آن رند عافیت سوزم که در گدا صفتی [MASK] داند" |
|
example_title: "Poetry 4" |
|
- text: "این [MASK] اولشه." |
|
example_title: "Informal 1" |
|
- text: "دیگه خسته شدم! [MASK] اینم شد کار؟!" |
|
example_title: "Informal 2" |
|
- text: "فکر نکنم به موقع برسیم. بهتره [MASK] این یکی بشیم." |
|
example_title: "Informal 3" |
|
- text: "تا صبح بیدار موندم و داشتم برای [MASK] آماده می شدم." |
|
example_title: "Informal 4" |
|
- text: "زندگی بدون [MASK] خستهکننده است." |
|
example_title: "Formal 1" |
|
- text: "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت [MASK] شد." |
|
example_title: "Formal 2" |
|
--- |
|
|
|
|
|
# FaBERT: Pre-training BERT on Persian Blogs |
|
|
|
## Model Details |
|
|
|
FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which covers both informal and formal Persian text. Across a range of Natural Language Understanding (NLU) tasks, FaBERT consistently delivers notable improvements over comparable Persian and multilingual models while keeping a compact model size. The model is available on Hugging Face and can be used directly with the standard `transformers` API.
|
|
|
## Features |
|
- Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
|
- Remarkable performance across various downstream NLP tasks |
|
- BERT architecture with 124 million parameters |
|
|
|
## Useful Links |
|
- **Repository:** [FaBERT on GitHub](https://github.com/SBU-NLP-LAB/FaBERT)
|
- **Paper:** [arXiv preprint](https://arxiv.org/abs/2402.06617) |
|
|
|
## Usage |
|
|
|
### Loading the Model with MLM head |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert") # make sure to use the default fast tokenizer |
|
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert") |
|
``` |
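
For a quick sanity check, the model can also be loaded through the standard `fill-mask` pipeline. The sketch below reuses one of the widget examples above; the `top_k` value and printed fields are just the generic pipeline output, nothing FaBERT-specific.

```python
from transformers import pipeline

# Build a fill-mask pipeline on top of FaBERT (uses the default fast tokenizer).
fill_mask = pipeline("fill-mask", model="sbunlp/fabert")

# Widget example above: "Life without [MASK] is boring."
predictions = fill_mask("زندگی بدون [MASK] خسته‌کننده است.", top_k=5)

for pred in predictions:
    # Each prediction contains the candidate token and its score.
    print(pred["token_str"], round(pred["score"], 3))
```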
|
### Downstream Tasks |
|
|
|
Similar to the original English BERT, FaBERT can be [fine-tuned](https://huggingface.co/docs/transformers/en/training) on many downstream tasks.
|
|
|
Examples on Persian datasets are available in our [GitHub repository](#useful-links). |
|
|
|
**Make sure to use the default fast tokenizer.**
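
For downstream fine-tuning, the usual `transformers` `Trainer` workflow applies. The sketch below is only an illustration of that generic setup, not a recipe from the paper: the tiny in-memory dataset, label count, and hyperparameters are placeholders to be replaced with a real Persian task.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# The default fast tokenizer is the one to use with FaBERT.
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
model = AutoModelForSequenceClassification.from_pretrained("sbunlp/fabert", num_labels=2)

# Tiny placeholder dataset; swap in a real Persian classification dataset.
train_ds = Dataset.from_dict({
    "text": ["این فیلم عالی بود.", "اصلا خوشم نیامد."],  # "This movie was great." / "I did not like it at all."
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="fabert-finetuned",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

# With a tokenizer passed in, the Trainer pads each batch dynamically.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer)
trainer.train()
```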
|
|
|
## Training Details |
|
|
|
FaBERT was pre-trained with the Masked Language Modeling objective using Whole Word Masking (MLM with WWM), and the resulting perplexity on the validation set was 7.76.
|
|
|
| Hyperparameter | Value | |
|
|-------------------|:--------------:| |
|
| Batch Size | 32 | |
|
| Optimizer | Adam | |
|
| Learning Rate | 6e-5 | |
|
| Weight Decay | 0.01 | |
|
| Total Steps | 18 Million | |
|
| Warmup Steps | 1.8 Million | |
|
| Precision Format | TF32 | |
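
The exact pre-training pipeline is not reproduced here, but as a rough illustration of the whole word masking objective, `transformers` provides `DataCollatorForWholeWordMask`, which masks every WordPiece of a selected word together. The snippet below only shows how such a collator pairs with FaBERT's tokenizer; the sample sentence and the 15% masking probability are illustrative assumptions, not the authors' configuration.

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")

# Masks all WordPiece sub-tokens of a chosen word together and lets the model
# predict them, i.e. the MLM (WWM) objective described above.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# Illustrative sample sentence ("This is a sample sentence for testing.").
encoded = tokenizer(["این یک جمله نمونه برای آزمایش است."])
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])

print(batch["input_ids"])  # inputs with some whole words replaced by [MASK]
print(batch["labels"])     # original token ids at masked positions, -100 elsewhere
```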
|
|
|
## Evaluation |
|
|
|
Here are some key performance results for the FaBERT model: |
|
|
|
**Sentiment Analysis** |
|
| Task | FaBERT | ParsBERT | XLM-R | |
|
|:-------------|:------:|:--------:|:-----:| |
|
| MirasOpinion | **87.51** | 86.73 | 84.92 | |
|
| MirasIrony | 74.82 | 71.08 | **75.51** | |
|
| DeepSentiPers | **79.85** | 74.94 | 79.00 | |
|
|
|
**Named Entity Recognition** |
|
| Task | FaBERT | ParsBERT | XLM-R | |
|
|:-------------|:------:|:--------:|:-----:| |
|
| PEYMA | **91.39** | 91.24 | 90.91 | |
|
| ParsTwiner | **82.22** | 81.13 | 79.50 | |
|
| MultiCoNER v2 | 57.92 | **58.09** | 51.47 | |
|
|
|
**Question Answering** |
|
| Task | FaBERT | ParsBERT | XLM-R | |
|
|:-------------|:------:|:--------:|:-----:| |
|
| ParsiNLU | **55.87** | 44.89 | 42.55 | |
|
| PQuAD | 87.34 | 86.89 | **87.60** | |
|
| PCoQA | **53.51** | 50.96 | 51.12 | |
|
|
|
**Natural Language Inference & QQP** |
|
| Task | FaBERT | ParsBERT | XLM-R | |
|
|:-------------|:------:|:--------:|:-----:| |
|
| FarsTail | **84.45** | 82.52 | 83.50 | |
|
| SBU-NLI | **66.65** | 58.41 | 58.85 | |
|
| ParsiNLU QQP | **82.62** | 77.60 | 79.74 | |
|
|
|
**Model Size**
|
| | FaBERT | ParsBERT | XLM-R | |
|
|:-------------|:------:|:--------:|:-----:| |
|
| Parameter Count (M) | 124 | 162 | 278 | |
|
| Vocabulary Size (K) | 50 | 100 | 250 | |
|
|
|
For a more detailed performance analysis, refer to [the paper](https://arxiv.org/abs/2402.06617).
|
|
|
## How to Cite |
|
|
|
If you use FaBERT in your research or projects, please cite it using the following BibTeX: |
|
|
|
```bibtex |
|
@article{masumi2024fabert, |
|
title={FaBERT: Pre-training BERT on Persian Blogs}, |
|
author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid}, |
|
journal={arXiv preprint arXiv:2402.06617}, |
|
year={2024} |
|
} |
|
``` |
|
|