FaBERT: Pre-training BERT on Persian Blogs

Model Details

FaBERT is a Persian BERT-base model trained on the diverse HmBlogs corpus, encompassing both casual and formal Persian texts. Developed for natural language processing tasks, FaBERT is a robust solution for processing Persian text. Through evaluation across various Natural Language Understanding (NLU) tasks, FaBERT consistently demonstrates notable improvements, while having a compact model size. Now available on Hugging Face, integrating FaBERT into your projects is hassle-free. Experience enhanced performance without added complexity as FaBERT tackles a variety of NLP tasks.

Features

Pre-trained on the diverse HmBlogs corpus consisting more than 50 GB of text from Persian Blogs
Remarkable performance across various downstream NLP tasks
BERT architecture with 124 million parameters

Useful Links

Repository: FaBERT on Github
Paper: arXiv preprint

Usage

Loading the Model with MLM head

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert") # make sure to use the default fast tokenizer
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")

Downstream Tasks

Similar to the original English BERT, FaBERT can be fine-tuned on many downstream tasks.(https://huggingface.co/docs/transformers/en/training)

Examples on Persian datasets are available in our GitHub repository.

make sure to use the default Fast Tokenizer

Training Details

FaBERT was pre-trained with the MLM (WWM) objective, and the resulting perplexity on validation set was 7.76.

Hyperparameter	Value
Batch Size	32
Optimizer	Adam
Learning Rate	6e-5
Weight Decay	0.01
Total Steps	18 Million
Warmup Steps	1.8 Million
Precision Format	TF32

Evaluation

Here are some key performance results for the FaBERT model:

Sentiment Analysis

Task	FaBERT	ParsBERT	XLM-R
MirasOpinion	87.51	86.73	84.92
MirasIrony	74.82	71.08	75.51
DeepSentiPers	79.85	74.94	79.00

Named Entity Recognition

Task	FaBERT	ParsBERT	XLM-R
PEYMA	91.39	91.24	90.91
ParsTwiner	82.22	81.13	79.50
MultiCoNER v2	57.92	58.09	51.47

Question Answering

Task	FaBERT	ParsBERT	XLM-R
ParsiNLU	55.87	44.89	42.55
PQuAD	87.34	86.89	87.60
PCoQA	53.51	50.96	51.12

Natural Language Inference & QQP

Task	FaBERT	ParsBERT	XLM-R
FarsTail	84.45	82.52	83.50
SBU-NLI	66.65	58.41	58.85
ParsiNLU QQP	82.62	77.60	79.74

Number of Parameters

	FaBERT	ParsBERT	XLM-R
Parameter Count (M)	124	162	278
Vocabulary Size (K)	50	100	250

For a more detailed performance analysis refer to the paper.

How to Cite

If you use FaBERT in your research or projects, please cite it using the following BibTeX:

@article{masumi2024fabert,
  title={FaBERT: Pre-training BERT on Persian Blogs},
  author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid},
  journal={arXiv preprint arXiv:2402.06617},
  year={2024}
}