---
language:
- fa
library_name: transformers
widget:
- text: "ز سوزناکی گفتار من [MASK] بگریست"
  example_title: "Poetry 1"
- text: "نظر از تو برنگیرم همه [MASK] تا بمیرم که تو در دلم نشستی و سر مقام داری"
  example_title: "Poetry 2"
- text: "هر ساعتم اندرون بجوشد [MASK] را وآگاهی نیست مردم بیرون را"
  example_title: "Poetry 3"
- text: "غلام همت آن رند عافیت سوزم که در گدا صفتی [MASK] داند"
  example_title: "Poetry 4"
- text: "این [MASK] اولشه."
  example_title: "Informal 1"
- text: "دیگه خسته شدم! [MASK] اینم شد کار؟!"
  example_title: "Informal 2"
- text: "فکر نکنم به موقع برسیم. بهتره [MASK] این یکی بشیم."
  example_title: "Informal 3"
- text: "تا صبح بیدار موندم و داشتم برای [MASK] آماده می شدم."
  example_title: "Informal 4"
- text: "زندگی بدون [MASK] خسته‌کننده است."
  example_title: "Formal 1"
- text: "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت [MASK] شد."
  example_title: "Formal 2"
---

# FaBERT: Pre-training BERT on Persian Blogs

## Model Details

FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which covers both casual and formal Persian text. In evaluations across various Natural Language Understanding (NLU) tasks, FaBERT consistently shows notable improvements over comparable Persian and multilingual models while keeping a compact model size. The model is available on Hugging Face and can be integrated into existing `transformers` workflows without additional setup.

## Features

- Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
- Strong performance across various downstream NLP tasks
- BERT architecture with 124 million parameters

## Useful Links

- **Repository:** [FaBERT on GitHub](https://github.com/SBU-NLP-LAB/FaBERT)
- **Paper:** [arXiv preprint](https://arxiv.org/abs/2402.06617)

## Usage

### Loading the Model with MLM head

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # make sure to use the default fast tokenizer
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
```

### Downstream Tasks

Similar to the original English BERT, FaBERT can be fine-tuned on many downstream tasks using the [standard Hugging Face training workflow](https://huggingface.co/docs/transformers/en/training). Examples on Persian datasets are available in our [GitHub repository](#useful-links); generic sketches are also shown below.

**Make sure to use the default fast tokenizer.**
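As a quick way to exercise the MLM head loaded above, the model can also be queried through the generic `fill-mask` pipeline. This is a minimal sketch rather than an official example from the repository; the sample sentence is taken from the widget examples in this card.

```python
from transformers import pipeline

# Wrap FaBERT's MLM head in the standard fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="sbunlp/fabert")

# Print the top predictions for the [MASK] position.
for prediction in fill_mask("زندگی بدون [MASK] خسته‌کننده است."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

For fine-tuning, the repository linked above contains the authors' examples on Persian datasets. The sketch below is only a generic illustration of sequence classification with the `Trainer` API; the dataset files, label count, and hyperparameters are placeholders, not the settings used in the paper.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
model = AutoModelForSequenceClassification.from_pretrained("sbunlp/fabert", num_labels=2)

# Placeholder data: any dataset with "text" and "label" columns works here.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

# Illustrative hyperparameters, not the values reported in the paper.
args = TrainingArguments(
    output_dir="fabert-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```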
## Training Details

FaBERT was pre-trained with the masked language modeling objective using whole word masking (MLM with WWM), and the resulting perplexity on the validation set was 7.76.

| Hyperparameter | Value |
|-------------------|:--------------:|
| Batch Size | 32 |
| Optimizer | Adam |
| Learning Rate | 6e-5 |
| Weight Decay | 0.01 |
| Total Steps | 18 Million |
| Warmup Steps | 1.8 Million |
| Precision Format | TF32 |

## Evaluation

Here are some key performance results for the FaBERT model:

**Sentiment Analysis**

| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| MirasOpinion | **87.51** | 86.73 | 84.92 |
| MirasIrony | 74.82 | 71.08 | **75.51** |
| DeepSentiPers | **79.85** | 74.94 | 79.00 |

**Named Entity Recognition**

| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| PEYMA | **91.39** | 91.24 | 90.91 |
| ParsTwiner | **82.22** | 81.13 | 79.50 |
| MultiCoNER v2 | 57.92 | **58.09** | 51.47 |

**Question Answering**

| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| ParsiNLU | **55.87** | 44.89 | 42.55 |
| PQuAD | 87.34 | 86.89 | **87.60** |
| PCoQA | **53.51** | 50.96 | 51.12 |

**Natural Language Inference & QQP**

| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| FarsTail | **84.45** | 82.52 | 83.50 |
| SBU-NLI | **66.65** | 58.41 | 58.85 |
| ParsiNLU QQP | **82.62** | 77.60 | 79.74 |

**Number of Parameters**

| | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| Parameter Count (M) | 124 | 162 | 278 |
| Vocabulary Size (K) | 50 | 100 | 250 |

For a more detailed performance analysis, refer to the paper.

## How to Cite

If you use FaBERT in your research or projects, please cite it using the following BibTeX:

```bibtex
@article{masumi2024fabert,
  title={FaBERT: Pre-training BERT on Persian Blogs},
  author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid},
  journal={arXiv preprint arXiv:2402.06617},
  year={2024}
}
```