---
language:
- fa
library_name: transformers
widget:
- text: "ز سوزناکی گفتار من [MASK] بگریست"
example_title: "Poetry 1"
- text: "نظر از تو برنگیرم همه [MASK] تا بمیرم که تو در دلم نشستی و سر مقام داری"
example_title: "Poetry 2"
- text: "هر ساعتم اندرون بجوشد [MASK] را وآگاهی نیست مردم بیرون را"
example_title: "Poetry 3"
- text: "غلام همت آن رند عافیت سوزم که در گدا صفتی [MASK] داند"
example_title: "Poetry 4"
- text: "این [MASK] اولشه."
example_title: "Informal 1"
- text: "دیگه خسته شدم! [MASK] اینم شد کار؟!"
example_title: "Informal 2"
- text: "فکر نکنم به موقع برسیم. بهتره [MASK] این یکی بشیم."
example_title: "Informal 3"
- text: "تا صبح بیدار موندم و داشتم برای [MASK] آماده می شدم."
example_title: "Informal 4"
- text: "زندگی بدون [MASK] خستهکننده است."
example_title: "Formal 1"
- text: "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت [MASK] شد."
example_title: "Formal 2"
---
# FaBERT: Pre-training BERT on Persian Blogs
## Model Details
FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which spans both informal and formal Persian text. Across a range of Natural Language Understanding (NLU) tasks, FaBERT delivers consistent improvements over comparable models while keeping a compact model size. The model is available on Hugging Face, so it can be used with the standard `transformers` APIs without extra setup.
## Features
- Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
- Strong performance across a range of downstream NLP tasks
- BERT architecture with 124 million parameters
## Useful Links
- **Repository:** [FaBERT on Github](https://github.com/SBU-NLP-LAB/FaBERT)
- **Paper:** [arXiv preprint](https://arxiv.org/abs/2402.06617)
## Usage
### Loading the Model with MLM head
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert") # make sure to use the default fast tokenizer
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
```
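For quick masked-token inference, the model can also be used through the `fill-mask` pipeline. The snippet below is a minimal sketch that reuses one of the widget examples above; exact predictions and scores may vary with your `transformers` version.
```python
from transformers import pipeline

# Minimal inference sketch: predict the [MASK] token with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="sbunlp/fabert")

# One of the informal widget examples from this model card.
for prediction in fill_mask("این [MASK] اولشه."):
    print(prediction["token_str"], round(prediction["score"], 4))
```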
### Downstream Tasks
Like the original English BERT, FaBERT can be fine-tuned on many downstream tasks; see the [Hugging Face fine-tuning guide](https://huggingface.co/docs/transformers/en/training) for the general workflow.
Examples on Persian datasets are available in our [GitHub repository](#useful-links).
**Make sure to use the default fast tokenizer.**
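As a rough illustration of that workflow, the sketch below fine-tunes FaBERT for binary text classification with the `Trainer` API. The CSV file names, column names (`text`, `label`), and hyperparameters are placeholders, not the settings used in the paper.
```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Placeholder dataset: any Persian classification data with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # default fast tokenizer
model = AutoModelForSequenceClassification.from_pretrained("sbunlp/fabert", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

# Illustrative hyperparameters only; tune them for your task.
training_args = TrainingArguments(
    output_dir="fabert-finetuned",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```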
## Training Details
FaBERT was pre-trained with the masked language modeling objective using whole word masking (MLM with WWM), and the resulting perplexity on the validation set was 7.76.
| Hyperparameter | Value |
|-------------------|:--------------:|
| Batch Size | 32 |
| Optimizer | Adam |
| Learning Rate | 6e-5 |
| Weight Decay | 0.01 |
| Total Steps | 18 Million |
| Warmup Steps | 1.8 Million |
| Precision Format | TF32 |
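For reference, the snippet below illustrates the whole word masking objective with the `DataCollatorForWholeWordMask` utility from `transformers`. It is only a sketch of the objective, not the authors' pre-training script, and the example sentence is arbitrary.
```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# Arbitrary Persian sentence; WWM masks all WordPiece sub-tokens of a chosen word together.
encoding = tokenizer("زبان فارسی زبان زیبایی است.")
batch = collator([encoding])

print(batch["input_ids"])  # some whole words replaced by [MASK]
print(batch["labels"])     # original token ids at masked positions, -100 elsewhere
```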
## Evaluation
Here are some key performance results for the FaBERT model:
**Sentiment Analysis**
| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| MirasOpinion | **87.51** | 86.73 | 84.92 |
| MirasIrony | 74.82 | 71.08 | **75.51** |
| DeepSentiPers | **79.85** | 74.94 | 79.00 |
**Named Entity Recognition**
| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| PEYMA | **91.39** | 91.24 | 90.91 |
| ParsTwiner | **82.22** | 81.13 | 79.50 |
| MultiCoNER v2 | 57.92 | **58.09** | 51.47 |
**Question Answering**
| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| ParsiNLU | **55.87** | 44.89 | 42.55 |
| PQuAD | 87.34 | 86.89 | **87.60** |
| PCoQA | **53.51** | 50.96 | 51.12 |
**Natural Language Inference & QQP**
| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| FarsTail | **84.45** | 82.52 | 83.50 |
| SBU-NLI | **66.65** | 58.41 | 58.85 |
| ParsiNLU QQP | **82.62** | 77.60 | 79.74 |
**Number of Parameters**
| | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| Parameter Count (M) | 124 | 162 | 278 |
| Vocabulary Size (K) | 50 | 100 | 250 |
For a more detailed performance analysis, refer to the paper.
## How to Cite
If you use FaBERT in your research or projects, please cite it using the following BibTeX:
```bibtex
@article{masumi2024fabert,
title={FaBERT: Pre-training BERT on Persian Blogs},
author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid},
journal={arXiv preprint arXiv:2402.06617},
year={2024}
}
```