---
language:
- fa
library_name: transformers
widget:
  - text: "ز سوزناکی گفتار من [MASK] بگریست"
    example_title: "Poetry 1"
  - text: "نظر از تو برنگیرم همه [MASK] تا بمیرم که تو در دلم نشستی و سر مقام داری"
    example_title: "Poetry 2"
  - text: "هر ساعتم اندرون بجوشد [MASK] را وآگاهی نیست مردم بیرون را"
    example_title: "Poetry 3"
  - text: "غلام همت آن رند عافیت سوزم که در گدا صفتی [MASK] داند"
    example_title: "Poetry 4"
  - text: "این [MASK] اولشه."
    example_title: "Informal 1"
  - text: "دیگه خسته شدم! [MASK] اینم شد کار؟!"
    example_title: "Informal 2"
  - text: "فکر نکنم به موقع برسیم. بهتره [MASK] این یکی بشیم."
    example_title: "Informal 3"
  - text: "تا صبح بیدار موندم و داشتم برای [MASK] آماده می شدم."
    example_title: "Informal 4"
  - text: "زندگی بدون [MASK] خسته‌کننده است."
    example_title: "Formal 1"
  - text: "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت [MASK] شد."
    example_title: "Formal 2"
---


# FaBERT: Pre-training BERT on Persian Blogs

## Model Details

FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which covers both casual and formal Persian text. Across a range of Natural Language Understanding (NLU) tasks, FaBERT consistently delivers notable improvements while keeping a compact model size. The model is available on Hugging Face, so it can be plugged into existing Transformers workflows without extra setup.

## Features
- Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
- Remarkable performance across various downstream NLP tasks
- BERT architecture with 124 million parameters

## Useful Links
- **Repository:** [FaBERT on Github](https://github.com/SBU-NLP-LAB/FaBERT)
- **Paper:** [arXiv preprint](https://arxiv.org/abs/2402.06617)

## Usage

### Loading the Model with MLM head

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert") # make sure to use the default fast tokenizer
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
```
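
As a quick check that the MLM head works as expected, the `fill-mask` pipeline can be pointed at the same checkpoint. This is a minimal sketch; the input sentence is one of the widget examples above.

```python
from transformers import pipeline

# Fill-mask pipeline built on the checkpoint loaded above.
fill_mask = pipeline("fill-mask", model="sbunlp/fabert")
print(fill_mask("زندگی بدون [MASK] خسته‌کننده است."))
```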
### Downstream Tasks

Similar to the original English BERT, FaBERT can be [fine-tuned](https://huggingface.co/docs/transformers/en/training) on many downstream tasks.

Examples on Persian datasets are available in our [GitHub repository](#useful-links).

**Make sure to use the default fast tokenizer.**
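
As a rough illustration, the sketch below fine-tunes FaBERT for sequence classification with the `Trainer` API. The dataset variables (`train_ds`, `eval_ds`) and the hyperparameters are placeholders for your own Persian classification data, not the settings used in the paper.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # default fast tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "sbunlp/fabert", num_labels=3
)  # e.g. 3 sentiment classes; adjust to your task

def tokenize(batch):
    # Truncate to a fixed length; padding is handled by the default data collator.
    return tokenizer(batch["text"], truncation=True, max_length=128)

args = TrainingArguments(
    output_dir="fabert-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

# train_ds = your_dataset["train"].map(tokenize, batched=True)  # placeholder dataset
# eval_ds = your_dataset["test"].map(tokenize, batched=True)
# Trainer(model=model, args=args, train_dataset=train_ds,
#         eval_dataset=eval_ds, tokenizer=tokenizer).train()
```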

## Training Details

FaBERT was pre-trained with the masked language modeling (MLM) objective using whole word masking (WWM), reaching a perplexity of 7.76 on the validation set.

| Hyperparameter    | Value        |
|-------------------|:--------------:|
| Batch Size        | 32           |
| Optimizer         | Adam         |
| Learning Rate     | 6e-5         |
| Weight Decay      | 0.01         |
| Total Steps       | 18 Million    |
| Warmup Steps      | 1.8 Million   |
| Precision Format  | TF32          |
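
For reference, MLM perplexity is simply the exponential of the mean cross-entropy loss over masked positions. The sketch below illustrates that computation on a single hand-masked sentence with a hypothetical target token, not the actual validation set.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")

text = "زندگی بدون [MASK] خسته‌کننده است."
inputs = tokenizer(text, return_tensors="pt")

# Compute the loss only over masked positions (-100 is ignored by the loss).
labels = torch.full_like(inputs["input_ids"], -100)
mask_positions = inputs["input_ids"] == tokenizer.mask_token_id
labels[mask_positions] = tokenizer.convert_tokens_to_ids("عشق")  # hypothetical target word

with torch.no_grad():
    loss = model(**inputs, labels=labels).loss
print(torch.exp(loss))  # perplexity = exp(cross-entropy over masked tokens)
```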

## Evaluation

Here are some key performance results for the FaBERT model:

**Sentiment Analysis**
| Task         | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| MirasOpinion | **87.51**      | 86.73     | 84.92  |
| MirasIrony | 74.82      | 71.08     | **75.51**  |
| DeepSentiPers | **79.85**      | 74.94     | 79.00  |

**Named Entity Recognition**
| Task         | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| PEYMA        |   **91.39**    |   91.24   | 90.91  |
| ParsTwiner   |   **82.22**    |  81.13   | 79.50  |
| MultiCoNER v2   |   57.92    |   **58.09**   | 51.47  |

**Question Answering**
| Task         | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| ParsiNLU | **55.87**      | 44.89     | 42.55  |
| PQuAD  | 87.34      | 86.89     | **87.60**  |
| PCoQA  | **53.51**      | 50.96     | 51.12  |

**Natural Language Inference & QQP**
| Task         | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| FarsTail | **84.45**      | 82.52     | 83.50  |
| SBU-NLI | **66.65**      | 58.41     | 58.85  |
| ParsiNLU QQP | **82.62**      | 77.60     | 79.74  |

**Number of Parameters**
|          | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| Parameter Count (M) | 124      | 162     | 278  |
| Vocabulary Size (K) | 50      | 100     | 250  |
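
The figures above can be checked directly against the checkpoint; a small sketch for FaBERT (counting the base encoder without task heads):

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("sbunlp/fabert")
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")

print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")  # ~124M
print(f"{len(tokenizer) / 1e3:.0f}K vocabulary entries")                      # ~50K
```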

For a more detailed performance analysis, refer to the paper.

## How to Cite

If you use FaBERT in your research or projects, please cite it using the following BibTeX:

```bibtex
@article{masumi2024fabert,
  title={FaBERT: Pre-training BERT on Persian Blogs},
  author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid},
  journal={arXiv preprint arXiv:2402.06617},
  year={2024}
}
```