widget:
    example_title: "Formal 1"
  - text: "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت [MASK] شد."
    example_title: "Formal 2"
---

# FaBERT: Pre-training BERT on Persian Blogs

## Model Details

FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which spans both casual and formal Persian text. Across a range of Natural Language Understanding (NLU) tasks, FaBERT consistently delivers notable improvements while keeping a compact model size, making it a robust choice for processing Persian text. The model is available on Hugging Face, so it can be integrated into projects with minimal effort.

## Features

- Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
- Strong performance across various downstream NLP tasks
- BERT-base architecture with 124 million parameters

## Useful Links

- **Repository:** [FaBERT on GitHub](https://github.com/SBU-NLP-LAB/FaBERT)
- **Paper:** [arXiv preprint](https://arxiv.org/abs/2402.06617)

## Usage

### Loading the Model with the MLM Head

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Make sure to use the default fast tokenizer.
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
```
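
Under the hood, the MLM head produces one logit per vocabulary entry at each `[MASK]` position, and the top-scoring token is the model's fill-in. A minimal, self-contained sketch of that decoding step, using made-up logits and a hypothetical five-word vocabulary (not real model output):

```python
import math

# Hypothetical candidate tokens and dummy logits for one [MASK] position.
vocab = ["لغو", "صادر", "تمدید", "باطل", "تایید"]
logits = [3.1, 1.2, 0.4, 2.2, 0.7]

# Softmax turns the logits into a probability distribution over the vocabulary.
exps = [math.exp(l) for l in logits]
total = sum(exps)
probs = [e / total for e in exps]

# The predicted fill-in is the highest-probability token.
best = max(range(len(vocab)), key=probs.__getitem__)
print(vocab[best], round(probs[best], 3))
```

With a real checkpoint, the same selection is applied to the logits row at the `[MASK]` index of the model output.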

### Downstream Tasks

Similar to the original English BERT, FaBERT can be [fine-tuned](https://huggingface.co/docs/transformers/en/training) on many downstream tasks.

Examples on Persian datasets are available in our [GitHub repository](#useful-links).

**Make sure to use the default fast tokenizer.**

## Training Details

FaBERT was pre-trained with the MLM objective using whole word masking (WWM); the resulting perplexity on the validation set was 7.76.
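
Perplexity is simply the exponential of the per-token cross-entropy loss, so a validation perplexity of 7.76 corresponds to a masked-token loss of roughly 2.05 nats:

```python
import math

val_perplexity = 7.76              # reported validation perplexity
loss = math.log(val_perplexity)    # per-token cross-entropy, in nats
print(round(loss, 3))              # 2.049
```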

| Hyperparameter   |    Value    |
|------------------|:-----------:|
| Batch Size       |     32      |
| Optimizer        |    Adam     |
| Learning Rate    |    6e-5     |
| Weight Decay     |    0.01     |
| Total Steps      | 18 million  |
| Warmup Steps     | 1.8 million |
| Precision Format |    TF32     |
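
The warmup covers the first 10% of training. A minimal sketch of such a schedule; the linear warmup follows the table, while the linear decay afterwards is our assumption (the table only specifies the warmup length):

```python
PEAK_LR = 6e-5
TOTAL_STEPS = 18_000_000
WARMUP_STEPS = 1_800_000  # 10% of total

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero (decay shape assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

print(learning_rate(900_000))    # halfway through warmup: 3e-05
print(learning_rate(1_800_000))  # peak: 6e-05
```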

## Evaluation

Here are some key performance results for the FaBERT model:

**Sentiment Analysis**

| Task          |  FaBERT   | ParsBERT |   XLM-R   |
|:--------------|:---------:|:--------:|:---------:|
| MirasOpinion  | **87.51** |  86.73   |   84.92   |
| MirasIrony    |   74.82   |  71.08   | **75.51** |
| DeepSentiPers | **79.85** |  74.94   |   79.00   |

**Named Entity Recognition**

| Task          |  FaBERT   | ParsBERT  | XLM-R |
|:--------------|:---------:|:---------:|:-----:|
| PEYMA         | **91.39** |   91.24   | 90.91 |
| ParsTwiner    | **82.22** |   81.13   | 79.50 |
| MultiCoNER v2 |   57.92   | **58.09** | 51.47 |

**Question Answering**

| Task     |  FaBERT   | ParsBERT |   XLM-R   |
|:---------|:---------:|:--------:|:---------:|
| ParsiNLU | **55.87** |  44.89   |   42.55   |
| PQuAD    |   87.34   |  86.89   | **87.60** |
| PCoQA    | **53.51** |  50.96   |   51.12   |

**Natural Language Inference & QQP**

| Task         |  FaBERT   | ParsBERT | XLM-R |
|:-------------|:---------:|:--------:|:-----:|
| FarsTail     | **84.45** |  82.52   | 83.50 |
| SBU-NLI      | **66.65** |  58.41   | 58.85 |
| ParsiNLU QQP | **82.62** |  77.60   | 79.74 |

**Number of Parameters**

|                     | FaBERT | ParsBERT | XLM-R |
|:--------------------|:------:|:--------:|:-----:|
| Parameter Count (M) |  124   |   162    |  278  |
| Vocabulary Size (K) |   50   |   100    |  250  |
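
Most of the parameter gap between the three models comes from the token-embedding matrix. Assuming the standard BERT-base hidden size of 768 for all three (the table lists only totals), a back-of-the-envelope check:

```python
HIDDEN = 768  # standard BERT-base hidden size (an assumption for all three models)

for name, vocab_k, total_m in [("FaBERT", 50, 124), ("ParsBERT", 100, 162), ("XLM-R", 250, 278)]:
    emb_m = vocab_k * 1000 * HIDDEN / 1e6  # embedding parameters, in millions
    print(f"{name}: {emb_m:.1f}M embedding params, {total_m - emb_m:.1f}M elsewhere")
```

The non-embedding count comes out near 85M for each model, as expected for base-size 12-layer encoders; the smaller 50K vocabulary is what keeps FaBERT compact.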

For a more detailed performance analysis, refer to the paper.

## How to Cite

If you use FaBERT in your research or projects, please cite it using the following BibTeX:

```bibtex
@article{masumi2024fabert,
  title={FaBERT: Pre-training BERT on Persian Blogs},
  author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid},
  journal={arXiv preprint arXiv:2402.06617},
  year={2024}
}
```