M98M committed on
Commit 6aa85d9
1 Parent(s): f25adcb

Update README.md

Files changed (1):
  1. README.md +102 -1
README.md CHANGED
@@ -23,4 +23,105 @@ widget:
    example_title: "Formal 1"
  - text: "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت [MASK] شد."
    example_title: "Formal 2"
---

# FaBERT: Pre-training BERT on Persian Blogs

## Model Details

FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which encompasses both casual and formal Persian texts. Developed for natural language processing tasks, FaBERT is a robust solution for processing Persian text. In evaluations across various Natural Language Understanding (NLU) tasks, FaBERT consistently demonstrates notable improvements while remaining compact. Because the model is available on Hugging Face, integrating it into your projects is hassle-free: you get enhanced performance on a variety of NLP tasks without added complexity.

## Features
- Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
- Remarkable performance across various downstream NLP tasks
- BERT architecture with 124 million parameters

## Useful Links
- **Repository:** [FaBERT on GitHub](https://github.com/SBU-NLP-LAB/FaBERT)
- **Paper:** [arXiv preprint](https://arxiv.org/abs/2402.06617)

## Usage

### Loading the Model with the MLM Head

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # make sure to use the default fast tokenizer
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
```
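
For a quick sanity check of the MLM head, the `fill-mask` pipeline can be used directly. The sentence below is one of the widget examples from this model card; `top_k=5` is an arbitrary choice:

```python
from transformers import pipeline

# Build a fill-mask pipeline on top of the FaBERT checkpoint.
fill_mask = pipeline("fill-mask", model="sbunlp/fabert")

# Widget example from this card; [MASK] marks the token to predict.
sentence = "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت [MASK] شد."

for prediction in fill_mask(sentence, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 4))
```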

### Downstream Tasks

Similar to the original English BERT, FaBERT can be [fine-tuned](https://huggingface.co/docs/transformers/en/training) on many downstream tasks; a sketch of a typical setup is shown below.

Examples on Persian datasets are available in our [GitHub repository](#useful-links).

**Make sure to use the default fast tokenizer.**
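
As a rough illustration only (not taken from the repository), a sequence-classification fine-tuning run with the `Trainer` API might look like the following sketch. The CSV file names, column names, label count, and hyperparameters are placeholders to replace with your own Persian dataset:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Default fast tokenizer, as recommended above.
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
model = AutoModelForSequenceClassification.from_pretrained("sbunlp/fabert", num_labels=2)

# Placeholder dataset: CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="fabert-finetuned",
    learning_rate=2e-5,              # placeholder hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,             # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
print(trainer.evaluate())
```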

## Training Details

FaBERT was pre-trained with the MLM (whole-word masking) objective, and the resulting perplexity on the validation set was 7.76.

| Hyperparameter | Value |
|-------------------|:--------------:|
| Batch Size | 32 |
| Optimizer | Adam |
| Learning Rate | 6e-5 |
| Weight Decay | 0.01 |
| Total Steps | 18 Million |
| Warmup Steps | 1.8 Million |
| Precision Format | TF32 |
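
For reference, MLM perplexity of this kind is the exponential of the average cross-entropy loss over masked tokens on held-out text. The sketch below illustrates that computation on a single sentence; it uses simple token-level masking rather than the whole-word masking used for FaBERT, and the sentence (the widget example with an illustrative completion) merely stands in for real validation data:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
model.eval()

# Illustrative sentence only (roughly: "In the initial ruling the company was
# permitted to operate, but after review its licence was revoked.").
text = "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت لغو شد."
encoded = tokenizer(text, return_tensors="pt")
input_ids = encoded["input_ids"]
labels = input_ids.clone()

# Randomly mask about 15% of the non-special tokens (at least one).
special = torch.tensor(
    tokenizer.get_special_tokens_mask(input_ids[0].tolist(), already_has_special_tokens=True),
    dtype=torch.bool,
)
candidates = (~special).nonzero(as_tuple=True)[0]
num_masked = max(1, int(0.15 * candidates.numel()))
picked = candidates[torch.randperm(candidates.numel())[:num_masked]]
input_ids[0, picked] = tokenizer.mask_token_id

# The MLM loss is computed only on masked positions; -100 labels are ignored.
masked = torch.zeros_like(labels, dtype=torch.bool)
masked[0, picked] = True
labels[~masked] = -100

with torch.no_grad():
    loss = model(input_ids=input_ids, attention_mask=encoded["attention_mask"], labels=labels).loss

print("perplexity:", torch.exp(loss).item())
```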

## Evaluation

Here are some key performance results for the FaBERT model:

**Sentiment Analysis**
| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| MirasOpinion | **87.51** | 86.73 | 84.92 |
| MirasIrony | 74.82 | 71.08 | **75.51** |
| DeepSentiPers | **79.85** | 74.94 | 79.00 |

**Named Entity Recognition**
| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| PEYMA | **91.39** | 91.24 | 90.91 |
| ParsTwiner | **82.22** | 81.13 | 79.50 |
| MultiCoNER v2 | 57.92 | **58.09** | 51.47 |

**Question Answering**
| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| ParsiNLU | **55.87** | 44.89 | 42.55 |
| PQuAD | 87.34 | 86.89 | **87.60** |
| PCoQA | **53.51** | 50.96 | 51.12 |

**Natural Language Inference & QQP**
| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| FarsTail | **84.45** | 82.52 | 83.50 |
| SBU-NLI | **66.65** | 58.41 | 58.85 |
| ParsiNLU QQP | **82.62** | 77.60 | 79.74 |

**Number of Parameters**
| | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| Parameter Count (M) | 124 | 162 | 278 |
| Vocabulary Size (K) | 50 | 100 | 250 |

For a more detailed performance analysis, refer to the paper.

## How to Cite

If you use FaBERT in your research or projects, please cite it using the following BibTeX:

```bibtex
@article{masumi2024fabert,
  title={FaBERT: Pre-training BERT on Persian Blogs},
  author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid},
  journal={arXiv preprint arXiv:2402.06617},
  year={2024}
}
```