faisalq committed on
Commit
84d397a
1 Parent(s): fe53213

Update README.md

Files changed (1)
  1. README.md +34 -0
README.md CHANGED
@@ -1,3 +1,37 @@
  ---
  license: cc-by-nc-4.0
+ language:
+ - ar
  ---
+
+ ---
+ language:
+ - ar
+ tags:
+ - Arabic BERT
+ - Saudi Dialect
+ - Twitter
+ - Masked Language Model
+ widget:
+ - text: "اللي ما يعرف الصقر [MASK]."
+
+ ---
+
+
+ **SaudiBERT** is the first pretrained large language model focused exclusively on Saudi dialect text. The model was pretrained on two large-scale corpora: the Saudi Tweets Mega Corpus (STMC), which contains more than 141 million tweets, and the Saudi Forum Corpus, which includes more than 70 million sentences collected from various Saudi online forums. Together, the two corpora comprise **26.3 GB of text**. The code and results are available in the [GitHub repository](https://github.com/FaisalQarah/SaudiBERT).
+
+
+
+ # BibTeX
+
+ If you use the SaudiBERT model in a scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (citation details to be updated):
+ ```bibtex
+ @article{qarah2024saudibert,
+ title={SaudiBERT: A Large Language Model Pretrained on Saudi Dialect Corpora},
+ author={Qarah, Faisal},
+ journal={arXiv preprint arXiv:2405.06239},
+ year={2024}
+ }
+
+
+ ```
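
The `widget` entry and the `Masked Language Model` tag in the metadata suggest the model is meant to be queried by masked-token prediction. The sketch below shows one way that might look with the Hugging Face `transformers` fill-mask pipeline. The Hub id `faisalq/SaudiBERT` is an assumption (this commit does not state it), and `load_fill_mask`, `top_predictions`, and `main` are hypothetical helper names introduced only for illustration.

```python
from typing import Callable, List, Tuple

# Assumption: the Hub model id is not given in this commit; adjust as needed.
MODEL_ID = "faisalq/SaudiBERT"
# Example sentence taken from the `widget` entry in the README metadata.
WIDGET_TEXT = "اللي ما يعرف الصقر [MASK]."


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline (needs `transformers` and network access)."""
    # Imported lazily so the helper below can be used without the dependency.
    from transformers import pipeline
    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask: Callable, text: str, k: int = 5) -> List[Tuple[str, float]]:
    """Return up to k (token, score) pairs for the single [MASK] in `text`."""
    if text.count("[MASK]") != 1:
        raise ValueError("expected exactly one [MASK] token")
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], float(r["score"])) for r in results]


def main() -> None:
    """Print the top candidates for the widget sentence."""
    for token, score in top_predictions(load_fill_mask(), WIDGET_TEXT):
        print(f"{score:.3f}\t{token}")
```

Calling `main()` downloads the model weights on first use. `top_predictions` only depends on the pipeline's call convention, so any callable returning dictionaries with `token_str` and `score` keys can stand in for it during testing.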