---
license: cc-by-nc-4.0
language: 
  - ar
tags:
  - Arabic BERT
  - Saudi Dialect
  - Twitter
  - Masked Language Model
widget:
  - text: "اللي ما يعرف الصقر [MASK]."

---

**SaudiBERT** is the first pretrained large language model focused exclusively on Saudi dialect text. The model was pretrained on two large-scale corpora: the Saudi Tweets Mega Corpus (STMC), which contains over 141 million tweets, and the Saudi Forum Corpus, which includes over 70 million sentences collected from various Saudi online forums. Together, the datasets comprise **26.3GB of text**. The code and results are available in the [SaudiBERT repository](https://github.com/FaisalQarah/SaudiBERT).
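
Below is a minimal sketch of loading the model for masked-token prediction with the Hugging Face `transformers` fill-mask pipeline. The model identifier `faisalq/SaudiBERT` is an assumption; replace it with the actual repository id of this model card if it differs.

```python
# Minimal usage sketch with the Hugging Face transformers library.
# NOTE: the model id "faisalq/SaudiBERT" is assumed; adjust it to the real repo id.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="faisalq/SaudiBERT")

# Predict the masked token in the Saudi dialect example used in the widget above.
predictions = fill_mask("اللي ما يعرف الصقر [MASK].")
for p in predictions:
    # Each prediction includes the candidate token and its probability score.
    print(p["token_str"], round(p["score"], 4))
```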



# BibTeX

If you use the SaudiBERT model in your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows:
```bibtex
@article{qarah2024saudibert,
  title={SaudiBERT: A Large Language Model Pretrained on Saudi Dialect Corpora},
  author={Qarah, Faisal},
  journal={arXiv preprint arXiv:2405.06239},
  year={2024}
}
```