faisalq/SaudiBERT

 ---
 license: cc-by-nc-4.0
+language:
+- ar
 ---
+---
+language:
+  - ar
+tags:
+  - Arabic BERT
+  - Saudi Dialect
+  - Twitter
+  - Masked Language Model
+widget:
+  - text: "اللي ما يعرف الصقر [MASK]."
+---
+**SaudiBERT** is the first pretrained large language model focused exclusively on Saudi dialect text. The model was pretrained on two large-scale corpora: the Saudi Tweets Mega Corpus (STMC), which contains more than 141 million tweets, and the Saudi Forum Corpus, which includes more than 70 million sentences collected from various Saudi online forums. Together, the two corpora comprise **26.3 GB of text**. The code and results are available in the [GitHub repository](https://github.com/FaisalQarah/SaudiBERT).
+# BibTeX
+If you use the SaudiBERT model in a scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (citation details to be updated):
+```bibtex
+@article{qarah2024saudibert,
+  title={SaudiBERT: A Large Language Model Pretrained on Saudi Dialect Corpora},
+  author={Qarah, Faisal},
+  journal={arXiv preprint arXiv:2405.06239},
+  year={2024}
+}
+```
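
As a rough illustration of the masked-language-model usage implied by the widget example above, the following is a minimal sketch built on the Hugging Face `transformers` fill-mask pipeline. It assumes the checkpoint is published on the Hub as `faisalq/SaudiBERT` (taken from the page header) and that its tokenizer uses `[MASK]` as the mask token, as the widget text suggests. The helper name `top_predictions` is hypothetical and not part of the model card.

```python
# Assumed Hub ID, taken from the page header; adjust if the checkpoint lives elsewhere.
MODEL_ID = "faisalq/SaudiBERT"

# Widget sentence from the model card, containing a single [MASK] token.
EXAMPLE = "اللي ما يعرف الصقر [MASK]."


def top_predictions(text, k=5, model_id=MODEL_ID):
    """Return (token, score) pairs for the k most likely fillers of the mask in `text`.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    # Downloads the checkpoint on first use (requires network access).
    fill_mask = pipeline("fill-mask", model=model_id)

    # For a single mask, the pipeline returns a list of dicts that include
    # "token_str" (the predicted token) and "score" (its probability).
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=k)]


if __name__ == "__main__":
    for token, score in top_predictions(EXAMPLE):
        print(f"{token}\t{score:.4f}")
```

Running the script prints one candidate token per line with its score. Note that the card lists the license as cc-by-nc-4.0, so this kind of use is limited to non-commercial settings.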