---
license: mit
language:
- ar
pipeline_tag: text-generation
tags:
- arabic
- text-generation
---
## Model Description

- **Model Name:** ArabicGPT-S
- **Architecture:** GPT-2
- **Layers:** 12
- **Model Size:** 134M parameters
- **Context Window Size:** 768 tokens

ArabianGPT is a custom-trained version of the GPT-2 base model, specifically tailored for the Arabic language. It is designed to understand and generate Arabic text, making it suitable for various natural language processing tasks in Arabic.
## Training

- **Dataset:** Abu Elkhiar Corpus
- **Size:** 15.5 GB
- **Number of Words:** 237,814,541
- **Number of Tokens:** 1,752,421,071
- **Epochs:** 5.87
- **Loss:** 3.97

The model was trained on the Abu Elkhiar dataset, a comprehensive Arabic text corpus encompassing a wide range of topics. The training process focused on adapting the model to understand the nuances and complexities of the Arabic language.
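As a quick sanity check, the statistics above can be combined into two derived figures: the token and word counts imply roughly 7.4 tokens per word, and 5.87 epochs over ~1.75B tokens means the model processed on the order of 10B tokens during training:

```python
# Derived figures from the training statistics listed above.
words = 237_814_541
tokens = 1_752_421_071
epochs = 5.87

tokens_per_word = tokens / words   # subword fertility over the corpus
tokens_seen = epochs * tokens      # total tokens processed during training

print(f"tokens per word: {tokens_per_word:.2f}")  # ~7.37
print(f"tokens seen:     {tokens_seen:,.0f}")     # ~10.3 billion
```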
## Tokenizer

- **Type:** Custom-trained SentencePiece tokenizer (AraNizer)
- **Vocabulary Size:** 64K

We employed AraNizer, a custom-trained tokenizer based on the SentencePiece model, with a vocabulary size of 64K. This choice was made to optimize the model's performance for the specific characteristics of the Arabic language.
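The enlarged vocabulary is also consistent with the stated model size. A rough back-of-the-envelope estimate, assuming the standard GPT-2 base figures (768-dimensional embeddings, a 50,257-token vocabulary, ~124.4M parameters, tied input/output embeddings) and taking "64K" as 64,000 entries, shows that the larger vocabulary adds roughly 10M embedding parameters:

```python
# Back-of-the-envelope parameter estimate. The GPT-2 base figures below are
# assumed reference values, not taken from this model card.
hidden_size = 768          # GPT-2 base embedding dimension
gpt2_vocab = 50_257        # GPT-2 base vocabulary size
gpt2_params = 124_439_808  # GPT-2 base total parameter count

# Extra rows in the (tied) embedding matrix from growing the vocab to 64K.
extra_embeddings = (64_000 - gpt2_vocab) * hidden_size
estimated_total = gpt2_params + extra_embeddings

print(f"extra embedding params: {extra_embeddings:,}")  # 10,554,624
print(f"estimated total:        {estimated_total:,}")   # ~135M, near the stated 134M
```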
## Usage

ArabianGPT can be used for Arabic text generation.
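A minimal generation sketch with the Hugging Face `transformers` library is shown below. Note that `MODEL_ID` is a placeholder (the card does not state the Hub repository name) and the sampling settings are illustrative defaults, not values from the card:

```python
# Sketch: text generation with this model via Hugging Face transformers.
# MODEL_ID is a placeholder -- replace it with the model's actual Hub repo id.
MODEL_ID = "ArabianGPT-S"

# Illustrative sampling settings; max_new_tokens must leave room for the
# prompt inside the model's 768-token context window.
GENERATION_KWARGS = {
    "max_new_tokens": 64,
    "do_sample": True,
    "top_p": 0.95,
    "temperature": 0.8,
}

def generate(prompt: str) -> str:
    """Generate a continuation of an Arabic prompt (downloads the model)."""
    from transformers import pipeline  # requires `pip install transformers`
    generator = pipeline("text-generation", model=MODEL_ID)
    return generator(prompt, **GENERATION_KWARGS)[0]["generated_text"]

if __name__ == "__main__":
    print(generate("اللغة العربية"))
```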
## Limitations

As with any language model, ArabianGPT may have limitations in understanding context or generating text in certain scenarios. Users should be aware of these limitations and use the model accordingly.
## Ethical Considerations

We emphasize responsible usage of ArabianGPT. Users should ensure that the generated text is used ethically and does not propagate misinformation or harmful content.
## Citation

If you use ArabianGPT in your research or application, please cite it as follows:

```bibtex
@misc{ArabianGPT2023,
  title={ArabianGPT: A GPT-2 Based Language Model for Arabic},
  author={Najar, Omar and Sibaee, Serry and Ghouti, Lahouari and Koubaa, Anis},
  affiliation={Prince Sultan University, Riyadh, Saudi Arabia},
  year={2023}
}
```
## Acknowledgments

We thank Prince Sultan University, especially the Robotics and Internet of Things Lab, for their support.
## Contact

For inquiries regarding ArabicGPT-S, please contact onajar@psu.edu.sa.