---
license: mit
language:
- ar
pipeline_tag: text-generation
tags:
- arabic
- text-generation
---
# Model Description

* Model Name: ArabianGPT
* Architecture: GPT-2
* Layers: 12
* Model Size: 134M
* Context Window Size: 768

> [!NOTE]
> ArabianGPT is a custom-trained version of the GPT-2 base model, specifically tailored for the Arabic language. It is designed to understand and generate Arabic text, making it suitable for various natural language processing tasks in Arabic.

# Training
* Dataset: Abu Elkhiar Corpus
* Size: 15.5 GB
* Number of Words: 237,814,541
* Number of Tokens: 1,752,421,071
* Number of Parameters: 134M
* Steps: 337,500
* Loss: 3.97

> [!NOTE]
> The model was trained on the Abu Elkhiar dataset, a comprehensive Arabic text corpus encompassing a wide range of topics. The training process focused on adapting the model to understand the nuances and complexities of the Arabic language.
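The figures above can be cross-checked with a back-of-envelope parameter count for a GPT-2-style architecture. This sketch assumes the 64K vocabulary is exactly 64,000 tokens and that the positional-embedding table has one row per context position; the card only gives rounded figures, so these are assumptions.

```python
# Back-of-envelope parameter count for a GPT-2-style model with the
# configuration stated in this card (12 layers, hidden size 768).
d = 768         # hidden size (GPT-2 base)
n_layer = 12    # transformer layers
vocab = 64_000  # AraNizer vocabulary size (assumed exactly 64,000)
n_ctx = 768     # context window size from the card

token_emb = vocab * d  # token embeddings (tied with the LM head in GPT-2)
pos_emb = n_ctx * d    # learned positional embeddings

# Per layer: attention (c_attn d->3d, c_proj d->d), MLP (d->4d->d),
# and two LayerNorms, all with biases.
attn = (d * 3 * d + 3 * d) + (d * d + d)
mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
ln = 2 * (2 * d)
per_layer = attn + mlp + ln

final_ln = 2 * d
total = token_emb + pos_emb + n_layer * per_layer + final_ln
print(f"{total / 1e6:.1f}M parameters")  # ~134.8M, matching the stated 134M
```

The total comes out at roughly 134.8M, consistent with the 134M reported above.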

# Tokenizer
* Type: Custom-trained SentencePiece tokenizer
* Vocabulary Size: 64K

> [!NOTE]
> We employed AraNizer, a custom-trained tokenizer based on the SentencePiece model, with a vocabulary size of 64K. This choice was made to optimize the model's performance for the specific characteristics of the Arabic language.

More information about AraNizer is available [here](https://github.com/omarnj-lab/aranizer/tree/main).
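For a rough sense of the tokenizer's granularity, the corpus statistics in the Training section imply an average tokens-per-word ratio. This is purely illustrative arithmetic on the figures reported in this card:

```python
# Average tokens per word implied by the training-corpus statistics above.
n_words = 237_814_541
n_tokens = 1_752_421_071

tokens_per_word = n_tokens / n_words
print(f"{tokens_per_word:.2f} tokens per word")  # ~7.37
```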


# Usage
ArabianGPT can be used for text generation tasks in Arabic.

### How to use

Here is how to use this model to generate Arabic text with the Transformers `pipeline`:

```python
from transformers import pipeline

# max_new_tokens caps the length of the generated continuation.
pipe = pipeline("text-generation", model="riotu-lab/ArabianGPT-base", max_new_tokens=512)

text = ''  # replace with an Arabic prompt

print(pipe(text))
```

# Limitations
> [!WARNING]
> As with any language model, ArabianGPT may have limitations in understanding context or generating text in certain scenarios. Users should be aware of these limitations and use the model accordingly.

# Ethical Considerations
We emphasize responsible usage of ArabianGPT. Users should ensure that the generated text is used ethically and does not propagate misinformation or harmful content.

# Citation

If you use ArabianGPT in your research or application, please cite it as follows:

```
@misc{ArabianGPT2023,
  title={ArabianGPT: A GPT-2 Based Language Model for Arabic},
  author={Najar, Omar and Sibaee, Serry and Ghouti, Lahouari and Koubaa, Anis},
  affiliation={Prince Sultan University, Riyadh, Saudi Arabia},
  year={2023},
}
```
# Acknowledgments
> We thank Prince Sultan University, especially the Robotics and Internet of Things Lab, for their support.

# Contact
For inquiries regarding ArabianGPT, please contact onajar@psu.edu.sa.