---
license: mit
language:
- ar
pipeline_tag: text-generation
tags:
- arabic
- text-generation
---
# Model Description

* Model Name: ArabianGPT
* Architecture: GPT-2
* Layers: 12
* Model Size: 134M parameters
* Context Window Size: 768 tokens

> [!NOTE]
> ArabianGPT is a custom-trained version of the GPT-2 base model, specifically tailored for the Arabic language. It is designed to understand and generate Arabic text, making it suitable for various natural language processing tasks in Arabic.
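
To sanity-check these hyperparameters against the published checkpoint, you can read them from the model configuration. A minimal sketch, assuming the checkpoint name `riotu-lab/ArabianGPT-base` from the Usage section below and the standard GPT-2 configuration fields:

```python
from transformers import AutoConfig

# Load the configuration shipped with the checkpoint
# (checkpoint name assumed from the Usage section below)
config = AutoConfig.from_pretrained("riotu-lab/ArabianGPT-base")

print(config.n_layer)      # number of transformer layers, expected: 12
print(config.n_positions)  # context window size, expected: 768
print(config.vocab_size)   # vocabulary size, expected: ~64K
```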
# Training

* Dataset: Abu Elkhiar Corpus
* Size: 15.5 GB
* Number of Words: 237,814,541
* Number of Tokens: 1,752,421,071
* Number of Parameters: 134M
* Steps: 337,500
* Loss: 3.97

> [!NOTE]
> The model was trained on the Abu Elkhiar dataset, a comprehensive Arabic text corpus encompassing a wide range of topics. The training process focused on adapting the model to understand the nuances and complexities of the Arabic language.
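
For context, if the reported loss of 3.97 is the final mean cross-entropy in nats per token (the usual convention for GPT-2 training), it corresponds to a perplexity of roughly 53:

```python
import math

# Assuming the reported loss is mean cross-entropy in nats per token,
# perplexity is its exponential
final_loss = 3.97
print(math.exp(final_loss))  # ~52.98
```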
# Tokenizer

* Type: Custom-trained SentencePiece tokenizer
* Vocabulary Size: 64K

> We employed AraNizer, a custom-trained tokenizer based on the SentencePiece model, with a vocabulary size of 64K. This choice was made to optimize the model's performance for the specific characteristics of the Arabic language.

More information about AraNizer can be found in the [AraNizer repository](https://github.com/omarnj-lab/aranizer/tree/main).
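
A short sketch of loading and probing the tokenizer, assuming it is bundled with the `riotu-lab/ArabianGPT-base` checkpoint and loadable through the standard `AutoTokenizer` interface:

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with the checkpoint (name assumed)
tokenizer = AutoTokenizer.from_pretrained("riotu-lab/ArabianGPT-base")

print(len(tokenizer))  # vocabulary size, expected: ~64K

# Tokenize a short Arabic sentence ("Artificial intelligence is the future")
print(tokenizer.tokenize("الذكاء الاصطناعي هو المستقبل"))
```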
# Usage

ArabianGPT can be used for text generation tasks in Arabic.

### How to use

Here is how to use this model to generate Arabic text with the Transformers `pipeline` API:
```python
from transformers import pipeline

# Load the model into a text-generation pipeline
pipe = pipeline("text-generation", model="riotu-lab/ArabianGPT-base", max_new_tokens=512)

text = ""  # your Arabic prompt here

print(pipe(text))
```
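
For finer control over decoding, you can load the model and tokenizer directly and call `generate`. A minimal sketch; the sampling values below are illustrative, not recommendations from the authors:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("riotu-lab/ArabianGPT-base")
model = AutoModelForCausalLM.from_pretrained("riotu-lab/ArabianGPT-base")

# Encode an Arabic prompt ("The Arabic language") and sample a continuation
inputs = tokenizer("اللغة العربية", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,   # sample instead of greedy decoding
    top_p=0.9,        # nucleus sampling
    temperature=0.8,  # illustrative value
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```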
# Limitations

> [!WARNING]
> As with any language model, ArabianGPT may have limitations in understanding context or generating text in certain scenarios. Users should be aware of these limitations and use the model accordingly.
# Ethical Considerations

We emphasize responsible usage of ArabianGPT. Users should ensure that the generated text is used ethically and does not propagate misinformation or harmful content.
# Citation

If you use ArabianGPT in your research or application, please cite it as follows:

```
@misc{ArabianGPT2023,
  title={ArabianGPT: A GPT-2 Based Language Model for Arabic},
  author={Najar, Omar and Sibaee, Serry and Ghouti, Lahouari and Koubaa, Anis},
  affiliation={Prince Sultan University, Riyadh, Saudi Arabia},
  year={2023}
}
```
# Acknowledgments

> We thank Prince Sultan University, especially the Robotics and Internet of Things Lab, for their support.

# Contact

For inquiries regarding ArabianGPT, please contact onajar@psu.edu.sa.