---
license: mit
language:
- ar
pipeline_tag: text-generation
tags:
- arabic
- text-generation
---
# Model Description
* Model Name: ArabianGPT
* Architecture: GPT-2
* Layers: 12
* Model Size: 134M
* Context Window Size: 768
> [!NOTE]
> ArabianGPT is a custom-trained version of the GPT-2 base model, specifically tailored for the Arabic language. It is designed to understand and generate Arabic text, making it suitable for various natural language processing tasks in Arabic.
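As a quick sanity check of the numbers above, the model configuration can be inspected after download. This is a minimal sketch, assuming the checkpoint is published under the ID used in the usage example below and exposes a standard GPT-2 configuration:

```python
from transformers import AutoConfig

# Assumes the checkpoint ID from the Usage section; a standard GPT-2
# config exposes layer count, context window, and hidden size.
config = AutoConfig.from_pretrained("riotu-lab/ArabianGPT-base")
print(config.n_layer)      # expected: 12
print(config.n_positions)  # expected: 768 (context window)
print(config.n_embd)       # hidden size
```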
# Training
* Dataset: Abu Elkhiar Corpus
* Size: 15.5 GB
* Number of Words: 237,814,541
* Number of Tokens: 1,752,421,071
* Number of Parameters: 134M
* Steps: 337,500
* Loss: 3.97
> [!NOTE]
> The model was trained on the Abu Elkhiar dataset, a comprehensive Arabic text corpus encompassing a wide range of topics. The training process focused on adapting the model to understand the nuances and complexities of the Arabic language.
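For reference, assuming the reported loss is the standard token-level cross-entropy in nats, it corresponds to a final training perplexity of roughly exp(3.97) ≈ 53.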
# Tokenizer
* Type: Custom-trained SentencePiece tokenizer
* Vocabulary Size: 64K
> We employed AraNizer, a custom-trained tokenizer based on the SentencePiece model, with a vocabulary size of 64K. This choice optimizes the model for the specific characteristics of the Arabic language.

More information about AraNizer is available [here](https://github.com/omarnj-lab/aranizer/tree/main).
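A minimal sketch of inspecting the tokenizer, assuming it ships with the checkpoint and loads through `AutoTokenizer` (the sample sentence is illustrative):

```python
from transformers import AutoTokenizer

# Assumes the AraNizer tokenizer is bundled with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("riotu-lab/ArabianGPT-base")
print(len(tokenizer))  # vocabulary size; expected around 64K

# Tokenize a sample Arabic sentence ("Welcome to the world of artificial intelligence")
tokens = tokenizer.tokenize("مرحبا بكم في عالم الذكاء الاصطناعي")
print(tokens)
```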
# Usage
ArabianGPT can be used for text generation tasks in Arabic.
### How to use
Here is how to use this model for Arabic text generation with the Transformers pipeline:
```python
from transformers import pipeline

# Load ArabianGPT into a text-generation pipeline
pipe = pipeline("text-generation", model="riotu-lab/ArabianGPT-base", max_new_tokens=512)

text = ""  # your Arabic prompt goes here
print(pipe(text)[0]["generated_text"])
```
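Decoding behavior can be tuned through standard `generate` keyword arguments passed to the pipeline. The values and the prompt below are illustrative, not recommendations from the authors:

```python
# Use sampling instead of greedy decoding; parameter values are illustrative.
output = pipe(
    "اكتب فقرة عن الرياض",  # example Arabic prompt: "Write a paragraph about Riyadh"
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(output[0]["generated_text"])
```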
# Limitations
> [!WARNING]
> As with any language model, ArabianGPT may have limitations in understanding context or generating text in certain scenarios. Users should be aware of these limitations and use the model accordingly.
# Ethical Considerations
We emphasize responsible usage of ArabianGPT. Users should ensure that the generated text is used ethically and does not propagate misinformation or harmful content.
# Citation
If you use ArabianGPT in your research or application, please cite it as follows:
```bibtex
@misc{ArabianGPT2023,
  title={ArabianGPT: A GPT-2 Based Language Model for Arabic},
  author={Najar, Omar and Sibaee, Serry and Ghouti, Lahouari and Koubaa, Anis},
  affiliation={Prince Sultan University, Riyadh, Saudi Arabia},
  year={2023},
}
```
# Acknowledgments
> We thank Prince Sultan University, especially the Robotics and Internet of Things Lab, for their support.
# Contact
For inquiries regarding ArabianGPT, please contact onajar@psu.edu.sa.