---
license: mit
language:
- ar
pipeline_tag: text-generation
tags:
- arabic
- text-generation
---
# Model Description

* Model Name: ArabianGPT
* Architecture: GPT-2
* Layers: 12
* Model Size: 134M parameters
* Context Window Size: 768 tokens

> [!NOTE]
> ArabianGPT is a custom-trained version of the GPT-2 base model, specifically tailored for the Arabic language. It is designed to understand and generate Arabic text, making it suitable for various natural language processing tasks in Arabic.
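
To sanity-check these hyperparameters against the published checkpoint, you can read them from the model configuration. A minimal sketch, assuming the checkpoint name `riotu-lab/ArabianGPT-base` from the Usage section below and the standard GPT-2 configuration fields:

```python
from transformers import AutoConfig

# Load the configuration shipped with the checkpoint
# (checkpoint name assumed from the Usage section below)
config = AutoConfig.from_pretrained("riotu-lab/ArabianGPT-base")

print(config.n_layer)      # number of transformer layers, expected: 12
print(config.n_positions)  # context window size, expected: 768
print(config.vocab_size)   # vocabulary size, expected: ~64K
```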
# Training

* Dataset: Abu Elkhiar Corpus
* Size: 15.5 GB
* Number of Words: 237,814,541
* Number of Tokens: 1,752,421,071
* Number of Parameters: 134M
* Steps: 337,500
* Loss: 3.97

> [!NOTE]
> The model was trained on the Abu Elkhiar dataset, a comprehensive Arabic text corpus encompassing a wide range of topics. The training process focused on adapting the model to understand the nuances and complexities of the Arabic language.
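
For context, if the reported loss of 3.97 is the final mean cross-entropy in nats per token (the usual convention for GPT-2 training), it corresponds to a perplexity of roughly 53:

```python
import math

# Assuming the reported loss is mean cross-entropy in nats per token,
# perplexity is its exponential
final_loss = 3.97
print(math.exp(final_loss))  # ~52.98
```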
# Tokenizer

* Type: Custom-trained SentencePiece tokenizer
* Vocabulary Size: 64K

> We employed AraNizer, a custom-trained tokenizer based on the SentencePiece model, with a vocabulary size of 64K. This choice was made to optimize the model's performance for the specific characteristics of the Arabic language.

More information about AraNizer can be found in the [AraNizer repository](https://github.com/omarnj-lab/aranizer/tree/main).
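
A short sketch of loading and probing the tokenizer, assuming it is bundled with the `riotu-lab/ArabianGPT-base` checkpoint and loadable through the standard `AutoTokenizer` interface:

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with the checkpoint (name assumed)
tokenizer = AutoTokenizer.from_pretrained("riotu-lab/ArabianGPT-base")

print(len(tokenizer))  # vocabulary size, expected: ~64K

# Tokenize a short Arabic sentence ("Artificial intelligence is the future")
print(tokenizer.tokenize("الذكاء الاصطناعي هو المستقبل"))
```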
# Usage

ArabianGPT can be used for text generation tasks in Arabic.

### How to use

Here is how to use this model to generate Arabic text with the Transformers `pipeline` API:
```python
from transformers import pipeline

# Load the model into a text-generation pipeline
pipe = pipeline("text-generation", model="riotu-lab/ArabianGPT-base", max_new_tokens=512)

text = ""  # your Arabic prompt here

print(pipe(text))
```
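
For finer control over decoding, you can load the model and tokenizer directly and call `generate`. A minimal sketch; the sampling values below are illustrative, not recommendations from the authors:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("riotu-lab/ArabianGPT-base")
model = AutoModelForCausalLM.from_pretrained("riotu-lab/ArabianGPT-base")

# Encode an Arabic prompt ("The Arabic language") and sample a continuation
inputs = tokenizer("اللغة العربية", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,   # sample instead of greedy decoding
    top_p=0.9,        # nucleus sampling
    temperature=0.8,  # illustrative value
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```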
# Limitations

> [!WARNING]
> As with any language model, ArabianGPT may have limitations in understanding context or generating text in certain scenarios. Users should be aware of these limitations and use the model accordingly.
# Ethical Considerations

We emphasize responsible usage of ArabianGPT. Users should ensure that the generated text is used ethically and does not propagate misinformation or harmful content.
# Citation

If you use ArabianGPT in your research or application, please cite it as follows:

```
@misc{ArabianGPT2023,
  title={ArabianGPT: A GPT-2 Based Language Model for Arabic},
  author={Najar, Omar and Sibaee, Serry and Ghouti, Lahouari and Koubaa, Anis},
  affiliation={Prince Sultan University, Riyadh, Saudi Arabia},
  year={2023}
}
```
# Acknowledgments

> We thank Prince Sultan University, especially the Robotics and Internet of Things Lab, for their support.

# Contact

For inquiries regarding ArabianGPT, please contact onajar@psu.edu.sa.