riotu-lab/Aranizer-PBE-86k

Arabic Tokenizer

riotu-lab committed on Aug 25

Commit 3ff8005 • 1 Parent(s): b5c0ab1

update readme.md

Files changed (1)

README.md +51 -3

README.md CHANGED

@@ -1,3 +1,51 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- ar
+tags:
+- 'Aranizer'
+- Arabic Tokenizer
+- PBE
+---
+# Aranizer | Arabic Tokenizer
+**Aranizer** is an Arabic PBE-based tokenizer designed for efficient and versatile tokenization.
+## Features
+- **Tokenizer Name**: Aranizer
+- **Type**: PBE tokenizer
+- **Vocabulary Size**: 86,000
+- **Total Number of Tokens**: 1,301,758
+- **Fertility Score**: 1.691
+- **Diacritization**: Supports Arabic diacritization
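The README does not say how the fertility score of 1.691 was measured. Fertility is commonly defined as the average number of tokens produced per whitespace-separated word, and the sketch below computes it under that assumption. The names `fertility` and `toy_tokenize` are hypothetical, and the toy tokenizer only stands in for a real one so the example runs offline.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


# Hypothetical stand-in tokenizer: cuts each word into chunks of at most 3 characters.
def toy_tokenize(text: str) -> List[str]:
    return [word[i:i + 3] for word in text.split() for i in range(0, len(word), 3)]


score = fertility(["اكتب النص العربي"], toy_tokenize)
print(f"Fertility: {score:.3f}")  # Fertility: 2.000 (6 chunks over 3 words)
```

To measure a real tokenizer, pass its `tokenize` method in place of `toy_tokenize`. The result depends on the evaluation corpus, so it will not necessarily match the figure listed above.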
+## How to Use the Aranizer Tokenizer
+Load the Aranizer tokenizer with the Hugging Face `transformers` library. The example below loads it and tokenizes a short Arabic string:
+```python
+from transformers import AutoTokenizer
+# Load the Aranizer tokenizer
+tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-86k")
+# Example usage
+text = "اكتب النص العربي"
+tokens = tokenizer.tokenize(text)
+token_ids = tokenizer.convert_tokens_to_ids(tokens)
+print("Tokens:", tokens)
+print("Token IDs:", token_ids)
+```
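Since the README lists diacritization support, one quick check is whether diacritized text survives an encode/decode round trip. The helper below is a minimal sketch: `round_trip` is a hypothetical name, and it assumes only that the tokenizer object exposes the standard `encode` and `decode` methods of `transformers` tokenizers. Whether the decoded string matches the input exactly depends on the tokenizer's normalization, which the README does not describe.

```python
def round_trip(tokenizer, text: str) -> dict:
    """Encode text to IDs, decode it back, and report whether the text is unchanged."""
    # Skip special tokens so the comparison covers only the text itself.
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return {"ids": ids, "decoded": decoded, "lossless": decoded == text}
```

With the tokenizer loaded as in the example above, calling `round_trip(tokenizer, text)` on a diacritized string and inspecting the `lossless` field shows whether the diacritics were preserved.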
+## Citation
+```bibtex
+@article{koubaa2024arabiangpt,
+  title={ArabianGPT: Native Arabic GPT-based Large Language Model},
+  author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
+  year={2024},
+  publisher={Preprints}
+}