riotu-lab committed on
Commit
3ff8005
1 Parent(s): b5c0ab1

update readme.md

  1. README.md +51 -3
README.md CHANGED
@@ -1,3 +1,51 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - ar
+ tags:
+ - 'Aranizer'
+ - Arabic Tokenizer
+ - PBE
+ ---
+
+ # Aranizer | Arabic Tokenizer
+
+ **Aranizer** is a PBE-based tokenizer for Arabic text.
+
+ ## Features
+
+ - **Tokenizer Name**: Aranizer
+ - **Type**: PBE tokenizer
+ - **Vocabulary Size**: 86,000
+ - **Total Number of Tokens**: 1,301,758
+ - **Fertility Score**: 1.691
+ - **Diacritics**: Supports Arabic diacritization
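The fertility score in the list above is commonly defined as the average number of tokens a tokenizer produces per whitespace-separated word; lower values mean words are split into fewer pieces. The README does not say how 1.691 was measured or on which corpus, so the following is only a minimal sketch under that common definition. The `fertility` helper and the toy `char_pairs` tokenizer are hypothetical names used for illustration; in practice one would pass `tokenizer.tokenize` from the loaded Aranizer tokenizer instead.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def char_pairs(text: str) -> List[str]:
    """Toy stand-in tokenizer (NOT Aranizer): cuts each word into 2-character chunks."""
    return [word[i:i + 2] for word in text.split() for i in range(0, len(word), 2)]


sample = ["اكتب النص العربي"]
score = fertility(char_pairs, sample)
print(f"Fertility: {score:.3f}")
```

Because `fertility` accepts any callable that maps a string to a list of tokens, the same function can compare several tokenizers on the same corpus.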
23
+
24
+ ## How to Use the Aranizer Tokenizer
25
+
26
+ The Aranizer tokenizer can be easily loaded using the `transformers` library from Hugging Face. Below is an example of how to load and use the tokenizer in your Python project:
27
+
28
+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the Aranizer tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-86k")
+
+ # Example usage
+ text = "اكتب النص العربي"
+ tokens = tokenizer.tokenize(text)
+ token_ids = tokenizer.convert_tokens_to_ids(tokens)
+
+ print("Tokens:", tokens)
+ print("Token IDs:", token_ids)
+ ```
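Besides `tokenize` and `convert_tokens_to_ids`, Hugging Face tokenizers expose `encode` and `decode`, which gives a quick way to check whether diacritized text survives a round trip. The helper below is a minimal sketch: `roundtrip` is a hypothetical name, and it accepts any object with `encode` and `decode` methods. Whether Aranizer reproduces the input exactly is not stated in the README, so treat the comparison as a check to run, not a guarantee.

```python
from typing import Any, List, Tuple


def roundtrip(tokenizer: Any, text: str) -> Tuple[List[int], str]:
    """Encode text to token IDs and decode it back, leaving out special tokens."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, decoded


# Usage with the tokenizer loaded as in the README (requires network access):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("riotu-lab/Aranizer-PBE-86k")
# text = "اكْتُبِ النَّصَّ العَرَبِيَّ"  # diacritized sample sentence (illustrative)
# ids, decoded = roundtrip(tokenizer, text)
# print(ids, decoded == text)
```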
+
+ ## Citation
+
+ ```bibtex
+ @article{koubaa2024arabiangpt,
+   title={ArabianGPT: Native Arabic GPT-based Large Language Model},
+   author={Koubaa, Anis and Ammar, Adel and Ghouti, Lahouari and Necar, Omer and Sibaee, Serry},
+   year={2024},
+   publisher={Preprints}
+ }
+ ```