IsmaelMousa committed · verified
Commit 8358bb4 · 1 Parent(s): 726ef27

Update README.md

Files changed (1): README.md (+85 -3)
README.md CHANGED

---
license: mit
language:
- ar
library_name: tokenizers
pipeline_tag: summarization
tags:
- arabic
- summarization
- tokenizers
- BPE
---

## Byte Level (BPE) Tokenizer for Arabic

A robust byte-level tokenizer designed to handle Arabic text with precision and efficiency.
It uses `Byte-Pair Encoding (BPE)` to build a vocabulary of `32,000` tokens, catering specifically to the intricacies of the Arabic language.

### Goal

This tokenizer was created as part of developing an Arabic BART transformer model for summarization from scratch using `PyTorch`.
In line with the configurations outlined in the official [BART](https://arxiv.org/abs/1910.13461) paper, which specifies BPE tokenization, I sought a BPE tokenizer tailored specifically to Arabic.
While Arabic-only tokenizers and multilingual BPE tokenizers exist, a dedicated Arabic BPE tokenizer was not available. This gap inspired the creation of a `BPE` tokenizer focused solely on Arabic, ensuring alignment with BART's recommended configuration and enhancing the effectiveness of Arabic NLP tasks.

### Checkpoint Information

- **Name**: `IsmaelMousa/arabic-bpe-tokenizer`
- **Vocabulary Size**: `32,000`

### Overview

The Byte Level Tokenizer is optimized for Arabic text, which often includes a range of diacritics, multiple forms of the same word, and various prefixes and suffixes. It addresses these challenges by breaking text down into byte-level tokens, ensuring it can effectively process and represent the nuances of the Arabic language.
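
The byte-level property can be illustrated directly: any Arabic string, diacritics included, maps to UTF-8 bytes and back without loss, so no character is ever out of vocabulary. A minimal sketch in plain Python, independent of this tokenizer (the sample string is an invented example):

```python
# Byte-level idea in miniature: every character, including Arabic diacritics,
# decomposes into bytes in the range 0-255, and the original text is fully
# recoverable, so there are no out-of-vocabulary characters at the byte level.
text = "لَا شَيْءَ"  # "nothing", written with diacritics

byte_ids = list(text.encode("utf-8"))  # the base "alphabet" is just 256 bytes
assert all(0 <= b <= 255 for b in byte_ids)

restored = bytes(byte_ids).decode("utf-8")
assert restored == text  # lossless round trip
```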

### Features

- **Byte-Pair Encoding (BPE)**: Efficiently manages a large vocabulary while maintaining accuracy.
- **Comprehensive Coverage**: Handles the Arabic script, including diacritics and varied word forms.
- **Flexible Integration**: Integrates easily with the `tokenizers` library for seamless tokenization.
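
For intuition, BPE builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols. A toy, pure-Python sketch of that training loop (illustrative only; the actual checkpoint is built with the `tokenizers` library, and the tiny corpus below is invented):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

# Byte level: each word starts as its sequence of UTF-8 bytes (hex strings here).
corpus = ["كتب", "كتاب", "مكتبة"]  # toy corpus sharing the root k-t-b
words = Counter(tuple(f"{b:02x}" for b in w.encode("utf-8")) for w in corpus)

for _ in range(5):  # learn five merges; a real vocabulary needs thousands
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
```

Each merge shortens frequent byte sequences into single vocabulary entries, which is how the 32,000-token vocabulary ends up covering common Arabic subwords compactly.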

### Installation

To use this tokenizer, install the `tokenizers` library if you haven't already:

```bash
pip install tokenizers
```

### Example Usage

Here is an example of how to use the Byte Level Tokenizer with the `tokenizers` library.
This example tokenizes the Arabic sentence "لاشيء يعجبني, أريد أن أبكي" ("nothing pleases me, I want to cry"):

```python
from tokenizers import Tokenizer

# Load the tokenizer directly from the Hugging Face Hub.
tokenizer = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")

text = "لاشيء يعجبني, أريد أن أبكي"

# Encode to tokens and IDs, then decode the IDs back to text.
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded.ids)

print("Encoded Tokens:", encoded.tokens)
print("Token IDs:", encoded.ids)
print("Decoded Text:", decoded)
```

Output:

```
Encoded Tokens: ['<s>', 'ÙĦا', 'ĠØ´ÙĬØ¡', 'ĠÙĬع', 'جب', 'ÙĨÙĬ', ',', 'ĠأرÙĬد', 'ĠØ£ÙĨ', 'Ġأب', 'ÙĥÙĬ', '</s>']
Token IDs: [0, 419, 1773, 667, 2281, 489, 16, 7578, 331, 985, 1344, 2]
Decoded Text: لا شيء يعجبني, أريد أن أبكي
```
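
The token strings above are not corrupted: byte-level BPE stores tokens in GPT-2's printable byte-to-unicode alphabet, in which the UTF-8 bytes of Arabic letters render as Latin-looking glyphs and `Ġ` marks a leading space. A sketch of that table, assuming the standard GPT-2 scheme used by byte-level components (the function name here is illustrative):

```python
# GPT-2-style byte-to-unicode table: printable bytes map to themselves,
# and the remaining bytes are shifted into printable code points from U+0100.
def byte_to_unicode():
    keep = (list(range(ord("!"), ord("~") + 1))   # printable ASCII
            + list(range(0xA1, 0xAD))             # printable Latin-1, part 1
            + list(range(0xAE, 0x100)))           # printable Latin-1, part 2
    table, n = {}, 0
    for b in range(256):
        if b in keep:
            table[b] = chr(b)
        else:
            table[b] = chr(0x100 + n)  # shift unprintable bytes upward
            n += 1
    return table

table = byte_to_unicode()

# The Arabic letter lam (U+0644) is the UTF-8 byte pair d9 84, which renders
# as the glyphs 'Ù' and 'Ħ' -- the start of the token 'ÙĦا' printed above.
rendered = "".join(table[b] for b in "ل".encode("utf-8"))
```

This also explains why `decode` recovers clean Arabic: the mapping is a bijection over all 256 byte values, so the original bytes are always recoverable.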

### Tokenizer Details

- **Byte-Level Tokenization**: Every byte of the input text is represented, making the approach suitable for languages with complex scripts.
- **Adaptability**: Can be fine-tuned or used as-is, depending on your specific needs and application.

### License

This project is licensed under the `MIT` License.