---
license: mit
language:
- ar
library_name: tokenizers
pipeline_tag: summarization
tags:
- arabic
- summarization
- tokenizers
- BPE
---

## Byte-Level BPE Tokenizer for Arabic

The Byte-Level BPE Tokenizer for Arabic is a robust tokenizer designed to handle Arabic text with precision and efficiency.
It uses `Byte-Pair Encoding (BPE)` to build a vocabulary of `50,000` tokens tailored to the intricacies of the Arabic language.

### Goal

This tokenizer was created as part of developing an Arabic BART transformer model for summarization, built from scratch in `PyTorch`.
The official [BART](https://arxiv.org/abs/1910.13461) paper specifies BPE tokenization, so I sought a BPE tokenizer tailored specifically to Arabic.
While Arabic-only tokenizers and multilingual BPE tokenizers exist, a dedicated Arabic BPE tokenizer was not available. That gap motivated this tokenizer: a `BPE` vocabulary focused solely on Arabic, aligned with BART's recommended configuration and aimed at improving the effectiveness of Arabic NLP tasks.

### Checkpoint Information

- **Name**: `IsmaelMousa/arabic-bpe-tokenizer`
- **Vocabulary Size**: `50,000`

### Overview

The byte-level tokenizer is optimized for Arabic text, which often includes diacritics, multiple forms of the same word, and a variety of prefixes and suffixes. It addresses these challenges by breaking text down into byte-level tokens, so any input is representable and the nuances of Arabic survive encoding and decoding intact.

### Features

- **Byte-Pair Encoding (BPE)**: Balances vocabulary size against coverage; rare words decompose into subword units rather than becoming out-of-vocabulary.
- **Comprehensive Coverage**: Handles Arabic script, including diacritics and varied word forms (see the sketch just below this list).
- **Flexible Integration**: Works directly with the `tokenizers` library for seamless tokenization.
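
As a quick illustration of the coverage claim, the same word tokenizes cleanly with and without diacritics, since byte-level BPE can represent any byte sequence and never emits an unknown token. A minimal sketch (the sample words are illustrative and not taken from the model card; installation is covered below):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")

# The bare and fully diacritized forms of the same verb ("كتب" / "كَتَبَ")
# both encode without any out-of-vocabulary failure: unseen character
# combinations simply decompose into byte-level subword units.
for word in ["كتب", "كَتَبَ"]:
    encoded = tokenizer.encode(word)
    print(word, "->", encoded.tokens)
```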

### Installation

To use this tokenizer, you need to install the `tokenizers` library. If you haven’t installed it yet, you can do so using pip:

```bash
pip install tokenizers
```

### Example Usage

Here is an example of how to use the tokenizer with the `tokenizers` library, demonstrated on the Arabic sentence "لاشيء يعجبني, أريد أن أبكي" ("Nothing pleases me, I want to cry"):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")

text = "لاشيء يعجبني, أريد أن أبكي"

encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded.ids)

print("Encoded Tokens:", encoded.tokens)
print("Token IDs:", encoded.ids)
print("Decoded Text:", decoded)

```

Output (the unusual glyphs in `Encoded Tokens` are the standard byte-level rendering of the underlying UTF-8 bytes; decoding restores readable Arabic, here with «لاشيء» normalized to «لا شيء»):

```
Encoded Tokens: ['<s>', 'ÙĦا', 'ĠØ´ÙĬØ¡', 'ĠÙĬع', 'جب', 'ÙĨÙĬ', ',', 'ĠأرÙĬد', 'ĠØ£ÙĨ', 'Ġأب', 'ÙĥÙĬ', '</s>'] 

Token IDs: [0, 419, 1773, 667, 2281, 489, 16, 7578, 331, 985, 1344, 2] 

Decoded Text: لا شيء يعجبني, أريد أن أبكي 
```
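
Since the stated goal is a BART model built in `PyTorch`, you will likely want this tokenizer behind the Hugging Face `transformers` interface. The wrapper below is a sketch rather than part of this checkpoint: `PreTrainedTokenizerFast` accepts a raw `tokenizers` object, and the special-token names are assumptions inferred from the `<s>`/`</s>` markers visible in the output above (check `<pad>` and `<unk>` in particular against the vocabulary):

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

core = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")

# Expose the raw tokenizer through the standard transformers interface.
# The special-token names are assumptions based on the output above;
# verify them against the checkpoint's vocabulary before training.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=core,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
)

batch = tokenizer(
    ["لاشيء يعجبني, أريد أن أبكي", "أريد أن أبكي"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # (2, length of the longest sequence)
```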

### Tokenizer Details

- **Byte-Level Tokenization**: Every byte of the input is representable, making the approach suitable for languages with complex scripts.
- **Adaptability**: Use the checkpoint as-is, or train a tokenizer with the same design on your own corpus (sketched below).
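
The `tokenizers` library does not incrementally fine-tune an existing BPE model, so adapting this design to a new domain usually means retraining on your own corpus. A minimal sketch, assuming a local plain-text file `corpus.txt` (the file name, vocabulary size, and special-token list are illustrative, not taken from this checkpoint's actual training setup):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Build a byte-level BPE tokenizer with the same overall design:
# byte-level pre-tokenization guarantees there is never an unknown token.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,  # matches this checkpoint's stated size; adjust freely
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    # Seed the vocabulary with all 256 byte symbols, as byte-level BPE requires.
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("arabic-bpe-retrained.json")
```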

### License
This project is licensed under the `MIT` License.