---
license: mit
language:
- fa
- en
- ar
---

# Mana Tokenizer

The Mana Tokenizer is a custom-trained byte-pair encoding (BPE) tokenizer designed for Persian text. It was trained on a large combined Persian corpus and uses high character coverage (99.9%) to handle diverse Persian text.

## Quick Start

You can encode and decode text with the Mana Tokenizer like this:

```python
from mana_tokenizer import ManaTokenizer

tokenizer = ManaTokenizer()
text = "سلام من یک متن تست برای تست این تست هستم."
print(tokenizer.encode(text))
print(tokenizer.decode(tokenizer.encode(text)))
```

For comparison, this is the plain byte-level encoding of the text, which matches its raw UTF-8 bytes (72 values):

```
[216, 179, 217, 132, 216, 167, 217, 133, 32, 217, 133, 217, 134, 32, 219, 140, 218, 169, 32, 217, 133, 216, 170, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 216, 168, 216, 177, 216, 167, 219, 140, 32, 216, 170, 216, 179, 216, 170, 32, 216, 167, 219, 140, 217, 134, 32, 216, 170, 216, 179, 216, 170, 32, 217, 135, 216, 179, 216, 170, 217, 133, 46]
سلام من یک متن تست برای تست این تست هستم.
```

And here is what the Mana Tokenizer generates for the same text (11 tokens):

```
[30318, 377, 363, 4340, 5828, 513, 5828, 378, 5828, 14471, 46]
سلام من یک متن تست برای تست این تست هستم.
```

You can also add special tokens:

```python
tokenizer.register_special_tokens({"": 100269})
```

Batch encode:

```python
tokenizer.batch_encode(["یک متن طولانی"])
```

## Benchmark

- **Benchmark DateTime:** 2024-11-06 16:12:50
- **Mana Batch Encode Time:** 0.10711932182312012 seconds
- **Mana Batch Encode Memory Usage:** 13.203125 KB
- **Total Characters in Benchmark:** 131,000

## Special Tokens

- **User Token:** `<|user|>`
- **Assistant Token:** `<|assistant|>`
- **End Token:** `<|end|>`
- **System Token:** `<|system|>`

## Statistics

- **Model Type:** BPE
- **Vocabulary Size:** 265,703
- **Character Coverage:** 99.9%
- **Total Number of Text Samples:** 1,147,036
- **Total Number of Tokens:** 1,490,338
- **Average Token Length:** 4.51
- **Corpus Size (in bytes):** 1,792,210,410

## Training Details

- **Training Data:** Mana Persian corpus
- **Training Script:** Mana Trainer
- **Script Version:** 1.2

## License

The Mana Tokenizer is licensed under the MIT License.
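
The byte-versus-token comparison from the Quick Start can be checked without installing the tokenizer. The sketch below is illustrative only: it recomputes the raw UTF-8 byte sequence with Python's built-in `str.encode` and compares its length with the Mana token IDs copied from the Quick Start output. The names `byte_ids`, `mana_ids`, and `compression_ratio` are introduced here for illustration and are not part of the `mana_tokenizer` API.

```python
# Illustrative sketch: compare raw UTF-8 bytes with the Mana token IDs
# listed in the Quick Start. No tokenizer import is needed here.
text = "سلام من یک متن تست برای تست این تست هستم."

# Byte-level baseline: one ID per UTF-8 byte.
byte_ids = list(text.encode("utf-8"))

# Token IDs copied verbatim from the Quick Start output (not recomputed).
mana_ids = [30318, 377, 363, 4340, 5828, 513, 5828, 378, 5828, 14471, 46]

# Hypothetical helper metric: average number of bytes covered by one token.
compression_ratio = len(byte_ids) / len(mana_ids)

print(len(byte_ids), len(mana_ids))
print(f"{compression_ratio:.2f} bytes per token")
```

For this one sentence, the byte sequence has 72 entries against 11 Mana tokens, so each token covers about 6.5 bytes on average. A single sentence is not a benchmark; the ratio on other text may differ.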