tspersian committed
Commit 8eb50fc
1 Parent(s): df0d099

Updated files and statistics.

README.md CHANGED
@@ -2,43 +2,37 @@
  license: mit
  language:
  - fa
  ---

  # Mana Tokenizer

- The Mana Tokenizer is a custom-trained SentencePiece tokenizer for Persian text, trained on a combination of the Persian Wikipedia and Ganjoor datasets. The tokenizer uses the Unigram model type, optimized for handling the unique characteristics of Persian text.

  ## Special Tokens

- - **UNK Token:** `<unk>`
- - **BOS Token:** `<s>`
- - **EOS Token:** `</s>`
- - **PAD Token:** `<pad>`

- ## Usage
-
- You can load this tokenizer using the `transformers` library as follows:
-
- ```python
- from transformers import PreTrainedTokenizerFast
-
- tokenizer = PreTrainedTokenizerFast.from_pretrained("tspersian/mana_tokenizer")
-
- text = "این یک تست است."
- encoded = tokenizer(text)
- print(f"Encoded: {encoded}")
-
- decoded = tokenizer.decode(encoded['input_ids'])
- print(f"Decoded: {decoded}")
- ```

- ## Statistics
-
- Vocabulary Size: 199,997
- Character Coverage: 99.9%
- Total Number of Text Samples: 1,022,675

  ## License

- This tokenizer is licensed under the MIT License.
 
  license: mit
  language:
  - fa
+ - en
+ - ar
  ---

  # Mana Tokenizer

+ The Mana Tokenizer is a custom-trained BPE tokenizer for Persian text. It is trained on a large combined Persian corpus, with high character coverage to handle diverse Persian text.

  ## Special Tokens

+ - **User Token:** `<|user|>`
+ - **Assistant Token:** `<|assistant|>`
+ - **End Token:** `<|end|>`
+ - **System Token:** `<|system|>`

+ ## Statistics

+ - **Model Type:** BPE
+ - **Vocabulary Size:** 265,703
+ - **Character Coverage:** 99.9%
+ - **Total Number of Text Samples:** 1,147,036
+ - **Total Number of Tokens:** 1,490,338
+ - **Average Token Length:** 4.51
+ - **Corpus Size (in bytes):** 1,792,210,410

+ ## Training Details

+ - **Training Data:** Mana Persian corpus
+ - **Training Script:** Mana Trainer
+ - **Script Version:** 1.2

  ## License

+ The Mana Tokenizer is licensed under the MIT License.
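The updated README lists `<|user|>`, `<|assistant|>`, `<|end|>`, and `<|system|>` as special tokens but does not document a chat template. The snippet below is a minimal sketch of one plausible turn layout built from those tokens; the `build_prompt` helper and the placement of `<|end|>` are illustrative assumptions, not part of the repository.

```python
# Minimal sketch of one possible chat layout using the special tokens listed
# in the new README. The commit does not document an official template, so the
# ordering of tags and the use of <|end|> here are assumptions.

def build_prompt(system: str, turns: list[tuple[str, str]]) -> str:
    """Assemble a prompt string from a system message and (role, text) turns."""
    parts = [f"<|system|>{system}<|end|>"]
    for role, text in turns:
        tag = "<|user|>" if role == "user" else "<|assistant|>"
        parts.append(f"{tag}{text}<|end|>")
    # Leave the prompt open with an assistant tag so a model would continue it.
    parts.append("<|assistant|>")
    return "".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        "You are a helpful Persian assistant.",
        [("user", "سلام! حالت چطوره؟")],
    )
    print(prompt)
```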
mana_tokenizer.model → mana.model RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9fad0857866e56ec1708ab58165b1e536bebfe0bdaba1fbd3c82e4aeab9dd55d
- size 4663060

  version https://git-lfs.github.com/spec/v1
+ oid sha256:e666a42308d210e029f8e9aa8c0056950ecd785e61514230fda35ec2962aa490
+ size 2915213
mana_tokenizer.vocab → mana.vocab RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:81b98c535bc5f759cc987aeddd5dc86ff17ccde04761b245368572c21feba5ca
- size 4604696

  version https://git-lfs.github.com/spec/v1
+ oid sha256:a20c30df07ce2d728dbdc5b86ba88cb65ebeca8af4361b425171051e0d3847bb
+ size 11128488
special_tokens_map.json DELETED
@@ -1,6 +0,0 @@
- {
- "unk_token": "<unk>",
- "bos_token": "<s>",
- "eos_token": "</s>",
- "pad_token": "<pad>"
- }
tokenizer_config.json DELETED
@@ -1,9 +0,0 @@
- {
- "model_type": "unigram",
- "bos_token_id": 1,
- "eos_token_id": 2,
- "unk_token_id": 0,
- "pad_token_id": 3,
- "do_lower_case": false,
- "max_length": 512
- }
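With `special_tokens_map.json` and `tokenizer_config.json` deleted and the model file renamed to `mana.model`, the `PreTrainedTokenizerFast.from_pretrained` snippet from the previous README may no longer apply. The sketch below assumes `mana.model` is still a SentencePiece-format model file fetched locally via Git LFS; the commit only shows the LFS pointer, so the on-disk format is not confirmed here.

```python
# Hedged sketch: load the renamed model directly with the sentencepiece package.
# Assumption: mana.model is still a SentencePiece-format model file and has been
# fetched into the working tree (e.g. after cloning the repo with Git LFS).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="mana.model")

text = "این یک تست است."
pieces = sp.encode(text, out_type=str)  # subword pieces
ids = sp.encode(text, out_type=int)     # integer token ids
print(pieces)
print(ids)
print(sp.decode(ids))                   # round-trip back to the original text
```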