HeshamElsherif685 committed on
Commit 4698701
1 Parent(s): 96d7fd8

Upload tokenizer

Files changed (5)
  1. README.md +3 -0
  2. merges.txt +1 -1
  3. special_tokens_map.json +5 -1
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +19 -1
README.md CHANGED
@@ -1,3 +1,6 @@
+---
+{}
+---
 # CodeParrot
 
 This is a small version of the CodeParrot tokenizer trained on the [CodeParrot Python code dataset](https://huggingface.co/datasets/transformersbook/codeparrot). The tokenizer is trained in Chapter 10: Training Transformers from Scratch in the [NLP with Transformers book](https://learning.oreilly.com/library/view/natural-language-processing/9781098103231/). You can find the full code in the accompanying [Github repository](https://github.com/nlp-with-transformers/notebooks/blob/main/10_transformers-from-scratch.ipynb).
merges.txt CHANGED
@@ -1,4 +1,4 @@
-#version: 0.2 - Trained by `huggingface/tokenizers`
+#version: 0.2
 Ġ Ġ
 ĠĠ ĠĠ
 ĠĠ Ġ
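
The `Ġ` symbols in these merge rules are how a GPT-2-style byte-level BPE vocabulary writes the space byte. The sketch below rebuilds that byte-to-character table and decodes the three rules shown in the diff back to raw bytes. It assumes this tokenizer uses the standard GPT-2 table, which is suggested but not proven by the `GPT2Tokenizer` class named in this commit. `bytes_to_unicode` follows the name used in the GPT-2 reference code, while `table`, `reverse`, and `merges` are illustrative names.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2 byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # Every other byte (space, newline, control bytes, ...) is shifted to 256 and up.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


table = bytes_to_unicode()
space_symbol = table[ord(" ")]  # the character that stands in for a space

# The first three merge rules from merges.txt, as shown in the diff.
merges = [("Ġ", "Ġ"), ("ĠĠ", "ĠĠ"), ("ĠĠ", "Ġ")]
reverse = {ch: b for b, ch in table.items()}
for left, right in merges:
    raw = bytes(reverse[ch] for ch in left + right)
    print(f"{left} + {right} -> {raw!r}")
```

Under that assumption, the three rules build runs of two, four, and three spaces, which is plausible for a tokenizer trained on indented Python source.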
special_tokens_map.json CHANGED
@@ -1 +1,5 @@
-{"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>"}
+{
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "unk_token": "<|endoftext|>"
+}
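
The change to special_tokens_map.json looks like a pure re-serialization: the same three keys, now pretty-printed. A quick way to check that claim, sketched here with the two versions pasted in as strings (the variable names are illustrative, and the indentation of the new version is assumed):

```python
import json

# Old (single-line) and new (pretty-printed) versions, copied from the diff.
old_text = '{"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>"}'
new_text = """{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}"""

old_map = json.loads(old_text)
new_map = json.loads(new_text)

# Parsed dictionaries compare by content, so whitespace differences vanish.
same_content = old_map == new_map
print("identical after parsing:", same_content)
print("distinct special tokens:", sorted(set(new_map.values())))
```

All three roles (`bos_token`, `eos_token`, `unk_token`) point at the single `<|endoftext|>` token in both versions.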
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1 +1,19 @@
-{"unk_token": "<|endoftext|>", "bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "add_prefix_space": false, "model_max_length": 1024, "special_tokens_map_file": null, "name_or_path": "thomwolf/codeparrot-small-vocabulary", "tokenizer_class": "GPT2Tokenizer"}
+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "<|endoftext|>",
+  "model_max_length": 1024,
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>"
+}
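
Because the old tokenizer_config.json was a single line and the new one is reordered and pretty-printed, the textual diff hides which settings actually changed. The sketch below parses both versions as copied from the diff and reports added, removed, and modified keys. The variable names are illustrative and the indentation of the new version is assumed.

```python
import json

# Old version, copied from the removed line of the diff.
old_config = json.loads(
    '{"unk_token": "<|endoftext|>", "bos_token": "<|endoftext|>", '
    '"eos_token": "<|endoftext|>", "add_prefix_space": false, '
    '"model_max_length": 1024, "special_tokens_map_file": null, '
    '"name_or_path": "thomwolf/codeparrot-small-vocabulary", '
    '"tokenizer_class": "GPT2Tokenizer"}'
)

# New version, copied from the added lines of the diff.
new_config = json.loads("""
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "model_max_length": 1024,
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
}
""")

# Key-level comparison: what disappeared, what is new, what changed value.
removed = sorted(old_config.keys() - new_config.keys())
added = sorted(new_config.keys() - old_config.keys())
changed = sorted(
    key for key in old_config.keys() & new_config.keys()
    if old_config[key] != new_config[key]
)

print("removed:", removed)
print("added:  ", added)
print("changed:", changed)
```

Read this way, the commit drops `name_or_path` and `special_tokens_map_file`, adds `added_tokens_decoder` and `clean_up_tokenization_spaces`, and leaves every shared setting (including `model_max_length` and `tokenizer_class`) at its previous value.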