Alexandru Gherghescu committed
Commit: 8b5602c
Parent(s): fe8246f
Fix tokenizer
Instead of having a tokenizer trained from scratch, replace it with the
actual tokenizer used by the original model.

Note that while the vocabulary and the merges are those of the GPT1
model, the pre- and post-processing might differ slightly, due to the
different tokenization methods employed (spaCy vs. HuggingFace's
tokenizers library).
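As a quick sanity check (not part of the commit itself), the updated tokenizer can be loaded with transformers. A minimal sketch; the repository id below is a placeholder, since the commit page does not name the repo:

# Minimal sketch, assuming the files from this commit live in a HuggingFace
# repository; "your-username/gpt1-repro" is a placeholder, not the real id.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-username/gpt1-repro")

# The vocabulary and merges match the original GPT1 tokenizer, but the
# pre-/post-processing runs through HuggingFace's BPE pipeline rather than
# spaCy, so token boundaries may differ slightly on some inputs.
print(tok.tokenize("Hello, world!"))   # tokens from the GPT1 vocab/merges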
- special_tokens_map.json +1 -1
- tokenizer.json +0 -0
- tokenizer_config.json +3 -3
special_tokens_map.json CHANGED

@@ -1,3 +1,3 @@
 {
-  "
+  "unk_token": "<unk>"
 }
tokenizer.json CHANGED

The diff for this file is too large to render. See raw diff.
tokenizer_config.json CHANGED

@@ -1,7 +1,7 @@
 {
   "added_tokens_decoder": {
     "0": {
-      "content": "
+      "content": "<unk>",
       "lstrip": false,
       "normalized": false,
       "rstrip": false,
@@ -10,7 +10,7 @@
     }
   },
   "clean_up_tokenization_spaces": true,
-  "eos_token": "<|endoftext|>",
   "model_max_length": 1000000000000000019884624838656,
-  "tokenizer_class": "PreTrainedTokenizerFast"
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "<unk>"
 }
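With this change, a fast tokenizer loaded from these files picks up the <unk> special token declared in tokenizer_config.json and special_tokens_map.json. A minimal sketch, assuming the repository is checked out locally at the hypothetical path ./gpt1-tokenizer:

# Minimal sketch (assumption: tokenizer.json and the two configs changed above
# are checked out locally at "./gpt1-tokenizer").
from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("./gpt1-tokenizer")

# The unknown token now comes from the updated config; its id is 0, matching
# the "0" entry in added_tokens_decoder above.
print(tok.unk_token)      # "<unk>"
print(tok.unk_token_id)   # 0

Since the config sets "tokenizer_class" to "PreTrainedTokenizerFast", loading through AutoTokenizer.from_pretrained resolves to the same class.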