jncraton commited on
Commit
1b4b5cf
·
1 Parent(s): 8a3d127

Correct model

Browse files
README.md CHANGED
@@ -1,59 +1,70 @@
1
  ---
2
- license: apache-2.0
3
- language:
4
- - en
5
- tags:
6
- - ctranslate2
7
- - flan-alpaca-gpt4-xl
8
- - quantization
9
- - int8
10
  ---
11
 
12
- # Model Card for Flan-Alpaca-GPT4-XL Q8
13
 
14
- The model is quantized version of the [declare-lab/flan-alpaca-gpt4-xl](https://huggingface.co/declare-lab/flan-alpaca-gpt4-xl) with int8 quantization.
15
 
16
- ## Model Details
 
17
 
18
- ### Model Description
 
19
 
20
- The model being quantized using [CTranslate2](https://opennmt.net/CTranslate2/) with the following command:
 
 
21
 
22
- ```
23
- ct2-transformers-converter --model declare-lab/flan-alpaca-gpt4-xl --output_dir declare-lab/flan-alpaca-gpt4-xl-ct2 --copy_files generation_config.json tokenizer.json tokenizer_config.json special_tokens_map.json spiece.model --quantization int8 --force --low_cpu_mem_usage
24
- ```
25
 
26
- If you want to perform the quantization yourself, you need to install the following dependencies:
27
 
28
- ```
29
- pip install -qU ctranslate2 transformers[torch] sentencepiece accelerate
30
- ```
31
 
32
- - **Shared by:** Lim Chee Kin
33
- - **License:** Apache 2.0
34
 
35
- ## How to Get Started with the Model
 
36
 
37
- Use the code below to get started with the model.
 
 
 
 
38
 
39
- ```python
40
- import ctranslate2
41
- import transformers
42
 
43
- translator = ctranslate2.Translator("limcheekin/flan-alpaca-gpt4-xl-ct2")
44
- tokenizer = transformers.AutoTokenizer.from_pretrained("limcheekin/flan-alpaca-gpt4-xl-ct2")
 
 
45
 
46
- input_text = "translate English to German: The house is wonderful."
47
- input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))
48
 
49
- results = translator.translate_batch([input_tokens])
 
50
 
51
- output_tokens = results[0].hypotheses[0]
52
- output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))
53
 
54
- print(output_text)
55
- ```
 
 
 
 
 
 
56
 
57
- The code is taken from https://opennmt.net/CTranslate2/guides/transformers.html#t5.
58
 
59
- The key method of the code above is `translate_batch`, you can find out [its supported parameters here](https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html#ctranslate2.Translator.translate_batch).
 
 
 
 
 
 
 
 
1
  ---
2
+ license: bsd-3-clause
 
 
 
 
 
 
 
3
  ---
4
 
5
+ # CodeT5+ 220M (further tuned on Python)
6
 
7
+ ## Model description
8
 
9
+ [CodeT5+](https://github.com/salesforce/CodeT5/tree/main/CodeT5+) is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
10
+ It is introduced in the paper:
11
 
12
+ [CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf)
13
+ by [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution).
14
 
15
+ Compared to the original CodeT5 family (base: `220M`, large: `770M`), CodeT5+ is pretrained with a diverse set of pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code matching_ to learn rich representations from both unimodal code data and bimodal code-text data.
16
+ Additionally, it employs a simple yet effective _compute-efficient pretraining_ method to initialize the model components with frozen off-the-shelf LLMs such as [CodeGen](https://github.com/salesforce/CodeGen) to efficiently scale up the model (i.e. `2B`, `6B`, `16B`), and adopts a "shallow encoder and deep decoder" architecture.
17
+ Furthermore, it is instruction-tuned to align with natural language instructions (i.e. InstructCodeT5+ 16B) following [Code Alpaca](https://github.com/sahil280114/codealpaca).
18
 
19
+ ## How to use
 
 
20
 
21
+ This model can be easily loaded using the `T5ForConditionalGeneration` functionality and employs the same tokenizer as original [CodeT5](https://github.com/salesforce/CodeT5).
22
 
23
+ ```python
24
+ from transformers import T5ForConditionalGeneration, AutoTokenizer
 
25
 
26
+ checkpoint = "Salesforce/codet5p-220m-py"
27
+ device = "cuda" # for GPU usage or "cpu" for CPU usage
28
 
29
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
30
+ model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)
31
 
32
+ inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
33
+ outputs = model.generate(inputs, max_length=10)
34
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
35
+ # ==> print('Hello World!')
36
+ ```
37
 
38
+ ## Pretraining data
 
 
39
 
40
+ This checkpoint is trained on the stricter permissive subset of the deduplicated version of the [github-code dataset](https://huggingface.co/datasets/codeparrot/github-code).
41
+ The data is preprocessed by reserving only permissively licensed code ("mit" “apache-2”, “bsd-3-clause”, “bsd-2-clause”, “cc0-1.0”, “unlicense”, “isc”).
42
+ Supported languages (9 in total) are as follows:
43
+ `c`, `c++`, `c-sharp`, `go`, `java`, `javascript`, `php`, `python`, `ruby.`
44
 
45
+ ## Training procedure
 
46
 
47
+ This checkpoint is first trained on the multilingual unimodal code data at the first-stage pretraining, which includes a diverse set of pretraining tasks including _span denoising_ and two variants of _causal language modeling_.
48
+ After that, it is further trained on the Python subset with the causal language modeling objective for another epoch to better adapt for Python code generation. Please refer to the paper for more details.
49
 
50
+ ## Evaluation results
 
51
 
52
+ CodeT5+ models have been comprehensively evaluated on a wide range of code understanding and generation tasks in various settings: _zero-shot_, _finetuning_, and _instruction-tuning_.
53
+ Specifically, CodeT5+ yields substantial performance gains on many downstream tasks compared to their SoTA baselines, e.g.,
54
+ 8 text-to-code retrieval tasks (+3.2 avg. MRR), 2 line-level code completion tasks (+2.1 avg. Exact Match), and 2 retrieval-augmented code generation tasks (+5.8 avg. BLEU-4).
55
+ In 2 math programming tasks on MathQA-Python and GSM8K-Python, CodeT5+ models of below billion-parameter sizes significantly outperform many LLMs of up to 137B parameters.
56
+ Particularly, in the zero-shot text-to-code generation task on HumanEval benchmark, InstructCodeT5+ 16B sets new SoTA results of 35.0% pass@1 and 54.5% pass@10 against other open code LLMs, even surpassing the closed-source OpenAI code-cushman-001 mode
57
+ Please refer to the [paper](https://arxiv.org/pdf/2305.07922.pdf) for more details.
58
+
59
+ Specifically for this checkpoint, it achieves 12.0% pass@1 on HumanEval in the zero-shot setting, which outperforms much larger LLMs such as Incoder 1.3B’s 8.9%, GPT-Neo 2.7B's 6.4%, and GPT-J 6B's 11.6%.
60
 
61
+ ## BibTeX entry and citation info
62
 
63
+ ```bibtex
64
+ @article{wang2023codet5plus,
65
+ title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
66
+ author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
67
+ journal={arXiv preprint},
68
+ year={2023}
69
+ }
70
+ ```
added_tokens.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {}
config.json CHANGED
@@ -4,5 +4,6 @@
4
  "bos_token": "<pad>",
5
  "decoder_start_token": "<pad>",
6
  "eos_token": "</s>",
 
7
  "unk_token": "<unk>"
8
  }
 
4
  "bos_token": "<pad>",
5
  "decoder_start_token": "<pad>",
6
  "eos_token": "</s>",
7
+ "layer_norm_epsilon": null,
8
  "unk_token": "<unk>"
9
  }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e767c6025eefa8b28780bc6e9deb02a279af9d19428670014095d75fc42635c3
3
- size 2855549640
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8db41ae88deab3f37ad68e6cd4533e6dc80ad0a920db70d0420f3731061bfab1
3
+ size 223992662
shared_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json CHANGED
@@ -1,107 +1,147 @@
1
  {
2
- "additional_special_tokens": [
3
- "<extra_id_0>",
4
- "<extra_id_1>",
5
- "<extra_id_2>",
6
- "<extra_id_3>",
7
- "<extra_id_4>",
8
- "<extra_id_5>",
9
- "<extra_id_6>",
10
- "<extra_id_7>",
11
- "<extra_id_8>",
12
- "<extra_id_9>",
13
- "<extra_id_10>",
14
- "<extra_id_11>",
15
- "<extra_id_12>",
16
- "<extra_id_13>",
17
- "<extra_id_14>",
18
- "<extra_id_15>",
19
- "<extra_id_16>",
20
- "<extra_id_17>",
21
- "<extra_id_18>",
22
- "<extra_id_19>",
23
- "<extra_id_20>",
24
- "<extra_id_21>",
25
- "<extra_id_22>",
26
- "<extra_id_23>",
27
- "<extra_id_24>",
28
- "<extra_id_25>",
29
- "<extra_id_26>",
30
- "<extra_id_27>",
31
- "<extra_id_28>",
32
- "<extra_id_29>",
33
- "<extra_id_30>",
34
- "<extra_id_31>",
35
- "<extra_id_32>",
36
- "<extra_id_33>",
37
- "<extra_id_34>",
38
- "<extra_id_35>",
39
- "<extra_id_36>",
40
- "<extra_id_37>",
41
- "<extra_id_38>",
42
- "<extra_id_39>",
43
- "<extra_id_40>",
44
- "<extra_id_41>",
45
- "<extra_id_42>",
46
- "<extra_id_43>",
47
- "<extra_id_44>",
48
- "<extra_id_45>",
49
- "<extra_id_46>",
50
- "<extra_id_47>",
51
- "<extra_id_48>",
52
- "<extra_id_49>",
53
- "<extra_id_50>",
54
- "<extra_id_51>",
55
- "<extra_id_52>",
56
- "<extra_id_53>",
57
- "<extra_id_54>",
58
- "<extra_id_55>",
59
- "<extra_id_56>",
60
- "<extra_id_57>",
61
- "<extra_id_58>",
62
- "<extra_id_59>",
63
- "<extra_id_60>",
64
- "<extra_id_61>",
65
- "<extra_id_62>",
66
- "<extra_id_63>",
67
- "<extra_id_64>",
68
- "<extra_id_65>",
69
- "<extra_id_66>",
70
- "<extra_id_67>",
71
- "<extra_id_68>",
72
- "<extra_id_69>",
73
- "<extra_id_70>",
74
- "<extra_id_71>",
75
- "<extra_id_72>",
76
- "<extra_id_73>",
77
- "<extra_id_74>",
78
- "<extra_id_75>",
79
- "<extra_id_76>",
80
- "<extra_id_77>",
81
- "<extra_id_78>",
82
- "<extra_id_79>",
83
- "<extra_id_80>",
84
- "<extra_id_81>",
85
- "<extra_id_82>",
86
- "<extra_id_83>",
87
- "<extra_id_84>",
88
- "<extra_id_85>",
89
- "<extra_id_86>",
90
- "<extra_id_87>",
91
- "<extra_id_88>",
92
- "<extra_id_89>",
93
- "<extra_id_90>",
94
- "<extra_id_91>",
95
- "<extra_id_92>",
96
- "<extra_id_93>",
97
- "<extra_id_94>",
98
- "<extra_id_95>",
99
- "<extra_id_96>",
100
- "<extra_id_97>",
101
- "<extra_id_98>",
102
- "<extra_id_99>"
103
- ],
104
- "eos_token": "</s>",
105
- "pad_token": "<pad>",
106
- "unk_token": "<unk>"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
107
  }
 
1
  {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "single_word": false,
5
+ "lstrip": false,
6
+ "rstrip": false,
7
+ "normalized": true
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "single_word": false,
12
+ "lstrip": false,
13
+ "rstrip": false,
14
+ "normalized": true
15
+ },
16
+ "unk_token": {
17
+ "content": "<unk>",
18
+ "single_word": false,
19
+ "lstrip": false,
20
+ "rstrip": false,
21
+ "normalized": true
22
+ },
23
+ "sep_token": {
24
+ "content": "</s>",
25
+ "single_word": false,
26
+ "lstrip": false,
27
+ "rstrip": false,
28
+ "normalized": true
29
+ },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "single_word": false,
33
+ "lstrip": false,
34
+ "rstrip": false,
35
+ "normalized": true
36
+ },
37
+ "cls_token": {
38
+ "content": "<s>",
39
+ "single_word": false,
40
+ "lstrip": false,
41
+ "rstrip": false,
42
+ "normalized": true
43
+ },
44
+ "mask_token": { "content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
45
+ "additional_special_tokens": [
46
+ { "content":"<extra_id_99>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
47
+ { "content":"<extra_id_98>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
48
+ { "content":"<extra_id_97>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
49
+ { "content":"<extra_id_96>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
50
+ { "content":"<extra_id_95>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
51
+ { "content":"<extra_id_94>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
52
+ { "content":"<extra_id_93>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
53
+ { "content":"<extra_id_92>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
54
+ { "content":"<extra_id_91>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
55
+ { "content":"<extra_id_90>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
56
+ { "content":"<extra_id_89>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
57
+ { "content":"<extra_id_88>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
58
+ { "content":"<extra_id_87>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
59
+ { "content":"<extra_id_86>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
60
+ { "content":"<extra_id_85>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
61
+ { "content":"<extra_id_84>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
62
+ { "content":"<extra_id_83>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
63
+ { "content":"<extra_id_82>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
64
+ { "content":"<extra_id_81>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
65
+ { "content":"<extra_id_80>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
66
+ { "content":"<extra_id_79>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
67
+ { "content":"<extra_id_78>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
68
+ { "content":"<extra_id_77>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
69
+ { "content":"<extra_id_76>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
70
+ { "content":"<extra_id_75>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
71
+ { "content":"<extra_id_74>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
72
+ { "content":"<extra_id_73>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
73
+ { "content":"<extra_id_72>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
74
+ { "content":"<extra_id_71>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
75
+ { "content":"<extra_id_70>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
76
+ { "content":"<extra_id_69>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
77
+ { "content":"<extra_id_68>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
78
+ { "content":"<extra_id_67>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
79
+ { "content":"<extra_id_66>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
80
+ { "content":"<extra_id_65>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
81
+ { "content":"<extra_id_64>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
82
+ { "content":"<extra_id_63>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
83
+ { "content":"<extra_id_62>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
84
+ { "content":"<extra_id_61>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
85
+ { "content":"<extra_id_60>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
86
+ { "content":"<extra_id_59>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
87
+ { "content":"<extra_id_58>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
88
+ { "content":"<extra_id_57>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
89
+ { "content":"<extra_id_56>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
90
+ { "content":"<extra_id_55>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
91
+ { "content":"<extra_id_54>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
92
+ { "content":"<extra_id_53>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
93
+ { "content":"<extra_id_52>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
94
+ { "content":"<extra_id_51>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
95
+ { "content":"<extra_id_50>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
96
+ { "content":"<extra_id_49>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
97
+ { "content":"<extra_id_48>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
98
+ { "content":"<extra_id_47>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
99
+ { "content":"<extra_id_46>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
100
+ { "content":"<extra_id_45>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
101
+ { "content":"<extra_id_44>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
102
+ { "content":"<extra_id_43>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
103
+ { "content":"<extra_id_42>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
104
+ { "content":"<extra_id_41>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
105
+ { "content":"<extra_id_40>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
106
+ { "content":"<extra_id_39>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
107
+ { "content":"<extra_id_38>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
108
+ { "content":"<extra_id_37>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
109
+ { "content":"<extra_id_36>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
110
+ { "content":"<extra_id_35>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
111
+ { "content":"<extra_id_34>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
112
+ { "content":"<extra_id_33>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
113
+ { "content":"<extra_id_32>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
114
+ { "content":"<extra_id_31>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
115
+ { "content":"<extra_id_30>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
116
+ { "content":"<extra_id_29>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
117
+ { "content":"<extra_id_28>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
118
+ { "content":"<extra_id_27>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
119
+ { "content":"<extra_id_26>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
120
+ { "content":"<extra_id_25>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
121
+ { "content":"<extra_id_24>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
122
+ { "content":"<extra_id_23>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
123
+ { "content":"<extra_id_22>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
124
+ { "content":"<extra_id_21>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
125
+ { "content":"<extra_id_20>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
126
+ { "content":"<extra_id_19>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
127
+ { "content":"<extra_id_18>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
128
+ { "content":"<extra_id_17>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
129
+ { "content":"<extra_id_16>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
130
+ { "content":"<extra_id_15>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
131
+ { "content":"<extra_id_14>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
132
+ { "content":"<extra_id_13>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
133
+ { "content":"<extra_id_12>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
134
+ { "content":"<extra_id_11>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
135
+ { "content":"<extra_id_10>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
136
+ { "content":"<extra_id_9>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
137
+ { "content":"<extra_id_8>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
138
+ { "content":"<extra_id_7>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
139
+ { "content":"<extra_id_6>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
140
+ { "content":"<extra_id_5>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
141
+ { "content":"<extra_id_4>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
142
+ { "content":"<extra_id_3>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
143
+ { "content":"<extra_id_2>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
144
+ { "content":"<extra_id_1>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
145
+ { "content":"<extra_id_0>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true }
146
+ ]
147
  }
tokenizer_config.json CHANGED
@@ -1,112 +1,62 @@
1
  {
2
- "additional_special_tokens": [
3
- "<extra_id_0>",
4
- "<extra_id_1>",
5
- "<extra_id_2>",
6
- "<extra_id_3>",
7
- "<extra_id_4>",
8
- "<extra_id_5>",
9
- "<extra_id_6>",
10
- "<extra_id_7>",
11
- "<extra_id_8>",
12
- "<extra_id_9>",
13
- "<extra_id_10>",
14
- "<extra_id_11>",
15
- "<extra_id_12>",
16
- "<extra_id_13>",
17
- "<extra_id_14>",
18
- "<extra_id_15>",
19
- "<extra_id_16>",
20
- "<extra_id_17>",
21
- "<extra_id_18>",
22
- "<extra_id_19>",
23
- "<extra_id_20>",
24
- "<extra_id_21>",
25
- "<extra_id_22>",
26
- "<extra_id_23>",
27
- "<extra_id_24>",
28
- "<extra_id_25>",
29
- "<extra_id_26>",
30
- "<extra_id_27>",
31
- "<extra_id_28>",
32
- "<extra_id_29>",
33
- "<extra_id_30>",
34
- "<extra_id_31>",
35
- "<extra_id_32>",
36
- "<extra_id_33>",
37
- "<extra_id_34>",
38
- "<extra_id_35>",
39
- "<extra_id_36>",
40
- "<extra_id_37>",
41
- "<extra_id_38>",
42
- "<extra_id_39>",
43
- "<extra_id_40>",
44
- "<extra_id_41>",
45
- "<extra_id_42>",
46
- "<extra_id_43>",
47
- "<extra_id_44>",
48
- "<extra_id_45>",
49
- "<extra_id_46>",
50
- "<extra_id_47>",
51
- "<extra_id_48>",
52
- "<extra_id_49>",
53
- "<extra_id_50>",
54
- "<extra_id_51>",
55
- "<extra_id_52>",
56
- "<extra_id_53>",
57
- "<extra_id_54>",
58
- "<extra_id_55>",
59
- "<extra_id_56>",
60
- "<extra_id_57>",
61
- "<extra_id_58>",
62
- "<extra_id_59>",
63
- "<extra_id_60>",
64
- "<extra_id_61>",
65
- "<extra_id_62>",
66
- "<extra_id_63>",
67
- "<extra_id_64>",
68
- "<extra_id_65>",
69
- "<extra_id_66>",
70
- "<extra_id_67>",
71
- "<extra_id_68>",
72
- "<extra_id_69>",
73
- "<extra_id_70>",
74
- "<extra_id_71>",
75
- "<extra_id_72>",
76
- "<extra_id_73>",
77
- "<extra_id_74>",
78
- "<extra_id_75>",
79
- "<extra_id_76>",
80
- "<extra_id_77>",
81
- "<extra_id_78>",
82
- "<extra_id_79>",
83
- "<extra_id_80>",
84
- "<extra_id_81>",
85
- "<extra_id_82>",
86
- "<extra_id_83>",
87
- "<extra_id_84>",
88
- "<extra_id_85>",
89
- "<extra_id_86>",
90
- "<extra_id_87>",
91
- "<extra_id_88>",
92
- "<extra_id_89>",
93
- "<extra_id_90>",
94
- "<extra_id_91>",
95
- "<extra_id_92>",
96
- "<extra_id_93>",
97
- "<extra_id_94>",
98
- "<extra_id_95>",
99
- "<extra_id_96>",
100
- "<extra_id_97>",
101
- "<extra_id_98>",
102
- "<extra_id_99>"
103
- ],
104
- "clean_up_tokenization_spaces": true,
105
- "eos_token": "</s>",
106
- "extra_ids": 100,
107
- "model_max_length": 512,
108
- "pad_token": "<pad>",
109
- "sp_model_kwargs": {},
110
- "tokenizer_class": "T5Tokenizer",
111
- "unk_token": "<unk>"
112
  }
 
1
  {
2
+ "errors": "replace",
3
+ "unk_token": {
4
+ "content": "<unk>",
5
+ "single_word": false,
6
+ "lstrip": false,
7
+ "rstrip": false,
8
+ "normalized": true,
9
+ "__type": "AddedToken"
10
+ },
11
+ "bos_token": {
12
+ "content": "<s>",
13
+ "single_word": false,
14
+ "lstrip": false,
15
+ "rstrip": false,
16
+ "normalized": true,
17
+ "__type": "AddedToken"
18
+ },
19
+ "eos_token": {
20
+ "content": "</s>",
21
+ "single_word": false,
22
+ "lstrip": false,
23
+ "rstrip": false,
24
+ "normalized": true,
25
+ "__type": "AddedToken"
26
+ },
27
+ "add_prefix_space": false,
28
+ "sep_token": {
29
+ "content": "</s>",
30
+ "single_word": false,
31
+ "lstrip": false,
32
+ "rstrip": false,
33
+ "normalized": true,
34
+ "__type": "AddedToken"
35
+ },
36
+ "cls_token": {
37
+ "content": "<s>",
38
+ "single_word": false,
39
+ "lstrip": false,
40
+ "rstrip": false,
41
+ "normalized": true,
42
+ "__type": "AddedToken"
43
+ },
44
+ "pad_token": {
45
+ "content": "<pad>",
46
+ "single_word": false,
47
+ "lstrip": false,
48
+ "rstrip": false,
49
+ "normalized": true,
50
+ "__type": "AddedToken"
51
+ },
52
+ "mask_token": {
53
+ "content": "<mask>",
54
+ "single_word": false,
55
+ "lstrip": true,
56
+ "rstrip": false,
57
+ "normalized": true,
58
+ "__type": "AddedToken"
59
+ },
60
+ "model_max_length": 512,
61
+ "tokenizer_class": "RobertaTokenizer"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
62
  }
vocab.json ADDED
The diff for this file is too large to render. See raw diff