shaoyuyoung committed
Commit d49cf65 · 1 Parent(s): 74c15ab
Upload 9 files
- .gitattributes +3 -10
- README.md +78 -1
- added_tokens.json +1 -0
- config.json +65 -0
- merges.txt +0 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +147 -0
- tokenizer_config.json +62 -0
- vocab.json +0 -0
.gitattributes
CHANGED
@@ -1,34 +1,27 @@
|
|
1 |
*.7z filter=lfs diff=lfs merge=lfs -text
|
2 |
*.arrow filter=lfs diff=lfs merge=lfs -text
|
3 |
*.bin filter=lfs diff=lfs merge=lfs -text
|
|
|
4 |
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
5 |
-
*.ckpt filter=lfs diff=lfs merge=lfs -text
|
6 |
*.ftz filter=lfs diff=lfs merge=lfs -text
|
7 |
*.gz filter=lfs diff=lfs merge=lfs -text
|
8 |
*.h5 filter=lfs diff=lfs merge=lfs -text
|
9 |
*.joblib filter=lfs diff=lfs merge=lfs -text
|
10 |
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
11 |
-
*.mlmodel filter=lfs diff=lfs merge=lfs -text
|
12 |
*.model filter=lfs diff=lfs merge=lfs -text
|
13 |
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
14 |
-
*.npy filter=lfs diff=lfs merge=lfs -text
|
15 |
-
*.npz filter=lfs diff=lfs merge=lfs -text
|
16 |
*.onnx filter=lfs diff=lfs merge=lfs -text
|
17 |
*.ot filter=lfs diff=lfs merge=lfs -text
|
18 |
*.parquet filter=lfs diff=lfs merge=lfs -text
|
19 |
*.pb filter=lfs diff=lfs merge=lfs -text
|
20 |
-
*.pickle filter=lfs diff=lfs merge=lfs -text
|
21 |
-
*.pkl filter=lfs diff=lfs merge=lfs -text
|
22 |
*.pt filter=lfs diff=lfs merge=lfs -text
|
23 |
*.pth filter=lfs diff=lfs merge=lfs -text
|
24 |
*.rar filter=lfs diff=lfs merge=lfs -text
|
25 |
-
|
26 |
-
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
27 |
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
28 |
*.tflite filter=lfs diff=lfs merge=lfs -text
|
29 |
*.tgz filter=lfs diff=lfs merge=lfs -text
|
30 |
-
*.wasm filter=lfs diff=lfs merge=lfs -text
|
31 |
*.xz filter=lfs diff=lfs merge=lfs -text
|
32 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
33 |
-
*.
|
34 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
|
|
1 |
*.7z filter=lfs diff=lfs merge=lfs -text
|
2 |
*.arrow filter=lfs diff=lfs merge=lfs -text
|
3 |
*.bin filter=lfs diff=lfs merge=lfs -text
|
4 |
+
*.bin.* filter=lfs diff=lfs merge=lfs -text
|
5 |
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
|
|
6 |
*.ftz filter=lfs diff=lfs merge=lfs -text
|
7 |
*.gz filter=lfs diff=lfs merge=lfs -text
|
8 |
*.h5 filter=lfs diff=lfs merge=lfs -text
|
9 |
*.joblib filter=lfs diff=lfs merge=lfs -text
|
10 |
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
|
|
11 |
*.model filter=lfs diff=lfs merge=lfs -text
|
12 |
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
13 |
*.onnx filter=lfs diff=lfs merge=lfs -text
|
14 |
*.ot filter=lfs diff=lfs merge=lfs -text
|
15 |
*.parquet filter=lfs diff=lfs merge=lfs -text
|
16 |
*.pb filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
17 |
*.pt filter=lfs diff=lfs merge=lfs -text
|
18 |
*.pth filter=lfs diff=lfs merge=lfs -text
|
19 |
*.rar filter=lfs diff=lfs merge=lfs -text
|
20 |
+
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
|
21 |
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
22 |
*.tflite filter=lfs diff=lfs merge=lfs -text
|
23 |
*.tgz filter=lfs diff=lfs merge=lfs -text
|
|
|
24 |
*.xz filter=lfs diff=lfs merge=lfs -text
|
25 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
26 |
+
*.zstandard filter=lfs diff=lfs merge=lfs -text
|
27 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
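To see which of the nine uploaded files the updated patterns would route through Git LFS, the check can be sketched in Python. This is only an approximation: it applies `fnmatch` to the file's base name, which is not identical to Git's own attribute matching (directory patterns such as `saved_model/**/*` are not handled), and `LFS_PATTERNS` is a hand-copied subset of the new `.gitattributes`. The helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatch

# Subset of the patterns on the new side of the .gitattributes diff.
LFS_PATTERNS = ["*.bin", "*.bin.*", "*.h5", "*.msgpack", "*.onnx",
                "*.pt", "*.pth", "*.zip", "*tfevents*"]

# Files listed in this commit.
UPLOADED = [
    ".gitattributes", "README.md", "added_tokens.json", "config.json",
    "merges.txt", "pytorch_model.bin", "special_tokens_map.json",
    "tokenizer_config.json", "vocab.json",
]

def is_lfs_tracked(filename, patterns=LFS_PATTERNS):
    """Return True if the base name matches any pattern (fnmatch approximation)."""
    base = filename.rsplit("/", 1)[-1]
    return any(fnmatch(base, pattern) for pattern in patterns)

lfs_files = [name for name in UPLOADED if is_lfs_tracked(name)]
print(lfs_files)
```

Under these assumptions only `pytorch_model.bin` matches, which agrees with that file appearing in the diff as an LFS pointer rather than as binary content.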
README.md
CHANGED
@@ -1,3 +1,80 @@
|
|
1 |
---
|
2 |
-
license:
|
3 |
---
|
1 |
---
|
2 |
+
license: apache-2.0
|
3 |
+
tags:
|
4 |
+
- codet5
|
5 |
+
datasets:
|
6 |
+
- code_search_net
|
7 |
+
inference: false
|
8 |
---
|
9 |
+
|
10 |
+
# CodeT5 (base-sized model)
|
11 |
+
|
12 |
+
Pre-trained CodeT5 model. It was introduced in the paper [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models
|
13 |
+
for Code Understanding and Generation](https://arxiv.org/abs/2109.00859) by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi, and was first released in [this repository](https://github.com/salesforce/CodeT5).
|
14 |
+
|
15 |
+
Disclaimer: The team releasing CodeT5 did not write a model card for this model so this model card has been written by the Hugging Face team (more specifically, [nielsr](https://huggingface.co/nielsr)).
|
16 |
+
|
17 |
+
## Model description
|
18 |
+
|
19 |
+
From the abstract:
|
20 |
+
|
21 |
+
"We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code."
|
22 |
+
|
23 |
+
## Intended uses & limitations
|
24 |
+
|
25 |
+
This repository contains the pre-trained model only, so you can use this model for (among other tasks) masked span prediction, as shown in the code example below. However, the main use of this model is to fine-tune it for a downstream task of interest, such as:
|
26 |
+
* code summarization
|
27 |
+
* code generation
|
28 |
+
* code translation
|
29 |
+
* code refinement
|
30 |
+
* code defect detection
|
31 |
+
* code clone detection.
|
32 |
+
|
33 |
+
Supervised datasets for code can be found [here](https://huggingface.co/datasets?languages=languages:code).
|
34 |
+
See the [model hub](https://huggingface.co/models?search=salesforce/codet) to look for fine-tuned versions on a task that interests you.
|
35 |
+
|
36 |
+
### How to use
|
37 |
+
|
38 |
+
Here is how to use this model:
|
39 |
+
|
40 |
+
```python
|
41 |
+
from transformers import RobertaTokenizer, T5ForConditionalGeneration
|
42 |
+
|
43 |
+
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
|
44 |
+
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
|
45 |
+
|
46 |
+
text = "def greet(user): print(f'hello <extra_id_0>!')"
|
47 |
+
input_ids = tokenizer(text, return_tensors="pt").input_ids
|
48 |
+
|
49 |
+
# simply generate a single sequence
|
50 |
+
generated_ids = model.generate(input_ids, max_length=8)
|
51 |
+
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
|
52 |
+
# this prints "{user.username}"
|
53 |
+
```
|
54 |
+
|
55 |
+
## Training data
|
56 |
+
|
57 |
+
The CodeT5 model was pretrained on CodeSearchNet [Husain et al., 2019](https://arxiv.org/abs/1909.09436). Additionally, the authors collected two datasets of C and C# from [BigQuery](https://console.cloud.google.com/marketplace/details/github/github-repos) so that the programming language of every downstream task also appears in the pre-training data. In total, around 8.35 million instances were used for pretraining.
|
58 |
+
|
59 |
+
## Training procedure
|
60 |
+
|
61 |
+
### Preprocessing
|
62 |
+
|
63 |
+
This model uses a code-specific BPE (Byte-Pair Encoding) tokenizer trained using the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) library. Text (or code) can be prepared for the model with `RobertaTokenizer`, using the files from this repository.
|
64 |
+
|
65 |
+
## Evaluation results
|
66 |
+
|
67 |
+
For evaluation results on several downstream benchmarks, see the paper.
|
68 |
+
|
69 |
+
### BibTeX entry and citation info
|
70 |
+
|
71 |
+
```bibtex
|
72 |
+
@misc{wang2021codet5,
|
73 |
+
title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
|
74 |
+
author={Yue Wang and Weishi Wang and Shafiq Joty and Steven C. H. Hoi},
|
75 |
+
year={2021},
|
76 |
+
eprint={2109.00859},
|
77 |
+
archivePrefix={arXiv},
|
78 |
+
primaryClass={cs.CL}
|
79 |
+
}
|
80 |
+
```
|
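The README above describes masked span prediction with sentinel tokens such as `<extra_id_0>`. As a rough illustration of how an input/target pair in that style might be assembled, here is a minimal sketch. It assumes the T5-style convention (one sentinel per masked span, the target listing each sentinel followed by the hidden tokens) and omits the closing sentinel that T5 targets usually carry. `mask_spans` is a hypothetical helper, not part of the CodeT5 code base, and whitespace tokenization stands in for the real BPE tokenizer.

```python
def mask_spans(tokens, spans):
    """Replace each (start, end) token span with a sentinel.

    Spans are half-open, sorted, and non-overlapping.
    Returns the masked input string and the matching target string.
    """
    masked, target, cursor = [], [], 0
    for index, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{index}>"
        masked.extend(tokens[cursor:start])  # keep the tokens before the span
        masked.append(sentinel)              # hide the span behind a sentinel
        target.append(sentinel)
        target.extend(tokens[start:end])     # the target reveals the hidden tokens
        cursor = end
    masked.extend(tokens[cursor:])           # keep the tail after the last span
    return " ".join(masked), " ".join(target)

tokens = "def greet ( user ) : print ( 'hello' , user )".split()
masked_input, target = mask_spans(tokens, [(1, 2), (10, 11)])
print(masked_input)
print(target)
```

Here the function name and the last `user` argument are hidden, so the target pairs `<extra_id_0>` with `greet` and `<extra_id_1>` with `user`.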
added_tokens.json
ADDED
@@ -0,0 +1 @@
|
|
1 |
+
{}
|
config.json
ADDED
@@ -0,0 +1,65 @@
|
|
1 |
+
{
|
2 |
+
"_name_or_path": "/content/drive/MyDrive/CodeT5/pretrained_models/codet5_base",
|
3 |
+
"architectures": [
|
4 |
+
"T5ForConditionalGeneration"
|
5 |
+
],
|
6 |
+
"bos_token_id": 1,
|
7 |
+
"d_ff": 3072,
|
8 |
+
"d_kv": 64,
|
9 |
+
"d_model": 768,
|
10 |
+
"decoder_start_token_id": 0,
|
11 |
+
"dropout_rate": 0.1,
|
12 |
+
"eos_token_id": 2,
|
13 |
+
"feed_forward_proj": "relu",
|
14 |
+
"gradient_checkpointing": false,
|
15 |
+
"id2label": {
|
16 |
+
"0": "LABEL_0"
|
17 |
+
},
|
18 |
+
"initializer_factor": 1.0,
|
19 |
+
"is_encoder_decoder": true,
|
20 |
+
"label2id": {
|
21 |
+
"LABEL_0": 0
|
22 |
+
},
|
23 |
+
"layer_norm_epsilon": 1e-06,
|
24 |
+
"model_type": "t5",
|
25 |
+
"n_positions": 512,
|
26 |
+
"num_decoder_layers": 12,
|
27 |
+
"num_heads": 12,
|
28 |
+
"num_layers": 12,
|
29 |
+
"output_past": true,
|
30 |
+
"pad_token_id": 0,
|
31 |
+
"relative_attention_num_buckets": 32,
|
32 |
+
"task_specific_params": {
|
33 |
+
"summarization": {
|
34 |
+
"early_stopping": true,
|
35 |
+
"length_penalty": 2.0,
|
36 |
+
"max_length": 200,
|
37 |
+
"min_length": 30,
|
38 |
+
"no_repeat_ngram_size": 3,
|
39 |
+
"num_beams": 4,
|
40 |
+
"prefix": "summarize: "
|
41 |
+
},
|
42 |
+
"translation_en_to_de": {
|
43 |
+
"early_stopping": true,
|
44 |
+
"max_length": 300,
|
45 |
+
"num_beams": 4,
|
46 |
+
"prefix": "translate English to German: "
|
47 |
+
},
|
48 |
+
"translation_en_to_fr": {
|
49 |
+
"early_stopping": true,
|
50 |
+
"max_length": 300,
|
51 |
+
"num_beams": 4,
|
52 |
+
"prefix": "translate English to French: "
|
53 |
+
},
|
54 |
+
"translation_en_to_ro": {
|
55 |
+
"early_stopping": true,
|
56 |
+
"max_length": 300,
|
57 |
+
"num_beams": 4,
|
58 |
+
"prefix": "translate English to Romanian: "
|
59 |
+
}
|
60 |
+
},
|
61 |
+
"torch_dtype": "float32",
|
62 |
+
"transformers_version": "4.10.2",
|
63 |
+
"use_cache": true,
|
64 |
+
"vocab_size": 32100
|
65 |
+
}
|
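The hyperparameters in `config.json` are enough for a back-of-the-envelope parameter count. The sketch below is an estimate, not a measured value: it assumes the standard T5 layout as commonly implemented (bias-free linear layers, scale-only layer norms, input and output embeddings tied, and one relative-position bias table per stack). None of these assumptions is stated in the config itself.

```python
import json

# Hyperparameters copied from the config.json added in this commit.
config = json.loads("""
{"d_model": 768, "d_kv": 64, "d_ff": 3072, "num_heads": 12,
 "num_layers": 12, "num_decoder_layers": 12,
 "relative_attention_num_buckets": 32, "vocab_size": 32100}
""")

d, ff = config["d_model"], config["d_ff"]
inner = config["num_heads"] * config["d_kv"]  # width of the concatenated heads

attn = 4 * d * inner                # q, k, v, o projections (no biases assumed)
mlp = 2 * d * ff                    # two feed-forward projections (ReLU variant)
enc_layer = attn + mlp + 2 * d      # self-attention, MLP, two layer norms
dec_layer = 2 * attn + mlp + 3 * d  # self- and cross-attention, MLP, three layer norms
rel_bias = config["relative_attention_num_buckets"] * config["num_heads"]

estimated_params = (
    config["vocab_size"] * d                    # shared (tied) embedding matrix
    + config["num_layers"] * enc_layer
    + config["num_decoder_layers"] * dec_layer
    + 2 * rel_bias                              # one bias table per stack (assumed)
    + 2 * d                                     # final layer norm of each stack
)
print(inner, estimated_params)
```

Under these assumptions the head width equals `d_model` (12 × 64 = 768) and the estimate comes to 222,882,048 parameters, close to the 891,681,429-byte checkpoint divided by four bytes per `float32` value (about 222.9 million).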
merges.txt
ADDED
The diff for this file is too large to render.
See raw diff
|
|
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:a821f8612a1ab4affa449798d0173bde91a931fd613739c51aac7c9360d040f1
|
3 |
+
size 891681429
|
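The three lines added for `pytorch_model.bin` are a Git LFS pointer, not the weights themselves: each line is a key followed by a space and a value. A minimal sketch of reading such a pointer follows. `parse_lfs_pointer` is a hypothetical helper name, and the parameter bound assumes every stored value is a 4-byte `float32`, as `torch_dtype` in `config.json` suggests, with serialization overhead ignored.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:a821f8612a1ab4affa449798d0173bde91a931fd613739c51aac7c9360d040f1
size 891681429
"""

def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }

pointer = parse_lfs_pointer(POINTER)
size_mib = pointer["size"] / 2**20
# Upper bound on the number of float32 values: 4 bytes each, overhead ignored.
max_float32_values = pointer["size"] // 4
print(f"{size_mib:.1f} MiB, at most {max_float32_values:,} float32 values")
```

The digest can be compared against `hashlib.sha256` of a downloaded file to confirm that the download matches the pointer.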
special_tokens_map.json
ADDED
@@ -0,0 +1,147 @@
|
|
1 |
+
{
|
2 |
+
"bos_token": {
|
3 |
+
"content": "<s>",
|
4 |
+
"single_word": false,
|
5 |
+
"lstrip": false,
|
6 |
+
"rstrip": false,
|
7 |
+
"normalized": true
|
8 |
+
},
|
9 |
+
"eos_token": {
|
10 |
+
"content": "</s>",
|
11 |
+
"single_word": false,
|
12 |
+
"lstrip": false,
|
13 |
+
"rstrip": false,
|
14 |
+
"normalized": true
|
15 |
+
},
|
16 |
+
"unk_token": {
|
17 |
+
"content": "<unk>",
|
18 |
+
"single_word": false,
|
19 |
+
"lstrip": false,
|
20 |
+
"rstrip": false,
|
21 |
+
"normalized": true
|
22 |
+
},
|
23 |
+
"sep_token": {
|
24 |
+
"content": "</s>",
|
25 |
+
"single_word": false,
|
26 |
+
"lstrip": false,
|
27 |
+
"rstrip": false,
|
28 |
+
"normalized": true
|
29 |
+
},
|
30 |
+
"pad_token": {
|
31 |
+
"content": "<pad>",
|
32 |
+
"single_word": false,
|
33 |
+
"lstrip": false,
|
34 |
+
"rstrip": false,
|
35 |
+
"normalized": true
|
36 |
+
},
|
37 |
+
"cls_token": {
|
38 |
+
"content": "<s>",
|
39 |
+
"single_word": false,
|
40 |
+
"lstrip": false,
|
41 |
+
"rstrip": false,
|
42 |
+
"normalized": true
|
43 |
+
},
|
44 |
+
"mask_token": { "content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
45 |
+
"additional_special_tokens": [
|
46 |
+
{ "content":"<extra_id_99>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
47 |
+
{ "content":"<extra_id_98>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
48 |
+
{ "content":"<extra_id_97>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
49 |
+
{ "content":"<extra_id_96>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
50 |
+
{ "content":"<extra_id_95>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
51 |
+
{ "content":"<extra_id_94>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
52 |
+
{ "content":"<extra_id_93>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
53 |
+
{ "content":"<extra_id_92>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
54 |
+
{ "content":"<extra_id_91>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
55 |
+
{ "content":"<extra_id_90>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
56 |
+
{ "content":"<extra_id_89>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
57 |
+
{ "content":"<extra_id_88>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
58 |
+
{ "content":"<extra_id_87>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
59 |
+
{ "content":"<extra_id_86>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
60 |
+
{ "content":"<extra_id_85>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
61 |
+
{ "content":"<extra_id_84>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
62 |
+
{ "content":"<extra_id_83>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
63 |
+
{ "content":"<extra_id_82>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
64 |
+
{ "content":"<extra_id_81>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
65 |
+
{ "content":"<extra_id_80>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
66 |
+
{ "content":"<extra_id_79>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
67 |
+
{ "content":"<extra_id_78>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
68 |
+
{ "content":"<extra_id_77>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
69 |
+
{ "content":"<extra_id_76>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
70 |
+
{ "content":"<extra_id_75>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
71 |
+
{ "content":"<extra_id_74>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
72 |
+
{ "content":"<extra_id_73>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
73 |
+
{ "content":"<extra_id_72>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
74 |
+
{ "content":"<extra_id_71>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
75 |
+
{ "content":"<extra_id_70>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
76 |
+
{ "content":"<extra_id_69>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
77 |
+
{ "content":"<extra_id_68>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
78 |
+
{ "content":"<extra_id_67>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
79 |
+
{ "content":"<extra_id_66>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
80 |
+
{ "content":"<extra_id_65>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
81 |
+
{ "content":"<extra_id_64>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
82 |
+
{ "content":"<extra_id_63>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
83 |
+
{ "content":"<extra_id_62>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
84 |
+
{ "content":"<extra_id_61>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
85 |
+
{ "content":"<extra_id_60>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
86 |
+
{ "content":"<extra_id_59>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
87 |
+
{ "content":"<extra_id_58>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
88 |
+
{ "content":"<extra_id_57>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
89 |
+
{ "content":"<extra_id_56>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
90 |
+
{ "content":"<extra_id_55>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
91 |
+
{ "content":"<extra_id_54>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
92 |
+
{ "content":"<extra_id_53>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
93 |
+
{ "content":"<extra_id_52>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
94 |
+
{ "content":"<extra_id_51>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
95 |
+
{ "content":"<extra_id_50>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
96 |
+
{ "content":"<extra_id_49>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
97 |
+
{ "content":"<extra_id_48>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
98 |
+
{ "content":"<extra_id_47>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
99 |
+
{ "content":"<extra_id_46>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
100 |
+
{ "content":"<extra_id_45>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
101 |
+
{ "content":"<extra_id_44>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
102 |
+
{ "content":"<extra_id_43>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
103 |
+
{ "content":"<extra_id_42>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
104 |
+
{ "content":"<extra_id_41>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
105 |
+
{ "content":"<extra_id_40>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
106 |
+
{ "content":"<extra_id_39>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
107 |
+
{ "content":"<extra_id_38>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
108 |
+
{ "content":"<extra_id_37>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
109 |
+
{ "content":"<extra_id_36>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
110 |
+
{ "content":"<extra_id_35>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
111 |
+
{ "content":"<extra_id_34>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
112 |
+
{ "content":"<extra_id_33>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
113 |
+
{ "content":"<extra_id_32>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
114 |
+
{ "content":"<extra_id_31>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
115 |
+
{ "content":"<extra_id_30>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
116 |
+
{ "content":"<extra_id_29>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
117 |
+
{ "content":"<extra_id_28>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
118 |
+
{ "content":"<extra_id_27>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
119 |
+
{ "content":"<extra_id_26>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
120 |
+
{ "content":"<extra_id_25>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
121 |
+
{ "content":"<extra_id_24>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
122 |
+
{ "content":"<extra_id_23>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
123 |
+
{ "content":"<extra_id_22>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
124 |
+
{ "content":"<extra_id_21>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
125 |
+
{ "content":"<extra_id_20>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
126 |
+
{ "content":"<extra_id_19>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
127 |
+
{ "content":"<extra_id_18>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
128 |
+
{ "content":"<extra_id_17>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
129 |
+
{ "content":"<extra_id_16>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
130 |
+
{ "content":"<extra_id_15>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
131 |
+
{ "content":"<extra_id_14>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
132 |
+
{ "content":"<extra_id_13>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
133 |
+
{ "content":"<extra_id_12>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
134 |
+
{ "content":"<extra_id_11>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
135 |
+
{ "content":"<extra_id_10>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
136 |
+
{ "content":"<extra_id_9>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
137 |
+
{ "content":"<extra_id_8>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
138 |
+
{ "content":"<extra_id_7>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
139 |
+
{ "content":"<extra_id_6>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
140 |
+
{ "content":"<extra_id_5>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
141 |
+
{ "content":"<extra_id_4>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
142 |
+
{ "content":"<extra_id_3>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
143 |
+
{ "content":"<extra_id_2>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
144 |
+
{ "content":"<extra_id_1>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
|
145 |
+
{ "content":"<extra_id_0>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true }
|
146 |
+
]
|
147 |
+
}
|
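The hundred `additional_special_tokens` entries differ only in their index, so the list can be regenerated rather than maintained by hand. The sketch below rebuilds it in the same descending order and with the same flags as the file above. `sentinel_entry` is a hypothetical helper, and nothing here is claimed to be how the original file was produced.

```python
import json

def sentinel_entry(index):
    """Build one entry in the shape used by additional_special_tokens."""
    return {
        "content": f"<extra_id_{index}>",
        "single_word": False,
        "lstrip": True,
        "rstrip": False,
        "normalized": True,
    }

# The file lists the sentinels in descending order, from 99 down to 0.
additional_special_tokens = [sentinel_entry(i) for i in range(99, -1, -1)]

print(len(additional_special_tokens))
print(json.dumps(additional_special_tokens[0]))
```

Comparing this list with the parsed JSON from the repository is a quick way to confirm that no sentinel is missing or duplicated.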
tokenizer_config.json
ADDED
@@ -0,0 +1,62 @@
|
|
1 |
+
{
|
2 |
+
"errors": "replace",
|
3 |
+
"unk_token": {
|
4 |
+
"content": "<unk>",
|
5 |
+
"single_word": false,
|
6 |
+
"lstrip": false,
|
7 |
+
"rstrip": false,
|
8 |
+
"normalized": true,
|
9 |
+
"__type": "AddedToken"
|
10 |
+
},
|
11 |
+
"bos_token": {
|
12 |
+
"content": "<s>",
|
13 |
+
"single_word": false,
|
14 |
+
"lstrip": false,
|
15 |
+
"rstrip": false,
|
16 |
+
"normalized": true,
|
17 |
+
"__type": "AddedToken"
|
18 |
+
},
|
19 |
+
"eos_token": {
|
20 |
+
"content": "</s>",
|
21 |
+
"single_word": false,
|
22 |
+
"lstrip": false,
|
23 |
+
"rstrip": false,
|
24 |
+
"normalized": true,
|
25 |
+
"__type": "AddedToken"
|
26 |
+
},
|
27 |
+
"add_prefix_space": false,
|
28 |
+
"sep_token": {
|
29 |
+
"content": "</s>",
|
30 |
+
"single_word": false,
|
31 |
+
"lstrip": false,
|
32 |
+
"rstrip": false,
|
33 |
+
"normalized": true,
|
34 |
+
"__type": "AddedToken"
|
35 |
+
},
|
36 |
+
"cls_token": {
|
37 |
+
"content": "<s>",
|
38 |
+
"single_word": false,
|
39 |
+
"lstrip": false,
|
40 |
+
"rstrip": false,
|
41 |
+
"normalized": true,
|
42 |
+
"__type": "AddedToken"
|
43 |
+
},
|
44 |
+
"pad_token": {
|
45 |
+
"content": "<pad>",
|
46 |
+
"single_word": false,
|
47 |
+
"lstrip": false,
|
48 |
+
"rstrip": false,
|
49 |
+
"normalized": true,
|
50 |
+
"__type": "AddedToken"
|
51 |
+
},
|
52 |
+
"mask_token": {
|
53 |
+
"content": "<mask>",
|
54 |
+
"single_word": false,
|
55 |
+
"lstrip": true,
|
56 |
+
"rstrip": false,
|
57 |
+
"normalized": true,
|
58 |
+
"__type": "AddedToken"
|
59 |
+
},
|
60 |
+
"model_max_length": 512,
|
61 |
+
"tokenizer_class": "RobertaTokenizer"
|
62 |
+
}
|
vocab.json
ADDED
The diff for this file is too large to render.
See raw diff
|
|