Add files
- README.md +192 -0
- added_tokens.json +1 -0
- config.gin +150 -0
- config.json +32 -0
- flax_model.msgpack +3 -0
- model-info.txt +0 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +107 -0
- spiece.model +3 -0
- spiece.vocab +0 -0
- tokenizer_config.json +113 -0
- train/events.out.tfevents.1673306830.t1v-n-a765f9c4-w-0.2573516.0.v2 +3 -0
- train/events.out.tfevents.1673803251.t1v-n-a765f9c4-w-0.20401.0.v2 +3 -0
- training_eval/mc4_en_nl_ul2_denoising/events.out.tfevents.1673306830.t1v-n-a765f9c4-w-0.2573516.1.v2 +3 -0
- training_eval/mc4_en_nl_ul2_denoising/events.out.tfevents.1673803251.t1v-n-a765f9c4-w-0.20401.1.v2 +3 -0
- training_eval/ul2_en_nl_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1673306830.t1v-n-a765f9c4-w-0.2573516.2.v2 +3 -0
- training_eval/ul2_en_nl_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1673803251.t1v-n-a765f9c4-w-0.20401.2.v2 +3 -0
README.md
ADDED
@@ -0,0 +1,192 @@
1 |
+
|
2 |
+
---
|
3 |
+
language:
|
4 |
+
- nl
|
5 |
+
- en
|
6 |
+
- multilingual
|
7 |
+
license: apache-2.0
|
8 |
+
tags:
|
9 |
+
- dutch
|
10 |
+
- english
|
11 |
+
- t5
|
12 |
+
- t5x
|
13 |
+
- ul2
|
14 |
+
- seq2seq
|
15 |
+
datasets:
|
16 |
+
- yhavinga/mc4_nl_cleaned
|
17 |
+
- yhavinga/nedd_wiki_news
|
18 |
+
inference: false
|
19 |
+
---
|
20 |
+
|
21 |
+
# ul2-base-dutch-english for Dutch and English
|
22 |
+
|
23 |
+
Pretrained T5 model on Dutch and English using a UL2 (Mixture-of-Denoisers) objective.
|
24 |
+
The T5 model was introduced in
|
25 |
+
[this paper](https://arxiv.org/abs/1910.10683)
|
26 |
+
and first released at [this page](https://github.com/google-research/text-to-text-transfer-transformer).
|
27 |
+
The UL2 objective was introduced in
|
28 |
+
[this paper](https://arxiv.org/abs/2205.05131)
|
29 |
+
and first released at [this page](https://github.com/google-research/google-research/tree/master/ul2).
|
30 |
+
|
31 |
+
**Note:** The Hugging Face inference widget is deactivated because this model needs text-to-text fine-tuning on
|
32 |
+
a specific downstream task to be useful in practice.
|
33 |
+
|
34 |
+
## Model description
|
35 |
+
|
36 |
+
T5 is an encoder-decoder model and treats all NLP problems in a text-to-text format.
|
37 |
+
`ul2-base-dutch-english` T5 is a transformers model pretrained on a very large corpus of
|
38 |
+
Dutch and English data in a self-supervised fashion.
|
39 |
+
This means it was pretrained on the raw texts only, with no humans labelling them in any way
|
40 |
+
(which is why it can use lots of publicly available data) with an automatic process to generate
|
41 |
+
inputs and outputs from those texts.
|
42 |
+
|
43 |
+
|
44 |
+
This model was pretrained with the [T5 v1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) improvements over the original T5 model:
|
45 |
+
- GEGLU activation in the feed-forward hidden layer, rather than ReLU - see [here](https://arxiv.org/abs/2002.05202)
|
46 |
+
- Dropout was turned off during pre-training. Dropout should be re-enabled during fine-tuning
|
47 |
+
- Pre-trained on a self-supervised objective only, without mixing in downstream tasks
|
48 |
+
- No parameter sharing between the embedding and classifier layers
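
The GEGLU feed-forward layer named in the list above can be sketched in plain Python. This is an illustration only, not the t5x or transformers implementation: the helper names (`gelu_new`, `matvec`, `geglu_ffn`) and the tiny weight matrices are hypothetical, and the tanh approximation of GELU is assumed because `config.json` lists `gelu_new` as the activation.

```python
import math


def gelu_new(x: float) -> float:
    """Tanh approximation of GELU (assumed to match the `gelu_new` activation)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def matvec(vec, mat):
    """Multiply a row vector (length n) by an n x m matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def geglu_ffn(x, w_gate, w_linear, w_out):
    """GEGLU feed-forward: (GELU(x W_gate) * (x W_linear)) W_out, without biases."""
    gate = [gelu_new(h) for h in matvec(x, w_gate)]    # non-linear gate branch
    linear = matvec(x, w_linear)                       # linear branch
    hidden = [g * l for g, l in zip(gate, linear)]     # element-wise product
    return matvec(hidden, w_out)                       # project back to model width


# Toy usage with 2x2 identity matrices (hypothetical weights, for illustration only).
identity = [[1.0, 0.0], [0.0, 1.0]]
output = geglu_ffn([1.0, 2.0], identity, identity, identity)
```

With identity weights the output reduces to `gelu_new(x_i) * x_i` per element, which makes the gating easy to check by hand.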
|
49 |
+
|
50 |
+
|
51 |
+
|
52 |
+
### UL2 pretraining objective
|
53 |
+
|
54 |
+
This model was pretrained with UL2's Mixture-of-Denoisers (MoD) objective, which combines diverse pre-training
|
55 |
+
paradigms together. UL2 frames different objective functions for training language models as denoising tasks, where
|
56 |
+
the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers
|
57 |
+
that samples from a varied set of such objectives, each with different configurations. UL2 is trained using a mixture of
|
58 |
+
three denoising tasks:
|
59 |
+
|
60 |
+
1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
|
61 |
+
2. X-denoising (or extreme span corruption); and
|
62 |
+
3. S-denoising (or sequential PrefixLM).
|
63 |
+
|
64 |
+
During pre-training, we sample from the available denoising tasks based on user-specified ratios.
|
65 |
+
UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with a specific pre-training
|
66 |
+
denoising task. During pre-training, a paradigm token is inserted into the input
|
67 |
+
(`[NLU]` for R-denoising, `[NLG]` for X-denoising, or `[S2S]` for S-denoising) indicating the denoising task at hand.
|
68 |
+
Then, during fine-tuning the same input token should be inserted to get the best performance for different downstream
|
69 |
+
fine-tuning tasks.
|
70 |
+
|
71 |
+
## Intended uses & limitations
|
72 |
+
|
73 |
+
This model was only pretrained in a self-supervised way excluding any supervised training.
|
74 |
+
Therefore, this model has to be fine-tuned before it is usable on a downstream task,
|
75 |
+
like text classification, unlike Google's original T5 model.
|
76 |
+
|
77 |
+
**Note:** You most likely need to fine-tune these T5/UL2 models without mixed precision
|
78 |
+
so fine-tune them with full fp32 precision. Fine-tuning with Flax in bf16 - `model.to_bf16()` - is possible
|
79 |
+
if you set the mask correctly to exclude layernorm and embedding layers. Also note that the T5x pre-training
|
80 |
+
and fine-tuning configs set `z_loss` to 1e-4, which is used to keep the loss scale from underflowing.
|
81 |
+
You can also find more fine-tuning tips from [here](https://discuss.huggingface.co/t/t5-finetuning-tips), for example.
|
82 |
+
|
83 |
+
**Note**: For fine-tuning, most likely you can get better results if you insert a prefix token
|
84 |
+
of `[NLU]`, `[NLG]`, or `[S2S]` to your input texts.
|
85 |
+
For general language understanding fine-tuning tasks, you could use the `[NLU]` token.
|
86 |
+
For GPT-style causal language generation, you could use the `[S2S]` token.
|
87 |
+
The token `[NLG]` of the X-denoising pretrain task is somewhat of a mix between language understanding and causal language
|
88 |
+
generation so the token `[NLG]` could maybe be used for language generation fine-tuning too.
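
The prefix-token advice above can be sketched as a small preprocessing helper. This is a minimal sketch under assumptions: the function name `add_paradigm_prefix`, the mode keys, and the single space between the token and the text are all hypothetical choices, not something the model card prescribes.

```python
# Paradigm tokens as listed in this model card; the dictionary keys are hypothetical.
PARADIGM_TOKENS = {
    "nlu": "[NLU]",  # R-denoising: general language understanding tasks
    "nlg": "[NLG]",  # X-denoising: mix of understanding and generation
    "s2s": "[S2S]",  # S-denoising: GPT-style causal generation
}


def add_paradigm_prefix(text: str, mode: str) -> str:
    """Prepend the paradigm token for `mode` to an input text.

    The separating space is an assumption made for this sketch.
    """
    token = PARADIGM_TOKENS[mode.lower()]
    return f"{token} {text}"


# Example: prepare a classification-style input for fine-tuning.
example = add_paradigm_prefix("Dit is een voorbeeldzin.", "nlu")
```

Applying the helper to every training and inference input keeps the prefix consistent between fine-tuning and later use.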
|
89 |
+
|
90 |
+
### How to use
|
91 |
+
|
92 |
+
Here is how to use this model in PyTorch:
|
93 |
+
|
94 |
+
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("yhavinga/ul2-base-dutch-english", use_fast=False)
model = T5ForConditionalGeneration.from_pretrained("yhavinga/ul2-base-dutch-english")
```
|
100 |
+
|
101 |
+
and in Flax:
|
102 |
+
|
103 |
+
```python
from transformers import T5Tokenizer, FlaxT5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("yhavinga/ul2-base-dutch-english", use_fast=False)
model = FlaxT5ForConditionalGeneration.from_pretrained("yhavinga/ul2-base-dutch-english")
```
|
109 |
+
|
110 |
+
|
111 |
+
### Limitations and bias
|
112 |
+
|
113 |
+
The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral.
|
114 |
+
Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.
|
115 |
+
|
116 |
+
## Training data
|
117 |
+
|
118 |
+
The `ul2-base-dutch-english` T5 model was pre-trained simultaneously on a combination of several datasets,
|
119 |
+
including the `full_en_nl` config of the "mc4_nl_cleaned" dataset, which is a cleaned version of Common Crawl's web
|
120 |
+
crawl corpus, Dutch books, the Dutch subset of Wikipedia (2022-03-20), the English subset of Wikipedia (2022-03-01),
|
121 |
+
and a subset of "mc4_nl_cleaned"
|
122 |
+
containing only texts from Dutch and Belgian newspapers. This last dataset is oversampled to bias the model
|
123 |
+
towards descriptions of events in the Netherlands and Belgium.
|
124 |
+
|
125 |
+
|
126 |
+
|
127 |
+
## Training procedure
|
128 |
+
|
129 |
+
### Preprocessing
|
130 |
+
|
131 |
+
The ul2-base-dutch-english T5 model uses a SentencePiece unigram tokenizer with a vocabulary of 32,000 tokens.
|
132 |
+
The tokenizer includes the special tokens `<pad>`, `</s>`, `<unk>`, known from the original T5 paper,
|
133 |
+
`[NLU]`, `[NLG]` and `[S2S]` for the MoD pre-training, and `<n>` for newline.
|
134 |
+
During pre-training with the UL2 objective, input and output sequences consist of 512 consecutive tokens.
|
135 |
+
The tokenizer does not lowercase texts and is therefore case-sensitive; it distinguishes
|
136 |
+
between `dutch` and `Dutch`.
|
137 |
+
Additionally, 100+28 extra tokens were added for pre-training tasks, resulting in a total of 32,128 tokens.
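
The vocabulary arithmetic above can be checked with a few lines of Python. This is a sketch for illustration: the variable names are hypothetical, and the index layout for the `[new_id_N]` tokens is taken from the `added_tokens.json` file in this commit, where `[new_id_0]` maps to 32100 and `[new_id_27]` maps to 32127.

```python
BASE_VOCAB = 32_000  # SentencePiece unigram vocabulary size
EXTRA_IDS = 100      # <extra_id_0> ... <extra_id_99>
NEW_IDS = 28         # [new_id_0] ... [new_id_27]

# Total embedding size, matching `vocab_size` in config.json.
total_vocab = BASE_VOCAB + EXTRA_IDS + NEW_IDS

# Indices of the [new_id_N] tokens, following added_tokens.json.
new_id_to_index = {f"[new_id_{i}]": BASE_VOCAB + EXTRA_IDS + i for i in range(NEW_IDS)}
```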
|
138 |
+
|
139 |
+
### Pretraining
|
140 |
+
The model was trained on a TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/),
|
141 |
+
for 1000000 steps with a batch size of 128
|
142 |
+
(in total 65 B tokens).
|
143 |
+
The optimizer used was AdaFactor with learning rate warmup for 10K steps with a constant learning rate of 1e-2,
|
144 |
+
and then an inverse square root decay of the learning rate after.
|
145 |
+
The model was trained with Google's Jax/Flax based [t5x framework](https://github.com/google-research/t5x) with help
|
146 |
+
from [Stephenn Fernandes](https://huggingface.co/StephennFernandes) to get started writing task definitions that wrap
|
147 |
+
HF datasets.
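
The token budget and learning-rate schedule described above can be sketched as follows. This is a hedged reconstruction, not the t5x source: it assumes that the `'constant * rsqrt_decay'` factors in `config.gin` (with `base_learning_rate = 1.0` and `warmup_steps = 10000`) evaluate to `base_lr / sqrt(max(step, warmup_steps))`, and the function and variable names are hypothetical.

```python
import math

TRAIN_STEPS = 1_000_000
BATCH_SIZE = 128
SEQ_LEN = 512  # input length from TASK_FEATURE_LENGTHS in config.gin

# Input tokens seen during pre-training, assuming fully packed sequences.
total_tokens = TRAIN_STEPS * BATCH_SIZE * SEQ_LEN


def learning_rate(step: int, base_lr: float = 1.0, warmup_steps: int = 10_000) -> float:
    """Constant during warmup, then inverse square root decay (assumed formula)."""
    return base_lr / math.sqrt(max(step, warmup_steps))


lr_start = learning_rate(0)          # during warmup
lr_end = learning_rate(TRAIN_STEPS)  # at the final step
```

Under this assumption the rate stays at 1e-2 for the first 10K steps and falls to 1e-3 by step 1,000,000, and the token count comes to about 65.5 billion, in line with the figures quoted above.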
|
148 |
+
|
149 |
+
The UL2 training objective code used with the [t5x framework](https://github.com/google-research/t5x) was copied and
|
150 |
+
slightly modified from the [UL2 paper](https://arxiv.org/pdf/2205.05131.pdf) appendix chapter 9.2 by the authors
|
151 |
+
of the Finnish ul2 models. The UL2 objective code used is available in the repository
|
152 |
+
[Finnish-NLP/ul2-base-nl36-finnish](https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish) in the files `ul2_objective.py` and `tasks.py`.
|
153 |
+
UL2's mixture-of-denoisers configuration was otherwise equal to the UL2 paper
|
154 |
+
but for the rate of mixing denoisers, 20% for S-denoising was used (as suggested in chapter 4.5 of the paper)
|
155 |
+
and the rest was divided equally between the R-denoising and X-denoising (i.e. 40% for both).
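
The mixing rates above can be illustrated with a small sampler. This is a simplified sketch, not the `ul2_objective.py` code: the real objective mixes several denoiser configurations per category, whereas this example only draws a category (R, X, or S) with the stated 40/40/20 split. The names `DENOISER_RATES` and `sample_denoiser` are hypothetical.

```python
import random

# Category-level mixing rates stated in this model card.
DENOISER_RATES = {"R": 0.4, "X": 0.4, "S": 0.2}


def sample_denoiser(rng: random.Random) -> str:
    """Draw one denoiser category according to the mixing rates."""
    names = list(DENOISER_RATES)
    weights = list(DENOISER_RATES.values())
    return rng.choices(names, weights=weights, k=1)[0]


# Draw a batch of categories with a fixed seed for reproducibility.
rng = random.Random(0)
draws = [sample_denoiser(rng) for _ in range(10_000)]
share_s = draws.count("S") / len(draws)  # should be close to 0.2
```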
|
156 |
+
### Model list
|
157 |
+
|
158 |
+
Models in this series:
|
159 |
+
|
160 |
+
| | ul2-base-dutch-english | ul2-large-dutch-english | ul2-small-dutch-english |
|:---------------------|:-------------------------|:--------------------------|:--------------------------|
| model_type | t5 | t5 | t5 |
| _pipeline_tag | text2text-generation | text2text-generation | text2text-generation |
| d_model | 768 | 1024 | 512 |
| d_ff | 2048 | 2816 | 1024 |
| num_heads | 12 | 16 | 6 |
| d_kv | 64 | 64 | 64 |
| num_layers | 12 | 24 | 8 |
| num_decoder_layers | 12 | 24 | 8 |
| feed_forward_proj | gated-gelu | gated-gelu | gated-gelu |
| dense_act_fn | gelu_new | gelu_new | gelu_new |
| vocab_size | 32128 | 32128 | 32128 |
| tie_word_embeddings | 0 | 0 | 0 |
| torch_dtype | float32 | float32 | float32 |
| _gin_batch_size | 128 | 64 | 128 |
| _gin_z_loss | 0.0001 | 0.0001 | 0.0001 |
| _gin_t5_config_dtype | 'bfloat16' | 'bfloat16' | 'bfloat16' |
|
178 |
+
|
179 |
+
|
180 |
+
## Evaluation results
|
181 |
+
|
182 |
+
See the evaluation section in the interactive [Pre-training Dutch T5 Models](https://huggingface.co/spaces/yhavinga/pre-training-dutch-t5-models) blog.
|
183 |
+
|
184 |
+
## Acknowledgements
|
185 |
+
|
186 |
+
This project would not have been possible without compute generously provided by Google through the
|
187 |
+
[TPU Research Cloud](https://sites.research.google/trc/).
|
188 |
+
Thanks to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for releasing their code for the UL2 objective and associated task definitions.
|
189 |
+
Thanks to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for helping me get started with the t5x framework.
|
190 |
+
|
191 |
+
Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
|
192 |
+
|
added_tokens.json
ADDED
@@ -0,0 +1 @@
1 |
+
{"[new_id_17]": 32117, "[new_id_20]": 32120, "[new_id_13]": 32113, "[new_id_2]": 32102, "[new_id_16]": 32116, "[new_id_7]": 32107, "[new_id_5]": 32105, "[new_id_1]": 32101, "[new_id_15]": 32115, "[new_id_12]": 32112, "[new_id_0]": 32100, "[new_id_11]": 32111, "[new_id_25]": 32125, "[new_id_24]": 32124, "[new_id_10]": 32110, "[new_id_27]": 32127, "[new_id_23]": 32123, "[new_id_14]": 32114, "[new_id_22]": 32122, "[new_id_21]": 32121, "[new_id_19]": 32119, "[new_id_3]": 32103, "[new_id_4]": 32104, "[new_id_18]": 32118, "[new_id_9]": 32109, "[new_id_8]": 32108, "[new_id_26]": 32126, "[new_id_6]": 32106}
|
config.gin
ADDED
@@ -0,0 +1,150 @@
1 |
+
from __gin__ import dynamic_registration
|
2 |
+
import __main__ as train_script
|
3 |
+
import seqio
|
4 |
+
import t5.data.mixtures
|
5 |
+
from t5x import adafactor
|
6 |
+
from t5x.examples.t5 import network
|
7 |
+
from t5x import gin_utils
|
8 |
+
from t5x import models
|
9 |
+
from t5x import partitioning
|
10 |
+
from t5x import trainer
|
11 |
+
from t5x import utils
|
12 |
+
import tasks.nedd_tasks
|
13 |
+
import tasks.ul2_tasks as tasks2
|
14 |
+
|
15 |
+
# Macros:
|
16 |
+
# ==============================================================================
|
17 |
+
BATCH_SIZE = 128
|
18 |
+
DROPOUT_RATE = 0.0
|
19 |
+
LABEL_SMOOTHING = 0.0
|
20 |
+
LOSS_NORMALIZING_FACTOR = None
|
21 |
+
MIXTURE_OR_TASK_MODULE = None
|
22 |
+
MIXTURE_OR_TASK_NAME = 'ul2_en_nl_mc4_nedd_wiki_news_mix_1'
|
23 |
+
MODEL = @models.EncoderDecoderModel()
|
24 |
+
MODEL_DIR = 'ul2_base_en_nl_mc4_nedd_wiki_news_nl'
|
25 |
+
OPTIMIZER = @adafactor.Adafactor()
|
26 |
+
RANDOM_SEED = None
|
27 |
+
SHUFFLE_TRAIN_EXAMPLES = True
|
28 |
+
TASK_FEATURE_LENGTHS = {'inputs': 512, 'targets': 512}
|
29 |
+
TRAIN_STEPS = 1000000
|
30 |
+
USE_CACHED_TASKS = False
|
31 |
+
USE_HARDWARE_RNG = False
|
32 |
+
VOCABULARY = @seqio.SentencePieceVocabulary()
|
33 |
+
Z_LOSS = 0.0001
|
34 |
+
|
35 |
+
# Parameters for adafactor.Adafactor:
|
36 |
+
# ==============================================================================
|
37 |
+
adafactor.Adafactor.decay_rate = 0.8
|
38 |
+
adafactor.Adafactor.logical_factor_rules = \
|
39 |
+
@adafactor.standard_logical_factor_rules()
|
40 |
+
adafactor.Adafactor.step_offset = 0
|
41 |
+
|
42 |
+
# Parameters for utils.CheckpointConfig:
|
43 |
+
# ==============================================================================
|
44 |
+
utils.CheckpointConfig.restore = @utils.RestoreCheckpointConfig()
|
45 |
+
utils.CheckpointConfig.save = @utils.SaveCheckpointConfig()
|
46 |
+
|
47 |
+
# Parameters for utils.create_learning_rate_scheduler:
|
48 |
+
# ==============================================================================
|
49 |
+
utils.create_learning_rate_scheduler.base_learning_rate = 1.0
|
50 |
+
utils.create_learning_rate_scheduler.factors = 'constant * rsqrt_decay'
|
51 |
+
utils.create_learning_rate_scheduler.warmup_steps = 10000
|
52 |
+
|
53 |
+
# Parameters for train/utils.DatasetConfig:
|
54 |
+
# ==============================================================================
|
55 |
+
train/utils.DatasetConfig.batch_size = %BATCH_SIZE
|
56 |
+
train/utils.DatasetConfig.mixture_or_task_name = %MIXTURE_OR_TASK_NAME
|
57 |
+
train/utils.DatasetConfig.module = %MIXTURE_OR_TASK_MODULE
|
58 |
+
train/utils.DatasetConfig.pack = True
|
59 |
+
train/utils.DatasetConfig.seed = None
|
60 |
+
train/utils.DatasetConfig.shuffle = %SHUFFLE_TRAIN_EXAMPLES
|
61 |
+
train/utils.DatasetConfig.split = 'train'
|
62 |
+
train/utils.DatasetConfig.task_feature_lengths = %TASK_FEATURE_LENGTHS
|
63 |
+
train/utils.DatasetConfig.use_cached = %USE_CACHED_TASKS
|
64 |
+
|
65 |
+
# Parameters for train_eval/utils.DatasetConfig:
|
66 |
+
# ==============================================================================
|
67 |
+
train_eval/utils.DatasetConfig.batch_size = %BATCH_SIZE
|
68 |
+
train_eval/utils.DatasetConfig.mixture_or_task_name = %MIXTURE_OR_TASK_NAME
|
69 |
+
train_eval/utils.DatasetConfig.module = %MIXTURE_OR_TASK_MODULE
|
70 |
+
train_eval/utils.DatasetConfig.pack = True
|
71 |
+
train_eval/utils.DatasetConfig.seed = 42
|
72 |
+
train_eval/utils.DatasetConfig.shuffle = False
|
73 |
+
train_eval/utils.DatasetConfig.split = 'validation'
|
74 |
+
train_eval/utils.DatasetConfig.task_feature_lengths = %TASK_FEATURE_LENGTHS
|
75 |
+
train_eval/utils.DatasetConfig.use_cached = %USE_CACHED_TASKS
|
76 |
+
|
77 |
+
# Parameters for models.EncoderDecoderModel:
|
78 |
+
# ==============================================================================
|
79 |
+
models.EncoderDecoderModel.input_vocabulary = %VOCABULARY
|
80 |
+
models.EncoderDecoderModel.label_smoothing = %LABEL_SMOOTHING
|
81 |
+
models.EncoderDecoderModel.loss_normalizing_factor = %LOSS_NORMALIZING_FACTOR
|
82 |
+
models.EncoderDecoderModel.module = @network.Transformer()
|
83 |
+
models.EncoderDecoderModel.optimizer_def = %OPTIMIZER
|
84 |
+
models.EncoderDecoderModel.output_vocabulary = %VOCABULARY
|
85 |
+
models.EncoderDecoderModel.z_loss = %Z_LOSS
|
86 |
+
|
87 |
+
# Parameters for partitioning.PjitPartitioner:
|
88 |
+
# ==============================================================================
|
89 |
+
partitioning.PjitPartitioner.logical_axis_rules = \
|
90 |
+
@partitioning.standard_logical_axis_rules()
|
91 |
+
partitioning.PjitPartitioner.model_parallel_submesh = None
|
92 |
+
partitioning.PjitPartitioner.num_partitions = 1
|
93 |
+
|
94 |
+
# Parameters for utils.RestoreCheckpointConfig:
|
95 |
+
# ==============================================================================
|
96 |
+
utils.RestoreCheckpointConfig.path = []
|
97 |
+
|
98 |
+
# Parameters for utils.SaveCheckpointConfig:
|
99 |
+
# ==============================================================================
|
100 |
+
utils.SaveCheckpointConfig.dtype = 'float32'
|
101 |
+
utils.SaveCheckpointConfig.keep = 4
|
102 |
+
utils.SaveCheckpointConfig.period = 50000
|
103 |
+
utils.SaveCheckpointConfig.save_dataset = False
|
104 |
+
utils.SaveCheckpointConfig.use_gda = False
|
105 |
+
|
106 |
+
# Parameters for seqio.SentencePieceVocabulary:
|
107 |
+
# ==============================================================================
|
108 |
+
seqio.SentencePieceVocabulary.sentencepiece_model_file = \
|
109 |
+
'gs://t5-dutch-english/vocabs/nedd.32000.128extra/spiece.model'
|
110 |
+
|
111 |
+
# Parameters for network.T5Config:
|
112 |
+
# ==============================================================================
|
113 |
+
network.T5Config.dropout_rate = %DROPOUT_RATE
|
114 |
+
network.T5Config.dtype = 'bfloat16'
|
115 |
+
network.T5Config.emb_dim = 768
|
116 |
+
network.T5Config.head_dim = 64
|
117 |
+
network.T5Config.logits_via_embedding = False
|
118 |
+
network.T5Config.mlp_activations = ('gelu', 'linear')
|
119 |
+
network.T5Config.mlp_dim = 2048
|
120 |
+
network.T5Config.num_decoder_layers = 12
|
121 |
+
network.T5Config.num_encoder_layers = 12
|
122 |
+
network.T5Config.num_heads = 12
|
123 |
+
network.T5Config.vocab_size = 32128
|
124 |
+
|
125 |
+
# Parameters for train_script.train:
|
126 |
+
# ==============================================================================
|
127 |
+
train_script.train.checkpoint_cfg = @utils.CheckpointConfig()
|
128 |
+
train_script.train.eval_period = 2000
|
129 |
+
train_script.train.eval_steps = 20
|
130 |
+
train_script.train.infer_eval_dataset_cfg = None
|
131 |
+
train_script.train.model = %MODEL
|
132 |
+
train_script.train.model_dir = %MODEL_DIR
|
133 |
+
train_script.train.partitioner = @partitioning.PjitPartitioner()
|
134 |
+
train_script.train.random_seed = %RANDOM_SEED
|
135 |
+
train_script.train.stats_period = 100
|
136 |
+
train_script.train.summarize_config_fn = @gin_utils.summarize_gin_config
|
137 |
+
train_script.train.total_steps = %TRAIN_STEPS
|
138 |
+
train_script.train.train_dataset_cfg = @train/utils.DatasetConfig()
|
139 |
+
train_script.train.train_eval_dataset_cfg = @train_eval/utils.DatasetConfig()
|
140 |
+
train_script.train.trainer_cls = @trainer.Trainer
|
141 |
+
train_script.train.use_hardware_rng = %USE_HARDWARE_RNG
|
142 |
+
|
143 |
+
# Parameters for trainer.Trainer:
|
144 |
+
# ==============================================================================
|
145 |
+
trainer.Trainer.learning_rate_fn = @utils.create_learning_rate_scheduler()
|
146 |
+
trainer.Trainer.num_microbatches = None
|
147 |
+
|
148 |
+
# Parameters for network.Transformer:
|
149 |
+
# ==============================================================================
|
150 |
+
network.Transformer.config = @network.T5Config()
|
config.json
ADDED
@@ -0,0 +1,32 @@
1 |
+
{
|
2 |
+
"_name_or_path": "./",
|
3 |
+
"architectures": [
|
4 |
+
"T5ForConditionalGeneration"
|
5 |
+
],
|
6 |
+
"d_ff": 2048,
|
7 |
+
"d_kv": 64,
|
8 |
+
"d_model": 768,
|
9 |
+
"decoder_start_token_id": 0,
|
10 |
+
"dense_act_fn": "gelu_new",
|
11 |
+
"dropout_rate": 0.1,
|
12 |
+
"eos_token_id": 1,
|
13 |
+
"feed_forward_proj": "gated-gelu",
|
14 |
+
"initializer_factor": 1.0,
|
15 |
+
"is_encoder_decoder": true,
|
16 |
+
"is_gated_act": true,
|
17 |
+
"layer_norm_epsilon": 1e-06,
|
18 |
+
"model_type": "t5",
|
19 |
+
"n_positions": 512,
|
20 |
+
"num_decoder_layers": 12,
|
21 |
+
"num_heads": 12,
|
22 |
+
"num_layers": 12,
|
23 |
+
"output_past": true,
|
24 |
+
"pad_token_id": 0,
|
25 |
+
"relative_attention_max_distance": 128,
|
26 |
+
"relative_attention_num_buckets": 32,
|
27 |
+
"tie_word_embeddings": false,
|
28 |
+
"torch_dtype": "float32",
|
29 |
+
"transformers_version": "4.24.0",
|
30 |
+
"use_cache": true,
|
31 |
+
"vocab_size": 32128
|
32 |
+
}
|
flax_model.msgpack
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:4b048b4c5af83da359bb5f09c6cc3acb7f8948b012b8499bf431f2145380363b
|
3 |
+
size 990323615
|
model-info.txt
ADDED
The diff for this file is too large to render.
See raw diff
|
|
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:9febb21742097fb3f8ee0ca994cc17e5ffb1ae1d0a8b6dfdddc6686b6940a04c
|
3 |
+
size 990404917
|
special_tokens_map.json
ADDED
@@ -0,0 +1,107 @@
1 |
+
{
|
2 |
+
"additional_special_tokens": [
|
3 |
+
"<extra_id_0>",
|
4 |
+
"<extra_id_1>",
|
5 |
+
"<extra_id_2>",
|
6 |
+
"<extra_id_3>",
|
7 |
+
"<extra_id_4>",
|
8 |
+
"<extra_id_5>",
|
9 |
+
"<extra_id_6>",
|
10 |
+
"<extra_id_7>",
|
11 |
+
"<extra_id_8>",
|
12 |
+
"<extra_id_9>",
|
13 |
+
"<extra_id_10>",
|
14 |
+
"<extra_id_11>",
|
15 |
+
"<extra_id_12>",
|
16 |
+
"<extra_id_13>",
|
17 |
+
"<extra_id_14>",
|
18 |
+
"<extra_id_15>",
|
19 |
+
"<extra_id_16>",
|
20 |
+
"<extra_id_17>",
|
21 |
+
"<extra_id_18>",
|
22 |
+
"<extra_id_19>",
|
23 |
+
"<extra_id_20>",
|
24 |
+
"<extra_id_21>",
|
25 |
+
"<extra_id_22>",
|
26 |
+
"<extra_id_23>",
|
27 |
+
"<extra_id_24>",
|
28 |
+
"<extra_id_25>",
|
29 |
+
"<extra_id_26>",
|
30 |
+
"<extra_id_27>",
|
31 |
+
"<extra_id_28>",
|
32 |
+
"<extra_id_29>",
|
33 |
+
"<extra_id_30>",
|
34 |
+
"<extra_id_31>",
|
35 |
+
"<extra_id_32>",
|
36 |
+
"<extra_id_33>",
|
37 |
+
"<extra_id_34>",
|
38 |
+
"<extra_id_35>",
|
39 |
+
"<extra_id_36>",
|
40 |
+
"<extra_id_37>",
|
41 |
+
"<extra_id_38>",
|
42 |
+
"<extra_id_39>",
|
43 |
+
"<extra_id_40>",
|
44 |
+
"<extra_id_41>",
|
45 |
+
"<extra_id_42>",
|
46 |
+
"<extra_id_43>",
|
47 |
+
"<extra_id_44>",
|
48 |
+
"<extra_id_45>",
|
49 |
+
"<extra_id_46>",
|
50 |
+
"<extra_id_47>",
|
51 |
+
"<extra_id_48>",
|
52 |
+
"<extra_id_49>",
|
53 |
+
"<extra_id_50>",
|
54 |
+
"<extra_id_51>",
|
55 |
+
"<extra_id_52>",
|
56 |
+
"<extra_id_53>",
|
57 |
+
"<extra_id_54>",
|
58 |
+
"<extra_id_55>",
|
59 |
+
"<extra_id_56>",
|
60 |
+
"<extra_id_57>",
|
61 |
+
"<extra_id_58>",
|
62 |
+
"<extra_id_59>",
|
63 |
+
"<extra_id_60>",
|
64 |
+
"<extra_id_61>",
|
65 |
+
"<extra_id_62>",
|
66 |
+
"<extra_id_63>",
|
67 |
+
"<extra_id_64>",
|
68 |
+
"<extra_id_65>",
|
69 |
+
"<extra_id_66>",
|
70 |
+
"<extra_id_67>",
|
71 |
+
"<extra_id_68>",
|
72 |
+
"<extra_id_69>",
|
73 |
+
"<extra_id_70>",
|
74 |
+
"<extra_id_71>",
|
75 |
+
"<extra_id_72>",
|
76 |
+
"<extra_id_73>",
|
77 |
+
"<extra_id_74>",
|
78 |
+
"<extra_id_75>",
|
79 |
+
"<extra_id_76>",
|
80 |
+
"<extra_id_77>",
|
81 |
+
"<extra_id_78>",
|
82 |
+
"<extra_id_79>",
|
83 |
+
"<extra_id_80>",
|
84 |
+
"<extra_id_81>",
|
85 |
+
"<extra_id_82>",
|
86 |
+
"<extra_id_83>",
|
87 |
+
"<extra_id_84>",
|
88 |
+
"<extra_id_85>",
|
89 |
+
"<extra_id_86>",
|
90 |
+
"<extra_id_87>",
|
91 |
+
"<extra_id_88>",
|
92 |
+
"<extra_id_89>",
|
93 |
+
"<extra_id_90>",
|
94 |
+
"<extra_id_91>",
|
95 |
+
"<extra_id_92>",
|
96 |
+
"<extra_id_93>",
|
97 |
+
"<extra_id_94>",
|
98 |
+
"<extra_id_95>",
|
99 |
+
"<extra_id_96>",
|
100 |
+
"<extra_id_97>",
|
101 |
+
"<extra_id_98>",
|
102 |
+
"<extra_id_99>"
|
103 |
+
],
|
104 |
+
"eos_token": "</s>",
|
105 |
+
"pad_token": "<pad>",
|
106 |
+
"unk_token": "<unk>"
|
107 |
+
}
|
spiece.model
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:caa6e2f21aeec181276ab80273e3f869ce303ccb8602d68e0524783c3581092d
|
3 |
+
size 800223
|
spiece.vocab
ADDED
The diff for this file is too large to render.
See raw diff
|
|
tokenizer_config.json
ADDED
@@ -0,0 +1,113 @@
1 |
+
{
|
2 |
+
"additional_special_tokens": [
|
3 |
+
"<extra_id_0>",
|
4 |
+
"<extra_id_1>",
|
5 |
+
"<extra_id_2>",
|
6 |
+
"<extra_id_3>",
|
7 |
+
"<extra_id_4>",
|
8 |
+
"<extra_id_5>",
|
9 |
+
"<extra_id_6>",
|
10 |
+
"<extra_id_7>",
|
11 |
+
"<extra_id_8>",
|
12 |
+
"<extra_id_9>",
|
13 |
+
"<extra_id_10>",
|
14 |
+
"<extra_id_11>",
|
15 |
+
"<extra_id_12>",
|
16 |
+
"<extra_id_13>",
|
17 |
+
"<extra_id_14>",
|
18 |
+
"<extra_id_15>",
|
19 |
+
"<extra_id_16>",
|
20 |
+
"<extra_id_17>",
|
21 |
+
"<extra_id_18>",
|
22 |
+
"<extra_id_19>",
|
23 |
+
"<extra_id_20>",
|
24 |
+
"<extra_id_21>",
|
25 |
+
"<extra_id_22>",
|
26 |
+
"<extra_id_23>",
|
27 |
+
"<extra_id_24>",
|
28 |
+
"<extra_id_25>",
|
29 |
+
"<extra_id_26>",
|
30 |
+
"<extra_id_27>",
|
31 |
+
"<extra_id_28>",
|
32 |
+
"<extra_id_29>",
|
33 |
+
"<extra_id_30>",
|
34 |
+
"<extra_id_31>",
|
35 |
+
"<extra_id_32>",
|
36 |
+
"<extra_id_33>",
|
37 |
+
"<extra_id_34>",
|
38 |
+
"<extra_id_35>",
|
39 |
+
"<extra_id_36>",
|
40 |
+
"<extra_id_37>",
|
41 |
+
"<extra_id_38>",
|
42 |
+
"<extra_id_39>",
|
43 |
+
"<extra_id_40>",
|
44 |
+
"<extra_id_41>",
|
45 |
+
"<extra_id_42>",
|
46 |
+
"<extra_id_43>",
|
47 |
+
"<extra_id_44>",
|
48 |
+
"<extra_id_45>",
|
49 |
+
"<extra_id_46>",
|
50 |
+
"<extra_id_47>",
|
51 |
+
"<extra_id_48>",
|
52 |
+
"<extra_id_49>",
|
53 |
+
"<extra_id_50>",
|
54 |
+
"<extra_id_51>",
|
55 |
+
"<extra_id_52>",
|
56 |
+
"<extra_id_53>",
|
57 |
+
"<extra_id_54>",
|
58 |
+
"<extra_id_55>",
|
59 |
+
"<extra_id_56>",
|
60 |
+
"<extra_id_57>",
|
61 |
+
"<extra_id_58>",
|
62 |
+
"<extra_id_59>",
|
63 |
+
"<extra_id_60>",
|
64 |
+
"<extra_id_61>",
|
65 |
+
"<extra_id_62>",
|
66 |
+
"<extra_id_63>",
|
67 |
+
"<extra_id_64>",
|
68 |
+
"<extra_id_65>",
|
69 |
+
"<extra_id_66>",
|
70 |
+
"<extra_id_67>",
|
71 |
+
"<extra_id_68>",
|
72 |
+
"<extra_id_69>",
|
73 |
+
"<extra_id_70>",
|
74 |
+
"<extra_id_71>",
|
75 |
+
"<extra_id_72>",
|
76 |
+
"<extra_id_73>",
|
77 |
+
"<extra_id_74>",
|
78 |
+
"<extra_id_75>",
|
79 |
+
"<extra_id_76>",
|
80 |
+
"<extra_id_77>",
|
81 |
+
"<extra_id_78>",
|
82 |
+
"<extra_id_79>",
|
83 |
+
"<extra_id_80>",
|
84 |
+
"<extra_id_81>",
|
85 |
+
"<extra_id_82>",
|
86 |
+
"<extra_id_83>",
|
87 |
+
"<extra_id_84>",
|
88 |
+
"<extra_id_85>",
|
89 |
+
"<extra_id_86>",
|
90 |
+
"<extra_id_87>",
|
91 |
+
"<extra_id_88>",
|
92 |
+
"<extra_id_89>",
|
93 |
+
"<extra_id_90>",
|
94 |
+
"<extra_id_91>",
|
95 |
+
"<extra_id_92>",
|
96 |
+
"<extra_id_93>",
|
97 |
+
"<extra_id_94>",
|
98 |
+
"<extra_id_95>",
|
99 |
+
"<extra_id_96>",
|
100 |
+
"<extra_id_97>",
|
101 |
+
"<extra_id_98>",
|
102 |
+
"<extra_id_99>"
|
103 |
+
],
|
104 |
+
"eos_token": "</s>",
|
105 |
+
"extra_ids": 100,
|
106 |
+
"name_or_path": "yhavinga/ul2-base-en-nl",
|
107 |
+
"pad_token": "<pad>",
|
108 |
+
"sp_model_kwargs": {},
|
109 |
+
"special_tokens_map_file": null,
|
110 |
+
"tokenizer_class": "T5Tokenizer",
|
111 |
+
"unk_token": "<unk>",
|
112 |
+
"use_fast_tokenizer": false
|
113 |
+
}
|
train/events.out.tfevents.1673306830.t1v-n-a765f9c4-w-0.2573516.0.v2
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:27724d5787a62521849f131f6f6e67495d5ae4c849b1701c11fe27607b738aa8
|
3 |
+
size 19863952
|
train/events.out.tfevents.1673803251.t1v-n-a765f9c4-w-0.20401.0.v2
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:3ee1b6b7540acaf819a356d147ccbaba6d9a208b39722bca506a3d1c88b28088
|
3 |
+
size 6114
|
training_eval/mc4_en_nl_ul2_denoising/events.out.tfevents.1673306830.t1v-n-a765f9c4-w-0.2573516.1.v2
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:d3f2683b14423349bec232c4f0de73a111e1e56200ca8ccac78fd39d6994a9d7
|
3 |
+
size 879457
|
training_eval/mc4_en_nl_ul2_denoising/events.out.tfevents.1673803251.t1v-n-a765f9c4-w-0.20401.1.v2
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:af8e57c12f551772660da56a6ea6f7bb2b176f15db18d2b1306f8db62f70aa61
|
3 |
+
size 40
|
training_eval/ul2_en_nl_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1673306830.t1v-n-a765f9c4-w-0.2573516.2.v2
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:630a16ddaec46a7227654fd9b03243d8a3427ee5ff49f28cacc4c878fcfe5ca3
|
3 |
+
size 879457
|
training_eval/ul2_en_nl_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1673803251.t1v-n-a765f9c4-w-0.20401.2.v2
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:af8e57c12f551772660da56a6ea6f7bb2b176f15db18d2b1306f8db62f70aa61
|
3 |
+
size 40
|