---
language:
- dutch
tags:
- seq2seq
- text-generation
datasets:
- mc4
---

# t5-base-dutch
Created by Yeb Havinga & Dat Nguyen during the Hugging Face community week, organized by Hugging Face with TPU usage sponsored by Google, for the project Pre-train T5 from scratch in Dutch.

See also the fine-tuned t5-base-dutch-demo model and the demo application Netherformer 📰, which are based on this model.
## Dataset
This model was trained on a cleaned version of the Dutch part of mC4. See the `clean` directory for the cleaning script. The following filters were applied (a minimal sketch follows the list):
- Documents containing words from a selection of the Dutch and English List of Dirty Naughty Obscene and Otherwise Bad Words are removed
- Sentences with fewer than 3 words are removed
- Sentences containing a word longer than 1000 characters are removed
- Documents with fewer than 5 sentences are removed
- Documents containing any of the phrases "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed
## Training
Training of the model was resumed from an earlier checkpoint several times, as can be seen in the training metrics tab (switch to wall time for a better view).

After several hours of training, an error would be raised that we were not able to identify and solve. As a workaround, the first few resumes started again at step 0 with a differently seeded reshuffling of the data. In the last two resumes the random seed was fixed and training resumed at the previous step, since a try/except around the failing example allowed training to continue despite errors caused by a single example. A sketch of this workaround is shown below.
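A minimal sketch of the resume-and-skip workaround, where `state`, `data_loader`, and `train_step` are hypothetical stand-ins for the project's actual training objects:

```python
def run_training(state, data_loader, train_step, resume_step=0):
    """Resume training at `resume_step`, skipping examples that raise errors.

    All arguments are hypothetical stand-ins for the project's actual
    training objects; only the try/except workaround is illustrated here.
    """
    for step, batch in enumerate(data_loader, start=resume_step):
        try:
            state, metrics = train_step(state, batch)
        except Exception as err:
            # A single failing example should not abort the whole run.
            print(f"step {step}: skipping batch after error: {err}")
            continue
    return state
```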
The final model was trained for 63000 steps with a batch size of 128, ending with an evaluation loss of 1.79 and an accuracy of 0.64. A triangular learning rate schedule was used, with a peak learning rate of 0.01 for the first few runs and 0.001 for the last two runs.
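For reference, a triangular schedule ramps the learning rate linearly up to the peak and then linearly back down. A minimal sketch, where the warmup length is an assumption (only the peak values and the total number of steps are stated above):

```python
def triangular_lr(step: int, total_steps: int = 63000,
                  peak_lr: float = 0.001, warmup_steps: int = 6300) -> float:
    """Linear warmup to `peak_lr`, then linear decay towards zero.

    `warmup_steps` is an assumed value; only the peak learning rates
    (0.01 for the first runs, 0.001 for the last two) and the total of
    63000 steps are given in the description above.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```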