---
language:
- dutch
tags:
- seq2seq
- text-generation
datasets:
- mc4
---
# t5-base-dutch
Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/) & [Dat Nguyen](https://www.linkedin.com/in/dat-nguyen-49a641138/) during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) with TPU usage sponsored by Google, for the project [Pre-train T5 from scratch in Dutch](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109).
See also the fine-tuned [t5-base-dutch-demo](https://huggingface.co/flax-community/t5-base-dutch-demo) model and the demo application **[Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer)**, which are based on this model.
## Dataset
This model was trained on a cleaned version of the Dutch part of [mC4](https://huggingface.co/datasets/mc4).
See the `clean` directory for the cleaning script; a minimal sketch of the filters is shown after the list below.
* Documents that contained words from a selection of the Dutch and English [List of Dirty, Naughty, Obscene, and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
* Sentences with fewer than 3 words are removed
* Sentences containing a word of more than 1000 characters are removed
* Documents with fewer than 5 sentences are removed
* Documents containing any of the phrases "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
  "use of cookies", "use cookies", "elementen ontbreken" or "deze printversie" are removed.
## Training
Training of the model was resumed from an earlier checkpoint several times, as can be seen in the training metrics tab (switch to wall time for a better view).
After several hours of training, an error would be raised that we have not been able to identify and solve. As a workaround,
the first few resumed runs started again at step 0 with a differently seeded reshuffling of the data.
In the last two runs the random seed was fixed and training resumed at the previous step, since a try/except around the failing example allowed training to continue when a single example caused an error.
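A minimal sketch of that workaround, with hypothetical names (`train_step`, `state`, and `resume_step` are not the project's actual identifiers):

```python
def train_loop(state, batches, train_step, resume_step=0):
    """Run training, skipping batches that raise errors instead of aborting."""
    for step, batch in enumerate(batches, start=resume_step):
        try:
            state, metrics = train_step(state, batch)
        except Exception as err:
            # A single failing example should not end a multi-hour run:
            # log the error and continue with the next batch.
            print(f"step {step}: skipped batch after error: {err}")
    return state
```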
The final model was trained for 63,000 steps with a batch size of 128, ending with an evaluation loss of 1.79 and an accuracy of 0.64.
A triangular learning rate schedule was used, with a peak learning rate of 0.01 for the first few runs and 0.001 for the last two runs.
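As an illustration, a triangular (linear warm-up followed by linear decay) schedule can be built with `optax`; the warm-up length below is an assumption, only the peak learning rates and the 63,000-step budget come from this section.

```python
import optax

def triangular_schedule(peak_lr: float, total_steps: int, warmup_steps: int = 1_000):
    """Linear warm-up to peak_lr, then linear decay back to zero."""
    warmup = optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                                   transition_steps=warmup_steps)
    decay = optax.linear_schedule(init_value=peak_lr, end_value=0.0,
                                  transition_steps=total_steps - warmup_steps)
    return optax.join_schedules([warmup, decay], boundaries=[warmup_steps])

# For example, the setting of the last two runs: peak learning rate 0.001 over 63,000 steps.
optimizer = optax.adamw(learning_rate=triangular_schedule(0.001, 63_000))
```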