---
language:
- dutch
tags:
- seq2seq
- text-generation
datasets:
- mc4
---
# t5-base-dutch
Created by Yeb Havinga & Dat Nguyen during the Hugging Face community week, organized by HuggingFace with TPU usage sponsored by Google, for the project Pre-train T5 from scratch in Dutch.
See also the fine-tuned t5-base-dutch-demo model and the demo application Netherformer 📰, which are based on this model.
## Dataset
This model was trained on a cleaned version of the Dutch part of mC4.
See the `clean` directory for the cleaning script. The following filters were applied; a minimal sketch appears after the list.
- Documents that contained words from a selection of the Dutch and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words are removed
- Sentences with fewer than 3 words are removed
- Sentences containing a word of more than 1000 characters are removed
- Documents with fewer than 5 sentences are removed
- Documents containing the phrases "javascript", "lorem ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", or "deze printversie" are removed
## Training
The model was trained for 63,000 steps with a batch size of 128, ending with an evaluation loss of 1.79 and an accuracy of 0.64.
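A minimal sketch of loading the checkpoint with the transformers library is shown below. The hub id `yhavinga/t5-base-dutch` is an assumption; adjust it to the actual repository name. Since this checkpoint is only pre-trained with span corruption and not fine-tuned, it is prompted with a sentinel token rather than a downstream task.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Hub id is assumed; adjust the namespace if needed.
model_name = "yhavinga/t5-base-dutch"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Prompt with a sentinel token, matching the span-corruption
# pre-training objective.
text = "De hoofdstad van Nederland is <extra_id_0>."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

For task-ready text generation, use the fine-tuned t5-base-dutch-demo model mentioned above instead.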