---
language:
- dutch
tags:
- seq2seq
- text-generation
datasets:
- mc4
---

# t5-base-dutch

> This is part of the [Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109), organized by [HuggingFace](https://huggingface.co/), with TPU usage sponsored by Google.

Want to give it a try? Then head over to the Hugging Face Spaces for the [Netherformer](https://huggingface.co/spaces/flax-community/netherformer) example application. See also the fine-tuned [t5-base-dutch-demo](https://huggingface.co/flax-community/t5-base-dutch-demo) model, which is based on this model.

## Dataset

This model was trained on a cleaned version of the Dutch part of [mC4](https://huggingface.co/datasets/mc4). See the `clean` directory for the cleaning script; a minimal sketch of the filters is also included at the end of this card. The cleaning applied the following rules:

* Documents containing words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed.
* Sentences with fewer than 3 words are removed.
* Sentences containing a word of more than 1000 characters are removed.
* Documents with fewer than 5 sentences are removed.
* Documents containing the strings "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", or "deze printversie" are removed.

## Training

The model was trained for 63,000 steps with a batch size of 128, ending with an evaluation loss of 1.79 and an evaluation accuracy of 0.64.
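
## Cleaning sketch

The filters listed under Dataset roughly translate to the following code. This is a hypothetical re-implementation for illustration only: the authoritative logic is the cleaning script in the `clean` directory, and the bad-word set and the regex-based sentence splitter below are stand-ins.

```python
import re

from datasets import load_dataset

# Stand-in for the selected Dutch and English LDNOOBW word lists.
BAD_WORDS = {"badword1", "badword2"}
# Literal substrings whose presence drops the whole document.
BOILERPLATE_MARKERS = [
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
]

def keep_document(text: str) -> bool:
    lowered = text.lower()
    # Drop documents containing a bad word or a boilerplate marker.
    if set(re.findall(r"\w+", lowered)) & BAD_WORDS:
        return False
    if any(marker in lowered for marker in BOILERPLATE_MARKERS):
        return False
    # Naive sentence split; drop sentences with fewer than 3 words
    # or with a word longer than 1000 characters.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    sentences = [
        s for s in sentences
        if len(s.split()) >= 3 and max(len(w) for w in s.split()) <= 1000
    ]
    # Drop documents with fewer than 5 remaining sentences.
    return len(sentences) >= 5

# Example: stream the raw Dutch split of mC4 and apply the filter.
nl_mc4 = load_dataset("mc4", "nl", split="train", streaming=True)
cleaned = (ex["text"] for ex in nl_mc4 if keep_document(ex["text"]))
```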
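
## Usage sketch

A minimal sketch of loading the checkpoint, assuming it is published on the Hub as `flax-community/t5-base-dutch` (inferred from the links above). Since the model was pretrained with span corruption only and not fine-tuned on a downstream task, the sentinel token `<extra_id_0>` marks a masked span for the model to predict; for real applications, start from fine-tuning, as was done for the t5-base-dutch-demo model.

```python
from transformers import FlaxT5ForConditionalGeneration, T5Tokenizer

model_id = "flax-community/t5-base-dutch"  # assumed hub id
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = FlaxT5ForConditionalGeneration.from_pretrained(model_id)

# Ask the pretrained model to fill in a masked span (the span-corruption task).
inputs = tokenizer("Het weer is vandaag <extra_id_0> en warm.", return_tensors="np")
outputs = model.generate(inputs["input_ids"], max_length=20)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=False))
```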