---
language:
- dutch
tags:
- seq2seq
- text-generation
datasets:
- mc4
---

# t5-base-dutch

> This model was created during the [Flax/JAX Community Week](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109), organized by [Hugging Face](https://huggingface.co/), with TPU usage sponsored by Google.

Want to give it a try? Head over to the [demo on Hugging Face Spaces](https://huggingface.co/spaces/flax-community/netherformer).

See also the fine-tuned [t5-base-dutch-demo](https://huggingface.co/flax-community/t5-base-dutch-demo) model, which is based on this model.

## Dataset

This model was trained on a cleaned version of the Dutch part of mC4 (the multilingual C4 corpus).
The cleaning script is in the `clean` directory. It applies the following filters:

* Documents containing words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed.
* Sentences with fewer than 3 words are removed.
* Sentences containing a word of more than 1000 characters are removed.
* Documents with fewer than 5 sentences are removed.
* Documents containing "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", or "deze printversie" are removed.

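The filters above can be sketched in a few lines of Python. This is not the script from the `clean` directory, only a minimal illustration under stated assumptions: the sentence splitter is a naive regex, bad words are matched against lowercase word tokens, and the document-level sentence count is checked after short sentences have been dropped. The names `clean_document`, `split_sentences`, and the constants are hypothetical.

```python
import re
from typing import Iterable, List, Optional

# Phrases quoted from the list above; a document containing any of them is dropped.
BANNED_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)

MIN_WORDS_PER_SENTENCE = 3
MAX_WORD_LENGTH = 1000
MIN_SENTENCES_PER_DOCUMENT = 5


def split_sentences(text: str) -> List[str]:
    # Assumption: split on whitespace that follows '.', '!' or '?'.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean_document(text: str, bad_words: Iterable[str]) -> Optional[str]:
    """Return the cleaned text, or None if the whole document is dropped."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return None

    # Assumption: bad words are compared against lowercase word tokens.
    tokens = set(re.findall(r"\w+", lowered))
    if tokens & {word.lower() for word in bad_words}:
        return None

    kept = []
    for sentence in split_sentences(text):
        words = sentence.split()
        if len(words) < MIN_WORDS_PER_SENTENCE:
            continue  # sentence too short
        if any(len(word) > MAX_WORD_LENGTH for word in words):
            continue  # sentence contains an implausibly long "word"
        kept.append(sentence)

    # Assumption: the sentence count is checked after sentence-level filtering.
    if len(kept) < MIN_SENTENCES_PER_DOCUMENT:
        return None
    return " ".join(kept)
```

Calling `clean_document(text, bad_words)` on each raw document and discarding the `None` results would give a filtered corpus; the order in which the real script applies its checks may differ.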

## Training

The model was trained for 63,000 steps with a batch size of 128, ending with an evaluation loss of 1.79 and an accuracy of 0.64.
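
As a rough sanity check on these figures, the number of training sequences follows from steps times batch size, and the evaluation loss can be converted to a perplexity. The sketch below assumes that 128 is the total batch size per step and that the loss is a mean cross-entropy in nats; neither is stated above.

```python
import math

TRAIN_STEPS = 63_000
BATCH_SIZE = 128   # assumption: total batch size per optimizer step
EVAL_LOSS = 1.79   # assumption: mean cross-entropy in nats

# Total number of training sequences processed.
sequences_seen = TRAIN_STEPS * BATCH_SIZE
print(f"{sequences_seen:,}")   # 8,064,000

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(EVAL_LOSS)
print(f"{perplexity:.2f}")     # 5.99
```

The sequence length is not given here, so the total token count cannot be derived from these numbers alone.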