Create README.md
README.md
ADDED
@@ -0,0 +1,32 @@
---
language:
- dutch
tags:
- seq2seq
- text-generation
datasets:
- mc4
---

# t5-base-dutch
> This model was created during the [Flax/JAX Community Week](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109), organized by [Hugging Face](https://huggingface.co/), with TPU usage sponsored by Google.
Want to give it a try? Head over to the [Hugging Face Spaces demo](https://huggingface.co/spaces/flax-community/netherformer).

See also the fine-tuned [t5-base-dutch-demo](https://huggingface.co/flax-community/t5-base-dutch-demo) model, which is based on this model.

## Dataset

This model was trained on a cleaned version of the multilingual C4 dataset (`mc4`).
See the `clean` directory for the cleaning script.

* Documents that contain words from a selection of the Dutch and English [List of Dirty, Naughty, Obscene, and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
* Sentences with fewer than 3 words are removed
* Sentences containing a word of more than 1000 characters are removed
* Documents with fewer than 5 sentences are removed
* Documents containing "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", or "deze printversie" are removed
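
The rules above can be sketched roughly as follows. This is an illustrative approximation, not the script from the `clean` directory: the function name `clean_document`, the placeholder `BAD_WORDS` set, the regex-based sentence splitter, and the order in which the rules are applied are all assumptions made for the example.

```python
import re

# Hypothetical placeholder; the real script uses a selection of the LDNOOBW lists.
BAD_WORDS = {"badword"}

# Phrases listed in this README; a document containing any of them is dropped.
BANNED_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)

MIN_WORDS_PER_SENTENCE = 3
MAX_WORD_LENGTH = 1000
MIN_SENTENCES_PER_DOC = 5


def clean_document(text, bad_words=BAD_WORDS):
    """Return the cleaned text, or None if the whole document is dropped."""
    lowered = text.lower()

    # Drop documents that contain a banned phrase (case-insensitive substring match).
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return None

    # Drop documents that contain a word from the bad-word list.
    if set(re.findall(r"\w+", lowered)) & set(bad_words):
        return None

    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    kept = []
    for sentence in sentences:
        words = sentence.split()
        # Drop sentences with fewer than 3 words.
        if len(words) < MIN_WORDS_PER_SENTENCE:
            continue
        # Drop sentences containing a word longer than 1000 characters.
        if any(len(word) > MAX_WORD_LENGTH for word in words):
            continue
        kept.append(sentence)

    # Drop documents left with fewer than 5 sentences (assumed to be
    # counted after sentence-level filtering).
    if len(kept) < MIN_SENTENCES_PER_DOC:
        return None
    return " ".join(kept)
```

Calling `clean_document(text)` on each raw document and discarding the `None` results yields the filtered corpus under these assumptions.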

## Training

The model was trained for 63000 steps with a batch size of 128, ending with an evaluation loss of 1.79 and an accuracy of 0.64.
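
As a quick sanity check on the scale of training, the number of sequences processed follows from the two figures above. The calculation assumes that 128 is the global batch size and that no gradient accumulation was used, neither of which is stated here. The sequence length is not given either, so no token count is derived.

```python
# Figures taken from the training summary above.
steps = 63_000
batch_size = 128  # assumed to be the global batch size per optimizer step

# Total number of training sequences processed over the run.
sequences_seen = steps * batch_size
print(f"{sequences_seen:,} training sequences")
```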