Update README.md
README.md
CHANGED
@@ -14,33 +14,67 @@ datasets:
---
# GPT Neo 1.3B pre-trained on cleaned Dutch mC4 🇳🇱

-* dataset config: tiny (3B tokens)
-* dataset config: large (24B tokens)
-* Trained for 960K steps (batch size 16) to ppl 16,0 on mc4 nl full
-* Block size: 512
-* Optimizer: adafactor
-* lr: 5e-5
-* Warmup steps: 5000
A GPT-Neo model trained from scratch on Dutch, with perplexity 16.0 on cleaned Dutch mC4.

## How To Use

You can use this GPT-Neo model directly with a pipeline for text generation.

```python
from transformers import pipeline, GPT2Tokenizer, GPTNeoForCausalLM

MODEL_DIR = 'yhavinga/gpt-neo-1.3B-dutch'

# Load the Dutch tokenizer and the GPT-Neo 1.3B model
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
model = GPTNeoForCausalLM.from_pretrained(MODEL_DIR)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Generate a continuation with beam search and repetition penalties
generated_text = generator('1 - geel. 2 - groen. 3 -', max_length=60, num_beams=4, no_repeat_ngram_size=3, repetition_penalty=2.0)
```

*"1 - geel. 2 - groen. 3 - rood. 4 - blauw. 5 - bruin. 6 - zwart. 7 - oranje. 8 - roze. 9 - paars. 10 - wit. 11 - grijs. 12 - magenta. 13 - lila. 14 - lichtgroen. 15"*
## Tokenizer

* BPE tokenizer trained from scratch for Dutch on mC4 nl cleaned with scripts from the Huggingface
  Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
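A small snippet (a sketch for illustration, not part of the original card; the example sentence is made up) showing the subword segmentation this tokenizer produces:

```python
from transformers import GPT2Tokenizer

# Minimal sketch: inspect how the Dutch BPE tokenizer segments a sentence.
tokenizer = GPT2Tokenizer.from_pretrained('yhavinga/gpt-neo-1.3B-dutch')
print(tokenizer.tokenize('Het is vandaag mooi weer in Nederland.'))  # subword tokens
print(tokenizer.vocab_size)  # size of the Dutch BPE vocabulary
```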
## Dataset

This model was trained on the Wikipedia and newspaper webpages (3.9B tokens) in
[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
which is the original mC4, except that:

* Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
* Sentences with fewer than 3 words are removed
* Sentences with a word of more than 1000 characters are removed
* Documents with fewer than 5 sentences are removed
* Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
  "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
## Models

TL;DR: [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) is the best model.

* `yhavinga/gpt-neo-125M-dutch` is trained on a fraction of C4 containing only Wikipedia and news sites.
* The models with `a`/`b` in the steps column have been trained to step `a` of a total of `b` steps.

| model                                                                              | type    | params | train seq len | ppl  | loss | batch size | epochs | steps           | optim     | lr     | duration | config    |
|------------------------------------------------------------------------------------|---------|--------|---------------|------|------|------------|--------|-----------------|-----------|--------|----------|-----------|
| [yhavinga/gpt-neo-125M-dutch](https://huggingface.co/yhavinga/gpt-neo-125M-dutch)  | gpt neo | 125M   | 512           | 19.9 | 2.99 | 128        | 8      | 558608          | adamw     | 2.4e-3 | 1d 12h   | news+wiki |
| [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch)    | gpt2    | 345M   | 512           | 15.1 | 2.71 | 128        | 4      | 320000/520502   | adafactor | 8e-4   | 7d 2h    | full      |
| [yhavinga/gpt2-large-dutch](https://huggingface.co/yhavinga/gpt2-large-dutch)      | gpt2    | 762M   | 512           | 15.1 | 2.72 | 32         | 1      | 1100000/2082009 | adafactor | 3.3e-5 | 8d 15h   | large     |
| [yhavinga/gpt-neo-1.3B-dutch](https://huggingface.co/yhavinga/gpt-neo-1.3B-dutch)  | gpt neo | 1.3B   | 512           | 16.0 | 2.77 | 16         | 1      | 960000/3049896  | adafactor | 5e-4   | 7d 11h   | full      |
## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
instrumental in most, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
and training the models:

* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
* [HuggingFace Flax MLM examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
* [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
* [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)