yhavinga committed on
Commit fe354ca
1 Parent(s): a4a0036

Update README.md

Files changed (1)
  1. README.md +56 -22
README.md CHANGED
@@ -14,33 +14,67 @@ datasets:
  ---
  # GPT Neo 1.3B pre-trained on cleaned Dutch mC4 🇳🇱

- *NB: Training in progress.*

- Dataset:

- * [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned)
- * dataset config: tiny (3B tokens)
- * dataset config: large (24B tokens)

- Tokenizer:

- * Tokenizer trained on mC4 with scripts from the Huggingface
-   Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)

- Training details:

- * Trained for 70K steps (batch size 64) to ppl 27 on mc4 nl tiny 1 epoch
- * Trained for 960K steps (batch size 16) to ppl 16,0 on mc4 nl full
- * Block size: 512
- * Optimizer: adafactor
- * lr: 5e-5
- * Warmup steps: 5000

- Jan 2022

- * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
- * Thanks to @gsarti for creating the [t5-flax-gcp
-   repository](https://github.com/gsarti/t5-flax-gcp).
- * Also thanks to the creators of [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian) and
-   [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-persian)
-   for sharing their training scripts!
+ A GPT-Neo model trained from scratch on Dutch, with perplexity 16.0 on cleaned Dutch mC4.
+
+ ## How To Use
+
+ You can use this GPT-Neo model directly with a pipeline for text generation.
+
+ ```python
+ from transformers import pipeline, GPT2Tokenizer, GPTNeoForCausalLM
+
+ MODEL_DIR = 'yhavinga/gpt-neo-1.3B-dutch'
+ tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
+ model = GPTNeoForCausalLM.from_pretrained(MODEL_DIR)
+ generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
+
+ generated_text = generator('1 - geel. 2 - groen. 3 -', max_length=60, num_beams=4, no_repeat_ngram_size=3, repetition_penalty=2.0)
+ ```
+
+ *"1 - geel. 2 - groen. 3 - rood. 4 - blauw. 5 - bruin. 6 - zwart. 7 - oranje. 8 - roze. 9 - paars. 10 - wit. 11 - grijs. 12 - magenta. 13 - lila. 14 - lichtgroen. 15"*
+
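+ If you want more control over decoding, you can also call `model.generate` directly. A minimal sketch, reusing the `tokenizer` and `model` loaded above (the prompt and sampling settings here are only illustrative):
+
+ ```python
+ # Sampling-based generation; the prompt and generation parameters are examples, not recommended values.
+ inputs = tokenizer('Het weer in Nederland is', return_tensors='pt')
+ outputs = model.generate(**inputs, max_length=60, do_sample=True, top_k=50, top_p=0.95)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+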
+ ## Tokenizer
+
+ * BPE tokenizer trained from scratch for Dutch on the cleaned Dutch mC4 dataset with scripts from the Huggingface
+   Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling); a rough sketch of this step is shown below.
+
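+ A rough sketch of that tokenizer training with the 🤗 `datasets` and `tokenizers` libraries (the dataset config name, vocabulary size and special token below are assumptions, not the exact settings that were used):
+
+ ```python
+ from datasets import load_dataset
+ from tokenizers import ByteLevelBPETokenizer
+
+ # Assumed config name; mc4_nl_cleaned ships several configs (e.g. tiny, full).
+ dataset = load_dataset('yhavinga/mc4_nl_cleaned', 'tiny', split='train')
+
+ def batch_iterator(batch_size=1000):
+     # Stream the raw text column to the tokenizer trainer in batches.
+     for i in range(0, len(dataset), batch_size):
+         yield dataset[i:i + batch_size]['text']
+
+ tokenizer = ByteLevelBPETokenizer()
+ tokenizer.train_from_iterator(batch_iterator(), vocab_size=50257, min_frequency=2, special_tokens=['<|endoftext|>'])
+ tokenizer.save('tokenizer.json')
+ ```
+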
+ ## Dataset
+
+ This model was trained on the Wikipedia and newspaper webpages (3.9B tokens) in
+ [cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
+ which is the original mC4, except that (a simplified sketch of these filters in code follows the list):
+
+ * Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
+ * Sentences with fewer than 3 words are removed
+ * Sentences containing a word of more than 1000 characters are removed
+ * Documents with fewer than 5 sentences are removed
+ * Documents containing "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
+   "use of cookies", "use cookies", "elementen ontbreken" or "deze printversie" are removed.
+
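+ A simplified Python sketch of these document-level filters (the helper names and the shortened filter lists are illustrative; the actual cleaning code lives with the dataset):
+
+ ```python
+ import re
+
+ # Illustrative subset of the filter terms; the real lists are longer.
+ BAD_SUBSTRINGS = ['javascript', 'lorum ipsum', 'terms of use', 'privacy policy']
+
+ def keep_sentence(sentence: str) -> bool:
+     words = sentence.split()
+     # Drop very short sentences and sentences with absurdly long "words".
+     return len(words) >= 3 and all(len(w) <= 1000 for w in words)
+
+ def keep_document(text: str) -> bool:
+     lowered = text.lower()
+     if any(s in lowered for s in BAD_SUBSTRINGS):
+         return False
+     sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if keep_sentence(s)]
+     return len(sentences) >= 5
+ ```
+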
+ ## Models
+
+ TL;DR: [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) is the best model.
+
+ * `yhavinga/gpt-neo-125M-dutch` is trained on a fraction of C4 containing only wikipedia and news sites.
+ * The models with `a`/`b` in the steps column have been trained to step `a` of a total of `b` steps.
+
+ |                                                                                   | model   | params | train seq len | ppl  | loss | batch size | epochs | steps           | optim     | lr     | duration | config    |
+ |-----------------------------------------------------------------------------------|---------|--------|---------------|------|------|------------|--------|-----------------|-----------|--------|----------|-----------|
+ | [yhavinga/gpt-neo-125M-dutch](https://huggingface.co/yhavinga/gpt-neo-125M-dutch) | gpt neo | 125M   | 512           | 19.9 | 2.99 | 128        | 8      | 558608          | adamw     | 2.4e-3 | 1d 12h   | news+wiki |
+ | [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch)   | gpt2    | 345M   | 512           | 15.1 | 2.71 | 128        | 4      | 320000/520502   | adafactor | 8e-4   | 7d 2h    | full      |
+ | [yhavinga/gpt2-large-dutch](https://huggingface.co/yhavinga/gpt2-large-dutch)     | gpt2    | 762M   | 512           | 15.1 | 2.72 | 32         | 1      | 1100000/2082009 | adafactor | 3.3e-5 | 8d 15h   | large     |
+ | [yhavinga/gpt-neo-1.3B-dutch](https://huggingface.co/yhavinga/gpt-neo-1.3B-dutch) | gpt neo | 1.3B   | 512           | 16.0 | 2.77 | 16         | 1      | 960000/3049896  | adafactor | 5e-4   | 7d 11h   | full      |
+
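+ The ppl and loss columns are related by ppl ≈ exp(loss); a quick check for this model's row:
+
+ ```python
+ import math
+ # Perplexity is the exponential of the evaluation (cross-entropy) loss.
+ print(round(math.exp(2.77), 1))  # ~16.0, matching the reported ppl
+ ```
+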
+ ## Acknowledgements
+
+ This project would not have been possible without compute generously provided by Google through the
+ [TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
+ instrumental in most, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
+ and training the models:
+
+ * [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
+ * [HuggingFace Flax MLM examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
+ * [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
+ * [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
+
+ Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)