Update README
- README.md +7 -4
- run_gpt.sh +3 -2
README.md
CHANGED
@@ -14,10 +14,11 @@ datasets:
 ---
 # GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱
 
-
+Datasets:
 
-* [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned)
-*
+* [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned), dataset config: full (33B tokens)
+* A recreation of the TBC but for the Dutch language (see e.g.
+https://github.com/sgraaf/Replicate-Toronto-BookCorpus)
 
 Tokenizer:
 
@@ -26,12 +27,14 @@ Tokenizer:
 
 Training details:
 
-* Trained for
+* Trained for 320k steps (30 dec 2021)
 * Block size: 512
 * Optimizer: adam, lr 8e-4, beta1 0.9, beta2 0.98
 * Warmup steps: 5000
 * Weight decay: 0.01
 
+Further fine-tuned on a Dutch book corpus.
+
 Work in progress. Dec 2021-Jan2022
 
 * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
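For reference, the pre-training corpus listed in the README can be pulled from the Hub with the `datasets` library. A minimal sketch, assuming the `full` config named above and mC4's usual `text` column; streaming is optional and avoids materialising the ~33B-token corpus locally:

```python
# Load the cleaned Dutch mC4 corpus referenced in the README.
# "full" is the dataset config named above; streaming avoids a full download.
from datasets import load_dataset

mc4_nl = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)

# Peek at one document (mC4-style datasets expose a "text" column).
for example in mc4_nl.take(1):
    print(example["text"][:200])
```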
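The "Training details" bullets map onto an optax optimizer roughly as follows. This is a sketch rather than the repo's exact code: the linear warmup-then-decay shape and the 320k total steps are assumptions taken from the bullet list.

```python
# Sketch of the optimizer implied by "Training details": adam with lr 8e-4,
# beta1 0.9, beta2 0.98, weight decay 0.01 and 5000 warmup steps.
import optax

warmup_steps = 5_000
total_steps = 320_000  # "Trained for 320k steps"

# Linear warmup to the peak learning rate, then linear decay to zero (assumed shape).
lr_schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=8e-4,
                              transition_steps=warmup_steps),
        optax.linear_schedule(init_value=8e-4, end_value=0.0,
                              transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

optimizer = optax.adamw(
    learning_rate=lr_schedule,
    b1=0.9,
    b2=0.98,
    weight_decay=0.01,
)
```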
run_gpt.sh
CHANGED
@@ -15,6 +15,7 @@ python run_clm_flax.py \
 --output_dir="${MODEL_PATH}" \
 --model_type="gpt2" \
 --config_name="${MODEL_PATH}" \
+--model_name_or_path="${MODEL_PATH}" \
 --tokenizer_name="${MODEL_PATH}" \
 --preprocessing_num_workers="96" \
 --do_train --do_eval \
@@ -26,9 +27,9 @@ python run_clm_flax.py \
 --learning_rate="0.0024" --warmup_steps="5000" \
 --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
 --overwrite_output_dir \
---num_train_epochs="
+--num_train_epochs="4" \
 --logging_steps="500" \
---save_steps="
+--save_steps="10001" \
 --eval_steps="2500"
 
 # \
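The newly added `--model_name_or_path` flag is what makes this run continue from the weights already saved in `${MODEL_PATH}` instead of re-initialising from the config alone. A rough sketch of that difference in Transformers' Flax API, assuming the behaviour of the standard `run_clm_flax.py` example; the path, seed and dtype below are illustrative stand-ins:

```python
import jax.numpy as jnp
from transformers import AutoConfig, FlaxAutoModelForCausalLM

model_path = "./gpt2-medium-dutch"  # hypothetical stand-in for ${MODEL_PATH}

# With --model_name_or_path: existing checkpoint weights are loaded and training resumes.
model = FlaxAutoModelForCausalLM.from_pretrained(model_path)

# Without it (the previous behaviour): a model is randomly initialised from the config only.
config = AutoConfig.from_pretrained(model_path)
fresh_model = FlaxAutoModelForCausalLM.from_config(config, seed=42, dtype=jnp.float32)
```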