Update README
- README.md +7 -4
- run_gpt.sh +3 -2
README.md
CHANGED
@@ -14,10 +14,11 @@ datasets:
 ---
 # GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱
 
-
+Datasets:
 
-* [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned)
-*
+* [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned), dataset config: full (33B tokens)
+* A recreation of the TBC but for the Dutch language (see e.g.
+https://github.com/sgraaf/Replicate-Toronto-BookCorpus)
 
 Tokenizer:
 
@@ -26,12 +27,14 @@ Tokenizer:
 
 Training details:
 
-* Trained for
+* Trained for 320k steps (30 dec 2021)
 * Block size: 512
 * Optimizer: adam, lr 8e-4, beta1 0.9, beta2 0.98
 * Warmup steps: 5000
 * Weight decay: 0.01
 
+Further fine-tuned on a Dutch book corpus.
+
 Work in progress. Dec 2021-Jan2022
 
 * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
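For reference, the pre-training corpus listed in the README can be pulled from the Hub with the `datasets` library. A minimal sketch, assuming the `full` config named above and mC4's usual `text` column; streaming is optional and avoids materialising the ~33B-token corpus locally:

```python
# Load the cleaned Dutch mC4 corpus referenced in the README.
# "full" is the dataset config named above; streaming avoids a full download.
from datasets import load_dataset

mc4_nl = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)

# Peek at one document (mC4-style datasets expose a "text" column).
for example in mc4_nl.take(1):
    print(example["text"][:200])
```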
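The "Training details" bullets map onto an optax optimizer roughly as follows. This is a sketch rather than the repo's exact code: the linear warmup-then-decay shape and the 320k total steps are assumptions taken from the bullet list.

```python
# Sketch of the optimizer implied by "Training details": adam with lr 8e-4,
# beta1 0.9, beta2 0.98, weight decay 0.01 and 5000 warmup steps.
import optax

warmup_steps = 5_000
total_steps = 320_000  # "Trained for 320k steps"

# Linear warmup to the peak learning rate, then linear decay to zero (assumed shape).
lr_schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=8e-4,
                              transition_steps=warmup_steps),
        optax.linear_schedule(init_value=8e-4, end_value=0.0,
                              transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

optimizer = optax.adamw(
    learning_rate=lr_schedule,
    b1=0.9,
    b2=0.98,
    weight_decay=0.01,
)
```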
run_gpt.sh
CHANGED
@@ -15,6 +15,7 @@ python run_clm_flax.py \
 --output_dir="${MODEL_PATH}" \
 --model_type="gpt2" \
 --config_name="${MODEL_PATH}" \
+--model_name_or_path="${MODEL_PATH}" \
 --tokenizer_name="${MODEL_PATH}" \
 --preprocessing_num_workers="96" \
 --do_train --do_eval \
@@ -26,9 +27,9 @@ python run_clm_flax.py \
 --learning_rate="0.0024" --warmup_steps="5000" \
 --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
 --overwrite_output_dir \
---num_train_epochs="
+--num_train_epochs="4" \
 --logging_steps="500" \
---save_steps="
+--save_steps="10001" \
 --eval_steps="2500"
 
 # \
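The newly added `--model_name_or_path` flag is what makes this run continue from the weights already saved in `${MODEL_PATH}` instead of re-initialising from the config alone. A rough sketch of that difference in Transformers' Flax API, assuming the behaviour of the standard `run_clm_flax.py` example; the path, seed and dtype below are illustrative stand-ins:

```python
import jax.numpy as jnp
from transformers import AutoConfig, FlaxAutoModelForCausalLM

model_path = "./gpt2-medium-dutch"  # hypothetical stand-in for ${MODEL_PATH}

# With --model_name_or_path: existing checkpoint weights are loaded and training resumes.
model = FlaxAutoModelForCausalLM.from_pretrained(model_path)

# Without it (the previous behaviour): a model is randomly initialised from the config only.
config = AutoConfig.from_pretrained(model_path)
fresh_model = FlaxAutoModelForCausalLM.from_config(config, seed=42, dtype=jnp.float32)
```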