metadata

language: nl
widget:
  - text: In het jaar 2030 zullen we
  - text: Toen ik gisteren volledig in de ban was van
  - text: >-
      Studenten en leraren van de Bogazici Universiteit in de Turkse stad
      Istanbul
  - text: In Israël was een strenge lockdown
tags:
  - gpt2-medium
  - gpt2
pipeline_tag: text-generation
datasets:
  - yhavinga/mc4_nl_cleaned

GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱
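
As a rough usage sketch, the widget prompts from the metadata can be fed to a text-generation pipeline. This card does not state the repository name, so `model_id` below is a placeholder to fill in yourself, and the sampling settings (`top_k`, `top_p`, `max_new_tokens`) are illustrative assumptions rather than recommended values.

```python
# Prompts copied from the widget examples in this card's metadata.
WIDGET_PROMPTS = [
    "In het jaar 2030 zullen we",
    "Toen ik gisteren volledig in de ban was van",
    "Studenten en leraren van de Bogazici Universiteit in de Turkse stad Istanbul",
    "In Israël was een strenge lockdown",
]


def generate(model_id: str, prompt: str, max_new_tokens: int = 50) -> str:
    """Return `prompt` plus a sampled continuation from the checkpoint at `model_id`.

    `model_id` is a placeholder (Hub repository name or local path of this
    checkpoint). The sampling settings are assumptions, not tuned values.
    """
    # Imported lazily so this file can be loaded without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sample instead of greedy decoding
        top_k=50,
        top_p=0.95,
    )
    return outputs[0]["generated_text"]
```

Calling `generate(model_id, WIDGET_PROMPTS[0])` with the actual repository name should return the prompt followed by generated Dutch text.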

Datasets:

mC4 NL Cleaned, dataset config: full (33B tokens)
A Dutch-language recreation of the Toronto BookCorpus (TBC); see e.g. https://github.com/sgraaf/Replicate-Toronto-BookCorpus

Tokenizer:

Tokenizer trained on mC4 with scripts from the Hugging Face Transformers Flax examples
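
For orientation, tokenizer training in the style of the Flax examples might look like the sketch below. It is not the exact script used for this model: the vocabulary size (GPT-2's original 50257), `min_frequency`, the special token, the output path, and the helper names `batch_iterator` and `train_tokenizer` are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def batch_iterator(texts: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` texts so the corpus is never fully in memory."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter batch
        yield batch


def train_tokenizer(texts: Iterable[str], vocab_size: int = 50257,
                    out_path: str = "tokenizer.json") -> None:
    """Train a byte-level BPE tokenizer on `texts` and save it to `out_path`.

    All hyperparameters here are assumed values, not taken from this card.
    """
    # Imported lazily so the helper above works without `tokenizers` installed.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train_from_iterator(
        batch_iterator(texts),
        vocab_size=vocab_size,
        min_frequency=2,
        special_tokens=["<|endoftext|>"],
    )
    tokenizer.save(out_path)
```

Here `texts` would be an iterable over the `text` field of the cleaned Dutch mC4 dataset.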

Training details:

Trained for 320k steps (30 Dec 2021)
Block size: 512
Optimizer: Adam, learning rate 8e-4, beta1 0.9, beta2 0.98
Warmup steps: 5000
Weight decay: 0.01
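
To make the warmup setting concrete, here is a small sketch of a learning-rate schedule built from the numbers above. The linear warmup shape and the linear decay to zero at the final step are assumptions; the card only states the peak learning rate, the number of warmup steps, and the total step count.

```python
PEAK_LR = 8e-4         # learning rate from the training details
WARMUP_STEPS = 5_000   # warmup steps from the training details
TOTAL_STEPS = 320_000  # total training steps from the training details


def learning_rate(step: int) -> float:
    """Learning rate at `step`: linear warmup, then (assumed) linear decay to zero."""
    if step < WARMUP_STEPS:
        # Ramp linearly from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumed shape: decay linearly from the peak to 0 at the last step.
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 2_500, 5_000, 320_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate reaches its peak of 8e-4 at step 5000 and returns to zero at step 320k.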

Further fine-tuned on a Dutch book corpus.

Work in progress, Dec 2021 to Jan 2022.

Many thanks to the Google TPU Research Cloud for providing access to a TPU cluster!
Thanks to @gsarti for creating the t5-flax-gcp repository.
Also thanks to the creators of gpt2-medium-persian and gpt2-medium-indonesian for sharing their training scripts!