---
language: de
widget:
- text: >-
    In einer schockierenden Entdeckung fanden Wissenschaftler eine Herde
    Einhörner, die in einem abgelegenen, zuvor unerforschten Tal in den Anden
    lebten.
license: mit
---
# GerPT2

A small German GPT2.
See the GPT2 model card for considerations on limitations and bias, and the GPT2 documentation for details on GPT2.
## Comparison to dbmdz/german-gpt2

I evaluated both GerPT2 and the other German GPT2, dbmdz/german-gpt2, on the CC-100 dataset and on the German Wikipedia:
| | CC-100 (PPL) | Wikipedia (PPL) |
|---|---|---|
| dbmdz/german-gpt2 | 49.47 | 62.92 |
| GerPT2 | 24.78 | 35.33 |
See the script `evaluate.py` in the GerPT2 GitHub repository for the code.
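For reference, perplexity is the exponential of the mean per-token cross-entropy. A minimal sketch of the metric itself (not the actual code from `evaluate.py`):

```python
import math

import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity of a token sequence given next-token logits.

    logits:  (seq_len, vocab_size) model predictions.
    targets: (seq_len,) the tokens those positions should have predicted.
    """
    nll = F.cross_entropy(logits, targets)  # mean negative log-likelihood
    return math.exp(nll.item())

# Sanity check: a model that is uniform over V tokens has perplexity exactly V.
V = 50
print(round(perplexity(torch.zeros(10, V), torch.randint(0, V, (10,)))))  # → 50
```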
## Usage
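For basic generation, the `transformers` pipeline API can be used. A sketch; the Hub id `benjamin/gerpt2` and the German example prompt are my assumptions, not taken from this card:

```python
from transformers import pipeline

def generate_german(prompt: str, model_id: str = "benjamin/gerpt2") -> str:
    # Model id is assumed from this repository's name; substitute the
    # actual id if it differs.
    pipe = pipeline("text-generation", model=model_id)
    return pipe(prompt)[0]["generated_text"]

# Example (downloads the model on first use):
# print(generate_german("Der Sinn des Lebens ist"))
```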
Two tricks might improve the generated text:
```python
# `model`, `tokenizer`, `prompt` and `max_length` are assumed to be defined.
output = model.generate(
    # During training an EOS token was used to mark the beginning of each text,
    # so it can help to insert it at the start.
    torch.tensor(
        [tokenizer.eos_token_id] + tokenizer.encode(prompt)
    ).unsqueeze(0),
    do_sample=True,
    # Try setting bad_words_ids=[[0]] to disallow generating an EOS token:
    # without this the model is prone to ending generation early, because a
    # significant number of texts from the training corpus are quite short.
    bad_words_ids=[[0]],
    max_length=max_length,
)[0]
print(tokenizer.decode(output))
```
## Training details
GerPT2 was trained on the entire German portion (67 GB) of the CC-100 corpus, and its weights were initialized from the English GPT2 model. GerPT2 was trained with:
- a batch size of 256
- a OneCycle learning rate schedule with a maximum learning rate of 5e-3
- the AdamW optimizer with a weight decay of 0.01
- for 7 epochs
Training took roughly 6 days on 8 TPUv3 cores.
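The optimizer setup above can be sketched with PyTorch's built-in `AdamW` and `OneCycleLR` (the model and step count are stand-ins, not the actual training code):

```python
import torch

model = torch.nn.Linear(10, 10)   # stand-in for the GPT2 model
total_steps = 100                 # stand-in for steps_per_epoch * 7 epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=5e-3, total_steps=total_steps
)

lrs = []
for _ in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

# The learning rate rises to the 5e-3 peak, then anneals towards zero.
print(f"peak lr: {max(lrs):.0e}")  # → peak lr: 5e-03
```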
To train GerPT2, follow these steps. Scripts are located in the GitHub repository:

1. Download and unzip training data from http://data.statmt.org/cc-100/.
2. Train a tokenizer using `prepare/train_tokenizer.py`. As training data for the tokenizer I used a random subset of 5% of the CC-100 data.
3. (Optionally) generate a German input embedding matrix with `prepare/generate_aligned_wte.py`. This uses a neat trick to semantically map tokens from the English tokenizer to tokens from the German tokenizer using aligned word embeddings. E.g.:

   ```
   ĠMinde -> Ġleast
   Ġjed -> Ġwhatsoever
   flughafen -> Air
   vermittlung -> employment
   teilung -> ignment
   ĠInterpretation -> Ġinterpretation
   Ġimport -> Ġimported
   hansa -> irl
   genehmigungen -> exempt
   ĠAuflist -> Ġlists
   Ġverschwunden -> Ġdisappeared
   ĠFlyers -> ĠFlyers
   Kanal -> Channel
   Ġlehr -> Ġteachers
   Ġnahelie -> Ġconvenient
   gener -> Generally
   mitarbeiter -> staff
   ```

   This helped a lot in a trial run I did, although I wasn't able to do a full comparison due to budget and time constraints. To use this WTE matrix, pass it to the training script via the `wte_path` argument. Credit to this blog post for the idea of initializing GPT2 from English weights.
4. Tokenize the corpus using `prepare/tokenize_text.py`. This generates files for train and validation tokens in JSON Lines format.
5. Run the training script `train.py`! `run.sh` shows how this was executed for the full run with config `configs/tpu.json`.
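The embedding-initialization trick from step 3 can be sketched as a nearest-neighbour lookup in the aligned word-embedding space (a simplification on my part; see `prepare/generate_aligned_wte.py` for the actual implementation):

```python
import torch
import torch.nn.functional as F

def aligned_wte(en_wte: torch.Tensor, en_aligned: torch.Tensor,
                de_aligned: torch.Tensor) -> torch.Tensor:
    """Initialize a German embedding matrix from English GPT2 embeddings.

    en_wte:     (n_en, d_model) English GPT2 input embeddings.
    en_aligned: (n_en, d_align) English tokens in the aligned embedding space.
    de_aligned: (n_de, d_align) German tokens in the same aligned space.
    """
    # Cosine similarity between every German token and every English token.
    sim = F.normalize(de_aligned, dim=1) @ F.normalize(en_aligned, dim=1).T
    nearest = sim.argmax(dim=1)  # closest English token per German token
    # Each German token starts from its nearest English neighbour's embedding.
    return en_wte[nearest]
```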
## License

GerPT2 is licensed under the MIT License.
## Acknowledgements

Thanks to Hugging Face for awesome tools and infrastructure. Special thanks to PetFinder.my for generously sponsoring the resources used for training.