language: de
widget:
  - text: >-
      In einer schockierenden Entdeckung fanden Wissenschaftler eine Herde
      Einhörner, die in einem abgelegenen, zuvor unerforschten Tal in den Anden
      lebten.
license: mit

GerPT2

A small German GPT2.

See the GPT2 model card for considerations on limitations and bias. See the GPT2 documentation for details on GPT2.

Comparison to dbmdz/german-gpt2

I evaluated both GerPT2 and the other German GPT2, dbmdz/german-gpt2, by perplexity (PPL) on the CC-100 dataset and on the German Wikipedia:

| Model | CC-100 (PPL) | Wikipedia (PPL) |
|---|---|---|
| dbmdz/german-gpt2 | 49.47 | 62.92 |
| GerPT2 | 24.78 | 35.33 |

See the script evaluate.py in the GerPT2 GitHub repository for the code.
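
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower values are better. The following is a minimal sketch of that formula only; it is not the code from evaluate.py, and the function name `perplexity` and the toy log-probabilities are made up for illustration.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input (assumption): a model that assigns probability 1/4 to every token.
# Such a model is "as uncertain as" a uniform choice among 4 options,
# so its perplexity is approximately 4.
toy_log_probs = [math.log(0.25)] * 8
print(perplexity(toy_log_probs))
```

In a real evaluation the log-probabilities would come from the language model's predictions on held-out text.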

Usage

[Image: GerPT2 usage]

Also, two tricks might improve the generated text:

import torch

# Assumes `model`, `tokenizer`, `prompt` and `max_length` are already defined.
output = model.generate(
    # During training, an EOS token marked the beginning of each text,
    # so it can help to insert one at the start of the prompt.
    torch.tensor(
        [tokenizer.eos_token_id] + tokenizer.encode(prompt)
    ).unsqueeze(0),
    do_sample=True,
    # bad_words_ids=[[0]] disallows generating the EOS token (ID 0). Without it, the model
    # is prone to ending generation early, because a significant share of the texts in the
    # training corpus are quite short.
    bad_words_ids=[[0]],
    max_length=max_length,
)[0]
print(tokenizer.decode(output))

Training details

GerPT2 was trained on the entire German portion (67GB) of the CC-100 corpus, with weights initialized from the English GPT2 model. It was trained with:

  • a batch size of 256
  • a OneCycle learning rate schedule with a maximum learning rate of 5e-3
  • the AdamW optimizer with a weight decay of 0.01
  • 7 epochs of training

Training took roughly 6 days on 8 TPUv3 cores.
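
To illustrate the OneCycle schedule mentioned above, here is a minimal pure-Python sketch: the learning rate warms up to the maximum and then anneals to a very small value. Only the maximum of 5e-3 comes from this README. The cosine shape, the warmup fraction `pct_start`, and the two divisor factors are assumptions (they mirror common defaults), not the settings of the actual run.

```python
import math


def one_cycle_lr(step, total_steps, max_lr=5e-3, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` for a two-phase cosine OneCycle schedule.

    pct_start, div_factor and final_div_factor are assumed values.
    """
    initial_lr = max_lr / div_factor          # learning rate at step 0
    min_lr = initial_lr / final_div_factor    # learning rate at the last step
    warmup_steps = max(1, int(pct_start * total_steps))

    if step < warmup_steps:
        # Phase 1: rise from initial_lr to max_lr.
        progress = step / warmup_steps
        start, end = initial_lr, max_lr
    else:
        # Phase 2: fall from max_lr to min_lr.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        start, end = max_lr, min_lr

    # Cosine interpolation: weight 1 at progress 0 (start), weight 0 at progress 1 (end).
    weight = (1 + math.cos(math.pi * progress)) / 2
    return end + (start - end) * weight


# Print the schedule at a few points of a hypothetical 100-step run.
for s in (0, 15, 30, 65, 100):
    print(s, one_cycle_lr(s, total_steps=100))
```

In practice one would normally use the scheduler shipped with the training framework rather than a hand-written function; the sketch only shows the shape of the curve.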

To train GerPT2, follow these steps. The scripts are located in the GitHub repository:

  1. Download and unzip training data from http://data.statmt.org/cc-100/.
  2. Train a tokenizer using prepare/train_tokenizer.py. As training data for the tokenizer I used a random subset of 5% of the CC-100 data.
  3. (Optional) Generate a German input embedding matrix with prepare/generate_aligned_wte.py. This uses a neat trick to semantically map tokens from the English tokenizer to tokens from the German tokenizer using aligned word embeddings. For example (German -> English):
ĠMinde -> Ġleast
Ġjed -> Ġwhatsoever
flughafen -> Air
vermittlung -> employment
teilung -> ignment
ĠInterpretation -> Ġinterpretation
Ġimport -> Ġimported
hansa -> irl
genehmigungen -> exempt
ĠAuflist -> Ġlists
Ġverschwunden -> Ġdisappeared
ĠFlyers -> ĠFlyers
Kanal -> Channel
Ġlehr -> Ġteachers
Ġnahelie -> Ġconvenient
gener -> Generally
mitarbeiter -> staff

This helped a lot in a trial run I did, although I wasn't able to do a full comparison due to budget and time constraints. To use this WTE matrix, pass it to the training script via wte_path. Credit to this blog post for the idea of initializing GPT2 from English weights.
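
The idea behind the token mapping can be sketched as follows. This is not the code from prepare/generate_aligned_wte.py: it assumes that each German token is matched to its nearest English token by cosine similarity in a shared (aligned) embedding space, and that the English token's GPT2 input embedding is then copied over. All vectors and the helper names `cosine`, `map_tokens` and `build_german_wte` are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_tokens(german_vecs, english_vecs):
    """Map each German token to the most similar English token (assumed rule)."""
    return {
        de_tok: max(english_vecs, key=lambda en_tok: cosine(de_vec, english_vecs[en_tok]))
        for de_tok, de_vec in german_vecs.items()
    }


def build_german_wte(mapping, english_wte):
    """Initialize each German token's embedding from its mapped English token."""
    return {de_tok: list(english_wte[en_tok]) for de_tok, en_tok in mapping.items()}


# Toy aligned word embeddings (made-up 2-d vectors in one shared space).
german_vecs = {"Kanal": [0.9, 0.1], "mitarbeiter": [0.1, 0.9]}
english_vecs = {"Channel": [1.0, 0.0], "staff": [0.0, 1.0]}
# Toy stand-in for rows of the English GPT2 input embedding matrix.
english_wte = {"Channel": [0.5, -0.2, 0.1], "staff": [-0.3, 0.4, 0.7]}

mapping = map_tokens(german_vecs, english_vecs)
german_wte = build_german_wte(mapping, english_wte)
print(mapping)  # {'Kanal': 'Channel', 'mitarbeiter': 'staff'}
```

With real vocabularies the nearest-neighbor search would run over tens of thousands of tokens, so a vectorized implementation would be the natural choice.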

  4. Tokenize the corpus using prepare/tokenize_text.py. This generates files for train and validation tokens in JSON Lines format.
  5. Run the training script train.py! run.sh shows how this was executed for the full run with config configs/tpu.json.
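
The tokenization step above writes train and validation tokens in JSON Lines format, meaning one JSON value per line. The sketch below shows what such a round trip could look like. The exact record layout used by prepare/tokenize_text.py is not described here, so the one-array-of-IDs-per-line layout, the whitespace `toy_encode` tokenizer, and the split rule are all assumptions.

```python
import io
import json


def toy_encode(text, vocab):
    """Hypothetical stand-in for the real tokenizer: one ID per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def write_jsonl(token_lists, stream):
    """Write one JSON array of token IDs per line."""
    for ids in token_lists:
        stream.write(json.dumps(ids) + "\n")


def read_jsonl(stream):
    """Read the token lists back, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


texts = ["ein kurzer Text", "noch ein Text", "ein dritter Text"]
vocab = {}
encoded = [toy_encode(text, vocab) for text in texts]

# Hold out the last text for validation (the split rule is an assumption).
train, validation = encoded[:-1], encoded[-1:]

train_file = io.StringIO()  # in-memory stand-in for a train file on disk
write_jsonl(train, train_file)
train_file.seek(0)
restored = read_jsonl(train_file)
print(restored)
```

Because each line is an independent JSON value, a training script can stream such a file line by line instead of loading the whole corpus into memory.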

License

GerPT2 is licensed under the MIT License.

Acknowledgements

Thanks to Hugging Face for awesome tools and infrastructure. Special thanks to PetFinder.my for generously sponsoring the resources used for training.