benjamin committed
Commit f82bf1a · 1 Parent(s): 6d40811

Update README.md

Files changed (1):
  1. README.md +16 -50
README.md CHANGED
 
German large and small versions of GPT2:

- https://huggingface.co/benjamin/gerpt2
- https://huggingface.co/benjamin/gerpt2-large

See the [GPT2 model card](https://huggingface.co/gpt2) for considerations on limitations and bias. See the [GPT2 documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for details on GPT2.

I evaluated both GerPT2-large and the other German GPT2, [dbmdz/german-gpt2](https://huggingface.co/dbmdz/german-gpt2), on the [CC-100](http://data.statmt.org/cc-100/) dataset and on the German Wikipedia:

|                   | CC-100 (PPL) | Wikipedia (PPL) |
|-------------------|--------------|-----------------|
| dbmdz/german-gpt2 | 49.47        | 62.92           |
| GerPT2            | 24.78        | 35.33           |
| GerPT2-large      | __16.08__    | __23.26__       |

See the script `evaluate.py` in the [GerPT2 GitHub repository](https://github.com/bminixhofer/gerpt2) for the code.
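
The reference implementation is `evaluate.py` in the repository; the snippet below is only a minimal sketch of how such a perplexity evaluation can be set up with `transformers`. The held-out text file name is a placeholder, and the chunking scheme is an assumption, not necessarily what `evaluate.py` does.

```python
# Minimal perplexity sketch (not the repository's evaluate.py).
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("benjamin/gerpt2-large")
model = AutoModelForCausalLM.from_pretrained("benjamin/gerpt2-large")
model.eval()

# Plain-text file of held-out German data; the file name is a placeholder.
text = open("heldout_german.txt", encoding="utf-8").read()
ids = tokenizer.encode(text, return_tensors="pt")

max_len = model.config.n_positions  # GPT2 context size (1024)
nlls, n_tokens = [], 0
with torch.no_grad():
    # Score the text in non-overlapping chunks (evaluate.py may window differently).
    for start in range(0, ids.size(1), max_len):
        chunk = ids[:, start : start + max_len]
        if chunk.size(1) < 2:
            continue
        loss = model(chunk, labels=chunk).loss  # mean NLL per predicted token
        nlls.append(loss * (chunk.size(1) - 1))
        n_tokens += chunk.size(1) - 1

ppl = math.exp(torch.stack(nlls).sum().item() / n_tokens)
print(f"Perplexity: {ppl:.2f}")
```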

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("benjamin/gerpt2-large")
model = AutoModelForCausalLM.from_pretrained("benjamin/gerpt2-large")

prompt = "<your prompt>"

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])
```

Also, two tricks might improve the generated text:

```python
import torch

max_length = 100  # placeholder; choose the maximum length of the generated sequence

output = model.generate(
    # During training an EOS token was used to mark the beginning of each text,
    # so it can help to insert it at the start.
    torch.tensor(
        [tokenizer.eos_token_id] + tokenizer.encode(prompt)
    ).unsqueeze(0),
    do_sample=True,
    # Try setting bad_words_ids=[[0]] to disallow generating an EOS token. Without this,
    # the model is prone to ending generation early, because a significant number of texts
    # from the training corpus are quite short.
    bad_words_ids=[[0]],
    max_length=max_length,
)[0]
print(tokenizer.decode(output))
```

## Training details

GerPT2-large was trained on the entire German data (67GB) from the [CC-100 Corpus](http://data.statmt.org/cc-100/), with weights initialized from the [English GPT2 model](https://huggingface.co/gpt2-large).

GerPT2-large was trained for 2 epochs with:

- a batch size of 256
- a OneCycle learning rate schedule with a maximum learning rate of 5e-3
- the AdamW optimizer with a weight decay of 0.01

Training took roughly 12 days on 8 TPUv3 cores.
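
The actual training code is `train.py` in the repository; the following is only a rough PyTorch sketch of the hyperparameters listed above. The dummy batch, the number of steps, and the single-device loop are placeholders, not the setup of the real TPU run.

```python
# Rough sketch of the listed hyperparameters (not the repository's train.py).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2-large")  # initialize from English GPT2 weights

total_steps = 1000  # placeholder; the real run covers 2 epochs at an effective batch size of 256
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=5e-3, total_steps=total_steps)

batch = torch.randint(0, model.config.vocab_size, (2, 128))  # dummy token ids standing in for CC-100 batches
for step in range(total_steps):
    # Standard causal language modeling: the labels are the inputs themselves (shifted internally).
    loss = model(batch, labels=batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```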
 
To train GerPT2-large, follow these steps. Scripts are located in the [GitHub repository](https://github.com/bminixhofer/gerpt2):

0. Download and unzip the training data from http://data.statmt.org/cc-100/.
1. Train a tokenizer using `prepare/train_tokenizer.py`. As training data for the tokenizer I used a random subset of 5% of the CC-100 data.
2. (Optionally) generate a German input embedding matrix with `prepare/generate_aligned_wte.py`. This uses a neat trick: tokens from the English tokenizer are semantically mapped to tokens from the German tokenizer using aligned word embeddings. For example:

```
ĠMinde -> Ġleast
Ġjed -> Ġwhatsoever
flughafen -> Air
vermittlung -> employment
teilung -> ignment
ĠInterpretation -> Ġinterpretation
Ġimport -> Ġimported
hansa -> irl
genehmigungen -> exempt
ĠAuflist -> Ġlists
Ġverschwunden -> Ġdisappeared
ĠFlyers -> ĠFlyers
Kanal -> Channel
Ġlehr -> Ġteachers
Ġnahelie -> Ġconvenient
gener -> Generally
mitarbeiter -> staff
```

This helped a lot in a trial run I did, although I wasn't able to do a full comparison due to budget and time constraints. To use this WTE matrix, pass it via `wte_path` to the training script. Credit to [this blogpost](https://medium.com/@pierre_guillou/faster-than-training-from-scratch-fine-tuning-the-english-gpt-2-in-any-language-with-hugging-f2ec05c98787) for the idea of initializing GPT2 from English weights.
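
The actual implementation is `prepare/generate_aligned_wte.py`; the snippet below is only a minimal sketch of the underlying idea. The `en_vecs` and `de_vecs` lookups are hypothetical and assumed to map tokens to vectors in a shared, cross-lingually aligned embedding space.

```python
# Minimal sketch of the aligned-WTE idea (not prepare/generate_aligned_wte.py).
# Assumption: en_vecs and de_vecs map tokens to vectors in one aligned space
# (e.g. word embeddings aligned with a Procrustes method); both are hypothetical.
import torch


def aligned_wte(en_wte, en_tokenizer, de_tokenizer, en_vecs, de_vecs):
    """Build a German input embedding matrix from the English one.

    For each German token, find the English token whose aligned word vector is
    closest and copy that row of the English WTE matrix.
    """
    de_wte = torch.empty(len(de_tokenizer), en_wte.size(1))
    en_tokens = list(en_tokenizer.get_vocab().keys())
    en_matrix = torch.stack([torch.tensor(en_vecs[t]) for t in en_tokens])

    for de_token, de_id in de_tokenizer.get_vocab().items():
        de_vec = torch.tensor(de_vecs[de_token])
        # Cosine similarity between this German token and every English token.
        sims = torch.nn.functional.cosine_similarity(de_vec.unsqueeze(0), en_matrix)
        best_en_id = en_tokenizer.get_vocab()[en_tokens[int(sims.argmax())]]
        de_wte[de_id] = en_wte[best_en_id]  # e.g. "flughafen" -> the row of "Air"
    return de_wte
```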

3. Tokenize the corpus using `prepare/tokenize_text.py`. This generates train and validation token files in JSON Lines format (see the sketch after this list for an illustration of the output).
4. Run the training script `train.py`! `run.sh` shows how this was executed for the full run with config `configs/tpu_large.json`.
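
As a rough illustration of step 3 (not the repository's `prepare/tokenize_text.py`): tokenized texts could be written to JSON Lines like this. The file names and the `"input_ids"` field name are assumptions for the sketch.

```python
# Rough illustration of step 3; file names and the "input_ids" field are assumptions.
import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("benjamin/gerpt2-large")

with open("de.txt", encoding="utf-8") as src, open("train.jsonl", "w", encoding="utf-8") as out:
    for line in src:
        line = line.strip()
        if not line:
            continue
        # One JSON object per line: the token ids of one text.
        out.write(json.dumps({"input_ids": tokenizer.encode(line)}) + "\n")
```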

## License

GerPT2-large is licensed under the MIT License.

## Citing

Please cite GerPT2 as follows:

```
@misc{Minixhofer_GerPT2_German_large_2020,
    author = {Minixhofer, Benjamin},
    doi = {10.5281/zenodo.5509984},
    month = {12},
    title = {{GerPT2: German large and small versions of GPT2}},
    url = {https://github.com/bminixhofer/gerpt2},
    year = {2020}
}
```

## Acknowledgements

Thanks to [Hugging Face](https://huggingface.co) for awesome tools and infrastructure.

Huge thanks to [Artus Krohn-Grimberghe](https://twitter.com/artuskg) at [LYTiQ](https://www.lytiq.de/) for making this possible by sponsoring the resources used for training.