add training details

README.md CHANGED

**Important:** To generate the best-quality summaries, you should use the global attention mask when decoding, as demonstrated in [this community notebook](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing); see the definition of `generate_answer(batch)`.
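
As a minimal sketch of what the notebook does, assuming this checkpoint is a LED-style (Longformer Encoder-Decoder) model, decoding with global attention on the first token looks roughly like this; the repo id and generation settings below are placeholders, not this card's recorded values:

```python
# Rough sketch only (assumes a LED-style checkpoint); the repo id is a placeholder
# and the generation settings are illustrative, not the notebook's exact values.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "pszemraj/led-large-book-summary"  # placeholder: substitute this model's Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

long_text = "..."  # the chapter or document to summarize

inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=16384)

# LED uses windowed (local) attention by default; the global attention mask marks
# tokens that attend to, and are attended by, every position. Setting it on the
# first token is the usual convention for summarization.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=512,
    num_beams=4,
    no_repeat_ngram_size=3,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```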

## Training and evaluation data

- the [booksum](https://arxiv.org/abs/2105.08209) dataset
- During training, the input text was the text of the `chapter`, and the output was `summary_text`
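
The exact preprocessing script is not part of this card; the snippet below is only a sketch of how `chapter` → `summary_text` pairs could be tokenized for seq2seq training. The `kmfoda/booksum` dataset id, the base tokenizer, and the length limits are assumptions.

```python
# Sketch only: building (chapter -> summary_text) training pairs.
# Dataset id, tokenizer, and max lengths are assumptions, not the card's recorded setup.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/led-large-16384")  # assumed base checkpoint

booksum = load_dataset("kmfoda/booksum")  # assumed Hub copy of the BookSum dataset

def preprocess(batch):
    # the chapter text is the encoder input
    model_inputs = tokenizer(batch["chapter"], truncation=True, max_length=16384)
    # the reference summary is the decoder target
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(batch["summary_text"], truncation=True, max_length=512)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = booksum.map(
    preprocess,
    batched=True,
    remove_columns=booksum["train"].column_names,
)
```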

## Training procedure

- Training completed on the BookSum dataset for 13 total epochs
- **The final four epochs combined the training and validation sets as 'train' in an effort to increase generalization.**
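
For that combined-split stage, one plausible way to fold the validation split into the training split with `datasets` is sketched below (the dataset id is again an assumption):

```python
# Sketch: merging the validation split into 'train' for the last training stage,
# as described above. The dataset id is an assumption.
from datasets import DatasetDict, concatenate_datasets, load_dataset

booksum = load_dataset("kmfoda/booksum")
booksum = DatasetDict({
    "train": concatenate_datasets([booksum["train"], booksum["validation"]]),
    "test": booksum["test"],
})
```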

### Training hyperparameters

#### Initial Three Epochs

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- lr_scheduler_type: linear
- num_epochs: 3

#### In-between Epochs

Unfortunately, complete records are not on hand for the middle epochs; the following should be representative:

- learning_rate: 4e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 16
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 6 (in addition to prior model)

#### Final Two Epochs

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 16
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 2 (in addition to prior model)
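
The original training scripts are not included in this card; purely as an illustration, the final-stage hyperparameters above map onto `Seq2SeqTrainingArguments` roughly as follows (the output path is a placeholder):

```python
# Illustration only: the "Final Two Epochs" hyperparameters expressed as Trainer arguments.
# This is not the training script that was actually used for this checkpoint.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./led-booksum-final-stage",  # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=1,   # train_batch_size: 1
    per_device_eval_batch_size=1,    # eval_batch_size: 1
    gradient_accumulation_steps=16,  # with one process this gives total_train_batch_size: 16
    num_train_epochs=2,              # continued on top of the earlier checkpoints
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,               # lr_scheduler_warmup_ratio
    adam_beta1=0.9,                  # optimizer: Adam with betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,               # and epsilon=1e-08
    seed=42,
)
```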

### Framework versions

- Transformers 4.19.2
- Pytorch 1.11.0+cu113
- Datasets 2.2.2
- Tokenizers 0.12.1