jarodrigues committed · Commit 190623d · 1 Parent(s): d2206f3

Update README.md
README.md CHANGED
@@ -113,7 +113,7 @@ Specifically, while the entire prompt received attention during fine-tuning, onl
 
 In terms of hyper-parameters, the model was trained with a learning rate of 2 * 10^-5, a weight decay of 0.1, a two-epoch training regime without warm-up, and to ensure the same number of tokens back-propagated per step, we employed an input sequence of 512 tokens with a batch size of 16 and 16 accumulation steps.
 
-Due to hardware limitations that imposed a shorter sequence length (512) compared to the base model (4096), instead of the typical practice of concatenating all training examples and then dividing them into batches with the same input sequence length, we
+Due to hardware limitations that imposed a shorter sequence length (512) compared to the base model (4096), instead of the typical practice of concatenating all training examples and then dividing them into batches with the same input sequence length, we separated each example individually.
 In other words, each example occupies the full input sequence length.
 
 To achieve this, we adapted the tokenizer of the base model to accept padding to allow grouping examples with different size into batches while preserving the original input sequence length.
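The hyper-parameters reported in the changed paragraph map onto a fairly standard Hugging Face Trainer configuration. The sketch below is an illustration only and is not part of the commit; the output directory is a placeholder, and the 512-token sequence length is assumed to be applied at tokenization time rather than through the training arguments.

```python
# Minimal sketch of the reported hyper-parameters, assuming the Hugging Face
# Trainer API. "output/" is a placeholder; the actual training script is not
# part of this commit.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/",               # placeholder output path
    learning_rate=2e-5,                 # learning rate of 2 * 10^-5
    weight_decay=0.1,                   # weight decay of 0.1
    num_train_epochs=2,                 # two-epoch training regime
    warmup_steps=0,                     # no warm-up
    per_device_train_batch_size=16,     # batch size of 16
    gradient_accumulation_steps=16,     # 16 accumulation steps
)
# The 512-token input sequence length is enforced when each example is
# tokenized (see the tokenizer sketch below), not in these arguments.
```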
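The change also refers to adapting the base model's tokenizer to accept padding so that individually tokenized examples of different sizes can be grouped into batches while preserving the 512-token input sequence length. A minimal sketch of that adaptation, assuming a transformers AutoTokenizer and a placeholder "base-model" checkpoint name:

```python
# Hypothetical illustration: register a padding token on a tokenizer that
# ships without one, then pad or truncate each example to a fixed 512 tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("base-model")  # placeholder name

# Decoder-only tokenizers often lack a pad token; reuse EOS as padding so
# that examples of different lengths can be batched together.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Each example is tokenized on its own (no concatenation across examples)
# and padded or truncated to the full input sequence length of 512 tokens.
batch = tokenizer(
    ["first example text", "a somewhat longer second example text"],  # placeholder texts
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
```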