---
license: unknown
language:
  - en
---

# Baby Llama

Our submission to the strict-small track of the BabyLM challenge.

Baby Llama is a 58-million-parameter model, distilled from an ensemble consisting of LLaMA-360M and GPT2-705M, both trained on the babylm_10M dataset.
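The exact distillation setup is described in the paper; purely as an illustration (this is not the authors' training code), ensemble distillation of this kind is often implemented by averaging the teachers' logits and training the student to match the resulting distribution:

```python
# Illustrative sketch only, not the authors' training code: generic ensemble
# distillation, where the student matches the averaged teacher distribution.
# The temperature value is an arbitrary example.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    # Average the teachers' logits to form a single ensemble target.
    ensemble_logits = torch.stack(teacher_logits_list).mean(dim=0)
    # KL divergence between the softened student and ensemble distributions.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(ensemble_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
```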

See the associated paper (arXiv number TBA) for a detailed discussion of the training procedure and of the model's performance.
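A minimal usage sketch with 🤗 Transformers is shown below; the repository id is a hypothetical placeholder and should be adjusted to wherever the checkpoint is actually hosted.

```python
# Minimal usage sketch; the repository id below is an assumption, not confirmed
# by this model card. Adjust it to the actual location of the checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JLTastet/baby-llama-58m"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The child said that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```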

## Hyperparameters for the tasks requiring fine-tuning

When evaluating the model on the tasks that require fine-tuning, we noticed that the default hyperparameters suggested by the BabyLM organizers led to severe overfitting on a number of tasks. To avoid this issue, we re-tuned those hyperparameters.

The sets of hyperparameters selected for each task are listed in the table below. A star (*) indicates that the early-stopping criterion was triggered before the specified number of epochs was reached.

| Task | Initial learning rate | Batch size | Maximum epochs | Patience | Evaluate every (steps) | Random seed |
|------|-----------------------|------------|----------------|----------|------------------------|-------------|
| CoLA, SST-2, MRPC, QQP, MNLI, MNLI-mm, QNLI, RTE | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| BoolQ | 3e-4 | 16 | 10* | 10 | 10 | 12 |
| MultiRC | 1e-4 | 64 | 7 | 10 | 1000 | 42 |
| WSC | 5e-7 | 1 | 10 | 1000 | 2000 | 12 |
| CR (Control), LC (Control), MV (Control), RP (Control), SC (Control), CR_LC, CR_RTP, MV_LC, MV_RTP, SC_LC, SC_RP | | | | | | |
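
As an illustration of how one row of this table could be expressed in code, the sketch below maps the RTE settings onto the 🤗 Trainer API. This is only an assumption about the tooling; the actual evaluation pipeline is the one provided by the BabyLM organizers and may differ in detail.

```python
# Hedged sketch: the RTE hyperparameters from the table expressed with the
# Hugging Face Trainer API. The output directory name is hypothetical, and the
# official BabyLM evaluation pipeline may wire these options up differently.
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="rte-finetune",        # hypothetical output directory
    learning_rate=5e-5,               # initial learning rate
    per_device_train_batch_size=64,   # batch size
    num_train_epochs=6,               # maximum epochs
    evaluation_strategy="steps",
    eval_steps=200,                   # evaluate every 200 steps
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,      # needed for early stopping
    seed=12,                          # random seed
)

early_stopping = EarlyStoppingCallback(early_stopping_patience=10)
# trainer = Trainer(model=..., args=training_args, callbacks=[early_stopping], ...)
```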