---
license: unknown
language:
- en
---
# Baby Llama
Our submission to the `strict-small` track of the [BabyLM challenge](https://babylm.github.io/index.html).
Baby Llama is a 58-million-parameter model, distilled from an ensemble consisting of LLaMA-360M and GPT2-705M, both trained on the `babylm_10M` dataset.
See the associated paper (arXiv number **TBA**) for a detailed discussion of the training procedure and of the model's performance.
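
The checkpoint can be loaded with the standard 🤗 Transformers causal-LM classes. Below is a minimal usage sketch; the repository id is an assumption and may need to be adjusted.

```python
# Minimal usage sketch (the repository id "JLTastet/baby-llama-58m" is an assumption).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JLTastet/baby-llama-58m"  # assumed repository id; adjust if it differs
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short continuation to sanity-check the checkpoint.
inputs = tokenizer("The child picked up the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```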
### Hyperparameters for the tasks requiring fine-tuning
When evaluating the model on the [tasks that require fine-tuning](https://github.com/babylm/evaluation-pipeline/tree/main#fine-tuning),
we noticed that the [default hyperparameters](https://github.com/babylm/evaluation-pipeline/tree/main#hyperparameters)
suggested by the BabyLM organizers led to severe overfitting on several of the tasks.
To avoid this, we re-tuned those hyperparameters.
The sets of hyperparameters selected for each task are listed in the table below.
A star (*) indicates that the early-stopping criterion was triggered before the specified number of epochs was reached.
| Task | Initial learning rate | Batch size | Maximum epochs | Patience | Evaluate every (steps) | Random seed |
| ---- | ------------- | ---------- | -------- | -------- | ---------- | ---- |
| CoLA | | | | | | |
| SST-2 | | | | | | |
| MRPC | | | | | | |
| QQP | | | | | | |
| MNLI | | | | | | |
| MNLI-mm | | | | | | |
| QNLI | | | | | | |
| RTE | 5e-5 | 64 | 6 | 10 | 200 | 12 |
| BoolQ | 3e-4 | 16 | 10* | 10 | 10 | 12 |
| MultiRC | 1e-4 | 64 | 7 | 10 | 1000 | 42 |
| WSC | 5e-7 | 1 | 10 | 1000 | 2000 | 12 |
| CR (Control) | | | | | | |
| LC (Control) | | | | | | |
| MV (Control) | | | | | | |
| RP (Control) | | | | | | |
| SC (Control) | | | | | | |
| CR\_LC | | | | | | |
| CR\_RTP | | | | | | |
| MV\_LC | | | | | | |
| MV\_RTP | | | | | | |
| SC\_LC | | | | | | |
| SC\_RP | | | | | | |
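
To illustrate how the hyperparameters in the table map onto a fine-tuning run, here is a minimal sketch for the RTE row using the generic Hugging Face `Trainer` rather than the BabyLM evaluation pipeline itself. The repository id, the use of the standard GLUE RTE split (the pipeline ships its own filtered data), and the early-stopping metric are assumptions.

```python
# Illustrative fine-tuning sketch for the RTE row of the table above.
# Assumptions: repository id, standard GLUE RTE data, eval loss as the early-stopping metric.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_id = "JLTastet/baby-llama-58m"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:       # decoder-only tokenizers often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("glue", "rte")
def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)
dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="rte-finetune",
    learning_rate=5e-5,                  # initial learning rate (RTE row)
    per_device_train_batch_size=64,      # batch size (RTE row)
    num_train_epochs=6,                  # maximum epochs (RTE row)
    evaluation_strategy="steps",
    eval_steps=200,                      # evaluate every 200 steps (RTE row)
    save_strategy="steps",
    save_steps=200,
    seed=12,                             # random seed (RTE row)
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",   # assumed; the pipeline may track accuracy instead
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],  # patience (RTE row)
)
trainer.train()
```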