Update README.md
README.md CHANGED
@@ -118,19 +118,13 @@ model-index:
# SmolLM2 1.7b Instruction Tuned & DPO Aligned through Tulu 3!

-![SmolTulu Banner](
+![SmolTulu Banner](smoltulubanner.png)

SmolTulu-v0.1 is the first in a series of models meant to leverage [AllenAI's Tulu 3 post-training pipeline](https://allenai.org/blog/tulu-3-technical) to tune the [base version of Huggingface's SmolLM2-1.7b](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B)! The post-training pipeline AllenAI came up with seemed like a perfect fit to apply here.

This model achieves the highest current score on both IFEval and GSM8k while maintaining the extremely low contamination levels of Tulu 3 and SmolLM2! I've listed the datasets used for both the SFT (supervised finetuning) and DPO (direct preference optimization) stages.

-
-
-There are a few reasons why I like calling this model v0.1:
-
-1. The model still lags behind the instruction-tuned version of SmolLM2 on some other metrics.
-2. This model has only undergone SFT and DPO; the RLVR (reinforcement learning with verifiable rewards) stage was too computationally expensive to run on a model that could still be improved.
-3. The initial hyperparameter choice during training was naive; through some napkin math I've found a much better learning rate that rescales the one from the Tulu 3 paper to my computational resources.
+Something important to note: this model has only undergone SFT and DPO; the RLVR (reinforcement learning with verifiable rewards) stage was too computationally expensive to run properly.

# Evaluation
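To make the post-training note above more concrete, here is a minimal sketch of what the DPO stage could look like with TRL's `DPOTrainer` applied to the SmolLM2-1.7B base model. The dataset id, hyperparameters, and training setup are placeholders rather than the actual SmolTulu recipe, and the tokenizer argument name varies slightly across TRL versions.

```python
# Minimal DPO sketch (not the SmolTulu recipe): align the SmolLM2-1.7B base
# model on a preference dataset using TRL's DPOTrainer.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "HuggingFaceTB/SmolLM2-1.7B"  # base model named in the README
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Placeholder dataset id: any dataset with "prompt", "chosen", "rejected" columns.
prefs = load_dataset("your-org/your-preference-mixture", split="train")

args = DPOConfig(
    output_dir="smoltulu-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=5e-7,  # placeholder value
    beta=0.1,            # strength of the penalty keeping the policy near the reference model
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=prefs,
    processing_class=tokenizer,  # older TRL versions take tokenizer= instead
)
trainer.train()
```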