Update README.md
README.md CHANGED
@@ -15,6 +15,8 @@ tags:
 - Small
 - Huggingface
 - Allenai
+- SFT
+- DPO
 pipeline_tag: text-generation
 ---

@@ -22,17 +24,17 @@ pipeline_tag: text-generation

 ![SmolTulu Banner](smoltulubannerv0.png)

-SmolTulu-v0 is the first model in a series of models meant to leverage [AllenAI's Tulu 3 post-training pipeline](https://allenai.org/blog/tulu-3-technical) to tune the [base version of Huggingface's SmolLM2-1.7b](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B)! The post training pipeline AllenAI came up with seemed like something perfect to apply here.
+SmolTulu-v0.1 is the first model in a series of models meant to leverage [AllenAI's Tulu 3 post-training pipeline](https://allenai.org/blog/tulu-3-technical) to tune the [base version of Huggingface's SmolLM2-1.7b](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B)! The post-training pipeline AllenAI came up with seemed like a perfect fit to apply here.

 This model achieves the highest current IFEval score while maintaining the extremely low contamination levels of Tulu 3 and SmolLM2! I've listed the datasets used for both the SFT (supervised finetuning) and DPO (direct preference optimization) stages.

-## Why v0?
+## Why v0.1?

-There's a few reasons on why I
+There are a few reasons why I call this model v0.1:

-1. The model still lags behind the instruction tuned version of SmolLM2 in
+1. The model still lags behind the instruction-tuned version of SmolLM2 in some other metrics.
 2. This model has only undergone SFT and DPO; the RLVR (reinforcement learning with verifiable rewards) stage was too computationally expensive to run on a model that could still be improved.
-3. Initial hyperparameter choice was naive, through some napkin math I've been able to find a much better learning rate that scales the one found in the Tulu 3 paper according to my computational resources better.
+3. The initial hyperparameter choice during training was naive; through some napkin math I've found a much better learning rate that scales the one from the Tulu 3 paper to my computational resources (see the sketch below).

 # Evaluation

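To make point 3 a bit more concrete, here is a minimal sketch of the kind of napkin math meant there. It assumes the common linear scaling rule (rescale a reference learning rate by the ratio of effective batch sizes); the rule and all numbers below are illustrative assumptions, not the actual settings from the Tulu 3 paper or from this training run.

```python
# Hypothetical sketch of the learning-rate "napkin math" referenced in point 3.
# Assumption: the linear scaling rule, i.e. scale a reference learning rate by the
# ratio of effective batch sizes (per-device batch * gradient accumulation * devices).
# All numbers are made-up placeholders, not the real Tulu 3 or SmolTulu settings.

def scale_lr(reference_lr: float, reference_batch: int, my_batch: int) -> float:
    """Linearly rescale a reference learning rate to a different effective batch size."""
    return reference_lr * (my_batch / reference_batch)

# Example: a learning rate tuned for an effective batch of 128, reused on a setup
# whose effective batch is only 8.
print(scale_lr(reference_lr=5e-6, reference_batch=128, my_batch=8))  # -> 3.125e-07
```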