SmolTulu-1.7b-Reinforced is the reinforcement learning with verifiable rewards (RLVR) version of [SmolTulu-1.7b-Instruct](https://huggingface.co/SultanR/SmolTulu-1.7b-Instruct), which leverages [AllenAI's Tulu 3 post-training pipeline](https://arxiv.org/abs/2411.15124).

This model achieves the highest current scores on both IFEval and GSM8k while maintaining the extremely low contamination levels of Tulu 3 and SmolLM2! I've listed the dataset used for the RLVR stage, which is the same one used in the Tulu 3 paper.
## Evaluation
I ran these evaluations using [SmolLM2's evaluation code](https://github.com/huggingface/smollm/tree/main/evaluation) for a fairer comparison.
| MMLU-Pro (MCF) | 17.4 | 17.3 | 19.3 | 12.7 | **24.2** | 11.7 |
| PIQA | 72.2 | 72.1 | **74.4** | 72.3 | 73.2 | 71.6 |
## Training Details
The reinforced model used PPO with verifiable rewards:

- Base model: SmolTulu-1.7b-Instruct
- Learning rate: 3e-6
- Total training episodes: 10M
- PPO KL penalty coefficient (beta): 0.05
- Maximum sequence/prompt length: 2048 tokens
- Response length: 2048 tokens
- Rollout batch size: 32
- Minibatch size: 32
- Temperature: 1.0
- Penalty reward: -10.0 for incomplete generations
- DeepSpeed Stage 3 optimization
- Gradient checkpointing enabled
- Training data: RLVR-GSM-MATH-IF-Mixed-Constraints
- Reward model multiplier: 0.0 (pure verifiable rewards)
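To make the "verifiable rewards" idea concrete, here is a minimal, hypothetical Python sketch of how a single rollout could be scored under the settings above (the -10.0 incomplete-generation penalty and a reward model multiplier of 0.0). It is illustrative only, not the actual Tulu 3 / open-instruct implementation, and the +10.0 value for a verified answer plus the GSM8k-style answer parsing are assumptions:

```python
import re


def extract_final_answer(completion: str) -> str | None:
    """Pull a GSM8k-style final answer of the form '#### 42' out of a completion."""
    match = re.search(r"####\s*(-?[\d.,]+)", completion)
    return match.group(1).replace(",", "") if match else None


def verifiable_reward(completion: str, reference_answer: str, finished: bool) -> float:
    """Score one rollout with verifiable rewards only (no reward model).

    Mirrors the settings listed above: incomplete generations receive a flat
    -10.0 penalty, and since the reward model multiplier is 0.0, the only
    positive signal comes from the verifier.
    """
    if not finished:   # e.g. the response hit the 2048-token response limit mid-answer
        return -10.0   # penalty reward for incomplete generations
    predicted = extract_final_answer(completion)
    return 10.0 if predicted == reference_answer else 0.0  # +10.0 is an illustrative value


# A finished, correct answer earns the positive reward; a truncated one is penalized.
print(verifiable_reward("... so the total is #### 42", "42", finished=True))   # 10.0
print(verifiable_reward("Let's think step by st", "42", finished=False))       # -10.0
```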
## Usage
Like any Hugging Face model, you can run it with the transformers library:
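The snippet below is a minimal sketch of that flow; the repo ID `SultanR/SmolTulu-1.7b-Reinforced` and the use of the built-in chat template are assumptions based on this model card:

```python
# Minimal usage sketch; the repo ID below is assumed from this model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SultanR/SmolTulu-1.7b-Reinforced"  # assumed Hugging Face repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a chat-formatted prompt and generate a response.
messages = [{"role": "user", "content": "Natalia sold 48 clips in April and half as many in May. How many clips did she sell in total?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```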