SultanR committed on
Commit
2116ae7
1 Parent(s): f999641

Update README.md

Files changed (1)
README.md (+19 -0)
README.md CHANGED
@@ -29,6 +29,7 @@ pipeline_tag: text-generation
  SmolTulu-1.7b-Reinforced is the reinforcement learning with verifiable rewards (RLVR) version of [SmolTulu-1.7b-Instruct](https://huggingface.co/SultanR/SmolTulu-1.7b-Instruct), which leverages [AllenAI's Tulu 3 post-training pipeline](https://arxiv.org/abs/2411.15124).

  This model achieves the highest current scores in both IFEval and GSM8k while maintaining the extremely low contamination levels of Tulu 3 and SmolLM2! I've also listed the dataset used for the RLVR stage, which is the same one used in the Tulu 3 paper.
+
  ## Evaluation

  I ran these evaluations using [SmolLM2's evaluation code](https://github.com/huggingface/smollm/tree/main/evaluation) for a fairer comparison.
@@ -44,6 +45,24 @@ I ran these evaluations using [SmolLM2's evaluation code](https://github.com/hug
  | MMLU-Pro (MCF) | 17.4 | 17.3 | 19.3 | 12.7 | **24.2** | 11.7 |
  | PIQA | 72.2 | 72.1 | **74.4** | 72.3 | 73.2 | 71.6 |

+ ## Training Details
+
+ The reinforced model used PPO with verifiable rewards (see the sketch after this list):
+ - Base model: SmolTulu-1.7b-Instruct
+ - Learning rate: 3e-6
+ - Total training episodes: 10M
+ - PPO KL penalty coefficient (beta): 0.05
+ - Maximum sequence/prompt length: 2048 tokens
+ - Response length: 2048 tokens
+ - Rollout batch size: 32
+ - Minibatch size: 32
+ - Temperature: 1.0
+ - Penalty reward: -10.0 for incomplete generations
+ - DeepSpeed Stage 3 optimization
+ - Gradient checkpointing enabled
+ - Training data: RLVR-GSM-MATH-IF-Mixed-Constraints
+ - Reward model multiplier: 0.0 (pure verifiable rewards)
+
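To make the reward setup concrete, here is a minimal sketch of how a verifiable reward with an incompleteness penalty can be computed. Only the -10.0 penalty and the absence of a learned reward model come from the settings above; the +10.0 value for a verified answer and the helper names are illustrative assumptions, not the actual Tulu 3 implementation.

```python
# Minimal sketch of an RLVR-style verifiable reward (illustrative; not the
# actual Tulu 3 / open-instruct code).
import re

VERIFIED_REWARD = 10.0      # assumed value for a verified-correct answer
INCOMPLETE_PENALTY = -10.0  # from the training settings above

def extract_final_answer(completion: str) -> str | None:
    """Hypothetical heuristic: take the last number in a GSM8k-style completion."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def verifiable_reward(completion: str, gold_answer: str, finished: bool) -> float:
    # Generations cut off by the length limit are penalized outright.
    if not finished:
        return INCOMPLETE_PENALTY
    # Reward comes only from an exact, checkable match; no learned reward
    # model contributes (reward model multiplier 0.0 above).
    return VERIFIED_REWARD if extract_final_answer(completion) == gold_answer else 0.0
```

During PPO this per-sequence score would typically be combined with a per-token KL penalty against the reference policy, scaled by the beta of 0.05 listed above.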
  ## Usage

  Like any Hugging Face model, you can run it using the transformers library:
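The diff context ends before the README's own snippet, so here is a minimal usage sketch, assuming the repo id `SultanR/SmolTulu-1.7b-Reinforced` (inferred from the model name) and a standard chat template:

```python
# Minimal usage sketch; the repo id is inferred from the model name and may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SultanR/SmolTulu-1.7b-Reinforced"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "A train travels 60 km in 45 minutes. What is its average speed in km/h?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Decode only the newly generated tokens, skipping the prompt.
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```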