SultanR committed on
Commit
2116ae7
1 Parent(s): f999641

Update README.md

Files changed (1)
README.md (+19 -0)
README.md CHANGED
@@ -29,6 +29,7 @@ pipeline_tag: text-generation
  SmolTulu-1.7b-Reinforced is the reinforcement learning with verifiable rewards (RLVR) version of [SmolTulu-1.7b-Instruct](https://huggingface.co/SultanR/SmolTulu-1.7b-Instruct), which leverages [AllenAI's Tulu 3 post-training pipeline](https://arxiv.org/abs/2411.15124).

  This model achieves the highest current scores in both IFEval and GSM8k while maintaining the extremely low contamination levels of Tulu 3 and SmolLM2! I've also listed the dataset used for the RLVR stage, which is the same one used in the Tulu 3 paper.
+
  ## Evaluation

  I ran these evaluations using [SmolLM2's evaluation code](https://github.com/huggingface/smollm/tree/main/evaluation) for a fairer comparison.
@@ -44,6 +45,24 @@ I ran these evaluations using [SmolLM2's evaluation code](https://github.com/hug
  | MMLU-Pro (MCF) | 17.4 | 17.3 | 19.3 | 12.7 | **24.2** | 11.7 |
  | PIQA | 72.2 | 72.1 | **74.4** | 72.3 | 73.2 | 71.6 |

+ ## Training Details
+
+ The reinforced model used PPO with verifiable rewards (see the sketch after this list):
+ - Base model: SmolTulu-1.7b-Instruct
+ - Learning rate: 3e-6
+ - Total training episodes: 10M
+ - PPO KL penalty coefficient (beta): 0.05
+ - Maximum sequence/prompt length: 2048 tokens
+ - Response length: 2048 tokens
+ - Rollout batch size: 32
+ - Minibatch size: 32
+ - Temperature: 1.0
+ - Penalty reward: -10.0 for incomplete generations
+ - DeepSpeed Stage 3 optimization
+ - Gradient checkpointing enabled
+ - Training data: RLVR-GSM-MATH-IF-Mixed-Constraints
+ - Reward model multiplier: 0.0 (pure verifiable rewards)
+
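To make the reward setup concrete, here is a minimal sketch of how a verifiable reward with an incompleteness penalty can be computed. Only the -10.0 penalty and the absence of a learned reward model come from the settings above; the +10.0 value for a verified answer and the helper names are illustrative assumptions, not the actual Tulu 3 implementation.

```python
# Minimal sketch of an RLVR-style verifiable reward (illustrative; not the
# actual Tulu 3 / open-instruct code).
import re

VERIFIED_REWARD = 10.0      # assumed value for a verified-correct answer
INCOMPLETE_PENALTY = -10.0  # from the training settings above

def extract_final_answer(completion: str) -> str | None:
    """Hypothetical heuristic: take the last number in a GSM8k-style completion."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def verifiable_reward(completion: str, gold_answer: str, finished: bool) -> float:
    # Generations cut off by the length limit are penalized outright.
    if not finished:
        return INCOMPLETE_PENALTY
    # Reward comes only from an exact, checkable match; no learned reward
    # model contributes (reward model multiplier 0.0 above).
    return VERIFIED_REWARD if extract_final_answer(completion) == gold_answer else 0.0
```

During PPO this per-sequence score would typically be combined with a per-token KL penalty against the reference policy, scaled by the beta of 0.05 listed above.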
  ## Usage

  Like any Hugging Face model, you can run it using the transformers library:
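The diff context ends before the README's own snippet, so here is a minimal usage sketch, assuming the repo id `SultanR/SmolTulu-1.7b-Reinforced` (inferred from the model name) and a standard chat template:

```python
# Minimal usage sketch; the repo id is inferred from the model name and may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SultanR/SmolTulu-1.7b-Reinforced"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "A train travels 60 km in 45 minutes. What is its average speed in km/h?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Decode only the newly generated tokens, skipping the prompt.
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```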