---
license: apache-2.0
language:
  - en
library_name: transformers
tags:
  - Tulu3
  - Smollm
  - SLMs
  - Small
  - Huggingface
  - Allenai
  - SFT
  - DPO
  - GGUF
  - RLVR
  - RL
base_model:
  - SultanR/SmolTulu-1.7b-Instruct
datasets:
  - allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
pipeline_tag: text-generation
---

# SmolLM2 1.7b Aligned and Reinforced Through Tulu 3!

*SmolTulu banner*

SmolTulu-1.7b-Reinforced is the reinforcement learning with verifiable rewards (RLVR) version of SmolTulu-1.7b-Instruct, which leverages AllenAI's Tulu 3 post-training pipeline.

This model achieves the highest current scores on both IFEval and GSM8K while maintaining the extremely low contamination levels of Tulu 3 and SmolLM2. The dataset used for the RLVR stage, the same mixture used in the Tulu 3 paper, is listed in the metadata above.

## Evaluation

I ran these evaluations using SmolLM2's evaluation code for a fairer comparison.

| Metric | SmolTulu-1.7b-Instruct | SmolTulu-1.7b-Reinforced | SmolLM2-1.7B-Instruct | Llama-1B-Instruct | Qwen2.5-1.5B-Instruct | SmolLM1-1.7B-Instruct |
|---|---|---|---|---|---|---|
| ARC (Average) | 51.5 | 51.1 | 51.7 | 41.6 | 46.2 | 43.7 |
| BBH (3-shot) | 33.8 | 33.4 | 32.2 | 27.6 | 35.3 | 25.7 |
| GSM8K (5-shot) | 51.6 | 61.0 | 48.2 | 26.8 | 42.8 | 4.6 |
| HellaSwag | 61.1 | 60.4 | 66.1 | 56.1 | 60.9 | 55.5 |
| IFEval (Average prompt/inst) | 67.7 | 69.3 | 56.7 | 53.5 | 47.4 | 23.1 |
| MMLU-Pro (MCF) | 17.4 | 17.3 | 19.3 | 12.7 | 24.2 | 11.7 |
| PIQA | 72.2 | 72.1 | 74.4 | 72.3 | 73.2 | 71.6 |

## Training Details

The reinforced model was trained with PPO using verifiable rewards (a sketch of such a reward function follows the hyperparameter list below):

- Base model: SmolTulu-1.7b-Instruct
- Learning rate: 3e-6
- Total training episodes: 10M
- PPO KL penalty coefficient (beta): 0.05
- Maximum sequence/prompt length: 2048 tokens
- Response length: 2048 tokens
- Rollout batch size: 32
- Minibatch size: 32
- Temperature: 1.0
- Penalty reward: -10.0 for incomplete generations
- DeepSpeed Stage 3 optimization
- Gradient checkpointing enabled
- Training data: RLVR-GSM-MATH-IF-Mixed-Constraints
- Reward model multiplier: 0.0 (pure verifiable rewards)
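
To give an intuition for "verifiable rewards", here is a minimal sketch of what such a reward can look like for GSM8K-style math problems: a binary correctness check against the reference answer plus the -10.0 penalty for incomplete generations, with no learned reward model involved (reward model multiplier 0.0). The function names and answer-extraction heuristic are illustrative assumptions, not code from the Tulu 3 / open-instruct implementation.

```python
import re

PENALTY_REWARD = -10.0  # applied to incomplete generations, matching the setting above

def extract_final_number(text: str) -> str | None:
    """Pull the last number out of a GSM8K-style solution string (illustrative heuristic)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def verifiable_reward(completion: str, reference_answer: str, finished: bool) -> float:
    """Binary correctness reward; truncated generations get the penalty instead."""
    if not finished:  # generation hit the length limit without producing an EOS token
        return PENALTY_REWARD
    predicted = extract_final_number(completion)
    return 1.0 if predicted is not None and predicted == reference_answer.strip() else 0.0

# A correct final answer earns 1.0; a wrong or missing one earns 0.0.
print(verifiable_reward("Half of 84 is 42, so the answer is 42.", "42", finished=True))  # 1.0
print(verifiable_reward("The answer is 41.", "42", finished=True))                       # 0.0
```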

## Usage

Like any Hugging Face model, you can run it with the transformers library:

```python
# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "SultanR/SmolTulu-1.7b-Reinforced"
device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
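
Since this is an instruction-tuned checkpoint, you will likely get better results by formatting prompts with the tokenizer's chat template. A minimal sketch, continuing from the snippet above and assuming the checkpoint ships a chat template:

```python
# Sketch: prompt through the chat template rather than raw text (assumes the
# tokenizer for this checkpoint defines one, as Tulu-style instruct models usually do).
messages = [{"role": "user", "content": "What is gravity?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```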

## Citation

```bibtex
@misc{alrashed2024smoltuluhigherlearningrate,
      title={SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs},
      author={Sultan Alrashed},
      year={2024},
      eprint={2412.08347},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.08347},
}
```

The training methodology follows the Tulu 3 paper:

```bibtex
@article{lambert2024tulu3,
  title={TÜLU 3: Pushing Frontiers in Open Language Model Post-Training},
  author={Lambert, Nathan and Morrison, Jacob and Pyatkin, Valentina and others},
  year={2024},
  journal={arXiv preprint arXiv:2411.15124}
}
```