|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
library_name: transformers |
|
tags: |
|
- Tulu3 |
|
- Smollm |
|
- SLMs |
|
- Small |
|
- Huggingface |
|
- Allenai |
|
- Reward Model |
|
- RLVR |
|
- RM |
|
- Reward |
|
base_model: |
|
- SultanR/SmolTulu-1.7b-Instruct |
|
datasets: |
|
- allenai/llama-3.1-tulu-3-8b-preference-mixture |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
# SmolLM2 1.7b Reward Model for RLVR Through Tulu 3! |
|
|
|
![SmolTulu Banner](smoltulubanner.png) |
|
|
|
SmolTulu-1.7b-RM is the reward model used to initialize the value function for [SmolTulu-1.7b-Reinforced](https://huggingface.co/SultanR/SmolTulu-1.7b-Reinforced), which leverages [AllenAI's Tulu 3 post-training pipeline](https://arxiv.org/abs/2411.15124) for reinforcement learning with verifiable rewards (RLVR). This model was trained using the same preference datasets and methodology as outlined in the Tulu 3 paper, adapted for the smaller model size. |
|
|
|
## Evaluation |
|
|
|
Evaluation results comparing SmolTulu-1.7b-RM against the Tulu 3 8B reward model on RewardBench (RB) and the UltraFeedback (UFB) evaluation set:
|
|
|
| Metric | SmolTulu-1.7b-RM | Tulu 3 8B RM |
|
|:-----------|:----------------:|:-------------:| |
|
| RB Chat | *94.13* | **96.27** | |
|
| RB Chat Hard | 43.64 | **55.92** | |
|
| RB Safety | *75.54* | **84.05** | |
|
| RB Reasoning | *68.01* | **76.50** | |
|
| RB Average | *72.43* | **81.34** | |
|
| UFB | *73.17* | **77.34** | |
|
|
|
As expected, the 1.7B reward model trails the larger 8B model, but it remains reasonably strong across evaluation categories, particularly chat quality assessment.
|
|
|
## Usage |
|
|
|
The reward model can be used with the transformers library: |
|
|
|
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "SultanR/SmolTulu-1.7b-RM"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(device)
model.eval()

# Example of computing the reward for a completion. The conversation is formatted
# with the tokenizer's chat template so the input matches the chat format used
# during preference training.
def get_reward(prompt: str, completion: str) -> float:
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": completion},
    ]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
    with torch.no_grad():
        # Single-label classification head: logits has shape (1, 1).
        return model(input_ids=input_ids).logits[0].item()
```
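
Since higher rewards indicate preferred responses, the same helper can be used to rank candidate completions for a prompt. The prompt and completions below are purely illustrative:

```python
prompt = "What is the capital of France?"
candidates = [
    "The capital of France is Paris.",
    "I'm not sure, possibly Lyon or Marseille.",
]

# Score each candidate and pick the one the reward model prefers.
scores = {c: get_reward(prompt, c) for c in candidates}
print(scores)
print("Preferred completion:", max(scores, key=scores.get))
```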
|
|
|
## Training Details |
|
|
|
The reward model was trained with the following settings (a sketch of the pairwise preference loss follows this list):
|
- Base model: SmolTulu-1.7b-Instruct |
|
- Mixed precision: bfloat16 |
|
- Learning rate: 4e-5 |
|
- Effective batch size: 4 |
|
- Maximum sequence length: 2048 tokens |
|
- Maximum prompt length: 2048 tokens |
|
- Training epochs: 1 |
|
- Training data: Tulu 3 8B preference mixture |
|
- Evaluation data: UltraFeedback (cleaned) |
|
- Gradient checkpointing enabled |
|
- DeepSpeed Zero-3 for memory optimization |
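
For reference, reward model training in the Tulu 3 recipe optimizes the standard Bradley-Terry pairwise objective over chosen/rejected preference pairs. The following is only a minimal sketch of that loss, not the exact training code; `rewards_chosen` and `rewards_rejected` stand for the scalar model outputs on the preferred and rejected responses of each pair:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(rewards_chosen: torch.Tensor, rewards_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Toy example with three preference pairs: the loss shrinks as the margin
# between chosen and rejected rewards grows.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.5, -1.0])
print(pairwise_rm_loss(chosen, rejected))
```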
|
|
|
## Citation |
|
|
|
```bibtex
|
@misc{alrashed2024smoltuluhigherlearningrate, |
|
title={SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs}, |
|
author={Sultan Alrashed}, |
|
year={2024}, |
|
eprint={2412.08347}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2412.08347}, |
|
} |
|
``` |
|
|
|
The training methodology follows the Tulu 3 paper: |
|
|
|
```bibtex
|
@article{lambert2024tulu3, |
|
title={TÜLU 3: Pushing Frontiers in Open Language Model Post-Training}, |
|
author={Lambert, Nathan and Morrison, Jacob and Pyatkin, Valentina and others}, |
|
year={2024}, |
|
journal={arXiv preprint arXiv:2411.15124} |
|
} |
|
``` |