SultanR
/

SmolTulu-1.7b-Reinforced

Text Generation

text-generation-inference

Inference Endpoints

Model card Files Files and versions Community

SmolTulu-1.7b-Reinforced / README.md

SultanR's picture

Update README.md

530b6c0 verified 4 days ago

|

2.91 kB

	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	tags:
	- Tulu3
	- Smollm
	- SLMs
	- Small
	- Huggingface
	- Allenai
	- SFT
	- DPO
	- GGUF
	- RLVR
	- RL
	base_model:
	- SultanR/SmolTulu-1.7b-Instruct
	datasets:
	- allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
	pipeline_tag: text-generation
	---

	# SmolLM2 1.7b Aligned and Reinforced Through Tulu 3!

	![SmolTulu Banner](smoltulubanner.png)

	SmolTulu-1.7b-Reinforced is the reinforcement learning with verifiable rewards (RLVR) version of [SmolTulu-1.7b-Instruct](https://huggingface.co/SultanR/SmolTulu-1.7b-Instruct), which leverages [AllenAI's Tulu 3 post-training pipeline](https://arxiv.org/abs/2411.15124)

	This model scores the highest current score in both IFEval and GSM8k while maintaining the extremely low contamination levels in Tulu 3 and SmolLM2! I've listed the datasets used to do both the RLVR stage, which is the same one mentioned used in the Tulu 3 paper.
	## Evaluation

	I ran these evaluations using [SmolLM2's evaluation code](https://github.com/huggingface/smollm/tree/main/evaluation) for a more fair comparison.


	\| Metric \| SmolTulu-1.7b-Instruct \| SmolTulu-1.7b-Reinforced \| SmolLM2-1.7B-Instruct \| Llama-1B-Instruct \| Qwen2.5-1.5B-Instruct \| SmolLM1-1.7B-Instruct \|
	\|:----------------------------\|:---------------------:\|:---------------------:\|:---------------------:\|:---------------------:\|:---------------------:\|:---------------------:\|
	\| ARC (Average) \| 51.5 \| 51.1 \| 51.7 \| 41.6 \| 46.2 \| 43.7 \|
	\| BBH (3-shot) \| 33.8 \| 33.4 \| 32.2 \| 27.6 \| 35.3 \| 25.7 \|
	\| GSM8K (5-shot) \| 51.6 \| 61.0 \| 48.2 \| 26.8 \| 42.8 \| 4.6 \|
	\| HellaSwag \| 61.1 \| 60.4 \| 66.1 \| 56.1 \| 60.9 \| 55.5 \|
	\| IFEval (Average prompt/inst) \| 67.7 \| 69.3 \| 56.7 \| 53.5 \| 47.4 \| 23.1 \|
	\| MMLU-Pro (MCF) \| 17.4 \| 17.3 \| 19.3 \| 12.7 \| 24.2 \| 11.7 \|
	\| PIQA \| 72.2 \| 72.1 \| 74.4 \| 72.3 \| 73.2 \| 71.6 \|

	## Usage

	Just like any Huggingface model, just run it using the transformers library:

	```python
	# pip install transformers
	from transformers import AutoModelForCausalLM, AutoTokenizer
	checkpoint = "SultanR/SmolTulu-1.7b-Reinforced"
	device = "cuda" # for GPU usage or "cpu" for CPU usage
	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
	model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
	inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
	outputs = model.generate(inputs)
	print(tokenizer.decode(outputs[0]))
	```

	## Citation

	```
	@misc{alrashed2024smoltuluhigherlearningrate,
	title={SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs},
	author={Sultan Alrashed},
	year={2024},
	eprint={2412.08347},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2412.08347},
	}
	```