This is the SFT checkpoint used for the RLHFlow/Online-RLHF project.
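
The checkpoint can be loaded with the standard transformers interface. The snippet below is a minimal inference sketch, assuming the chat template ships with the tokenizer (as is standard for Llama-3-based chat models); the prompt text is only an example.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/LLaMA3-SFT-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat prompt with the tokenizer's bundled chat template.
messages = [{"role": "user", "content": "What is supervised fine-tuning?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding; sampling parameters are left at their defaults.
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))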

The model is trained from meta-llama/Meta-Llama-3-8B on RLHFlow/RLHFlow-SFT-Dataset-ver2 for 2 epochs, with a global batch size of 128 and a learning rate of 2e-5. The samples are packed and split into chunks of 8192 tokens. See more training details at https://github.com/RLHFlow/Online-RLHF/blob/main/sft/llama3-8b-it.yaml.
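
For reference, the packing step can be sketched as below. This is an illustrative snippet, not the actual training code (the function name and the EOS-based separator are assumptions; the real implementation is in the linked repository).

# Sketch of sample packing: concatenate tokenized samples into one stream,
# then split the stream into fixed-length chunks of 8192 tokens.
CHUNK_LEN = 8192

def pack_and_chunk(tokenized_samples, eos_token_id):
    stream = []
    for ids in tokenized_samples:            # each sample is a list of token ids
        stream.extend(ids + [eos_token_id])  # assumption: samples separated by EOS
    n_full = len(stream) // CHUNK_LEN        # drop the trailing partial chunk
    return [stream[i * CHUNK_LEN:(i + 1) * CHUNK_LEN] for i in range(n_full)]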

Academic Benchmarks

We use the ToRA script to evaluate GSM8K and MATH, EvalPlus for HumanEval, and lm-evaluation-harness for the other benchmarks. The model is evaluated in a zero-shot setting.
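
A zero-shot harness run can be reproduced roughly as follows. This is a sketch against the v0.4-style lm-evaluation-harness API; the task identifiers are assumptions and may differ from the exact ones we used, so check your installed harness.

# Rough zero-shot evaluation sketch (lm-evaluation-harness v0.4-style API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=RLHFlow/LLaMA3-SFT-v2,dtype=bfloat16",
    tasks=["mmlu", "truthfulqa_mc2", "arc_challenge"],  # assumed task names
    num_fewshot=0,  # zero-shot, as in the table below
)
print(results["results"])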

Model                 | Size | Method     | LC AlpacaEval | MT-Bench | GSM-8K | MATH | MMLU | HumanEval | TruthfulQA | ARC
LLaMA-3-8B-it         | 8B   | RS+DPO+PPO | 22.9          | 8.16     | 79.6   | 26.3 | 66.0 | 61.6      | 43.9       | 59.5
RLHFlow/LLaMA3-SFT    | 8B   | SFT        | 10.2          | 7.69     | 74.2   | 30.0 | 64.6 | 63.4      | 53.5       | 58.6
RLHFlow/LLaMA3-SFT-v2 | 8B   | SFT        | 12.66         | -        | 83.4   | 41.1 | 64.8 | 66.5      | 53.9       | 60.0

Citation

Please cite our technical report if you find our model useful for your research or product.

@misc{dong2024rlhf,
      title={RLHF Workflow: From Reward Modeling to Online RLHF}, 
      author={Hanze Dong and Wei Xiong and Bo Pang and Haoxiang Wang and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
      year={2024},
      eprint={2405.07863},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}