---
library_name: transformers
tags: []
---

This is a process-supervised reward model (PRM) trained on Mistral-generated data from the project [RLHFlow/RLHF-Reward-Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling).

The model is trained from [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on [RLHFlow/Deepseek-PRM-Data](https://huggingface.co/datasets/RLHFlow/Deepseek-PRM-Data) for 1 epoch. We use a global batch size of 32 and a learning rate of 2e-6, packing the samples and splitting them into chunks of 8192 tokens. See the full training configuration at https://github.com/RLHFlow/Online-RLHF/blob/main/math/llama-3.1-prm.yaml.
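
For illustration, here is a minimal sketch of the sample packing described above. The chunk size matches the card (8192 tokens), but the helper function and variable names are hypothetical and are not taken from the released training code.

```python
# Hypothetical sketch of packing tokenized samples into fixed 8192-token chunks,
# as described above; this is NOT the actual RLHFlow training code.
from transformers import AutoTokenizer

MAX_LEN = 8192  # chunk length used for packing, per the model card

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def pack_and_chunk(texts, max_len=MAX_LEN):
    """Concatenate tokenized samples into one stream and split it into max_len chunks."""
    stream = []
    for text in texts:
        stream.extend(tokenizer(text, add_special_tokens=False)["input_ids"])
        stream.append(tokenizer.eos_token_id)  # mark the boundary between packed samples
    # Drop the trailing partial chunk for simplicity.
    return [stream[i:i + max_len] for i in range(0, len(stream) - max_len + 1, max_len)]
```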


## BoN evaluation results for the Mistral generator

| Model      | Method     | GSM8K     | MATH |
| ------------- | ------------- | ------------- | -------- |
| Mistral-7B | Pass@1 | 77.9 |  28.4   |
| Mistral-7B | Majority Voting@1024 | 84.2 | 36.8  |
| Mistral-7B | Mistral-ORM@1024 | 90.1 | 43.6 |
| Mistral-7B | Mistral-PRM@1024 | 92.4 | 46.3 |

## Scaling inference sampling to N=1024 for the Deepseek generator

| Model         | Method                    | GSM8K | MATH |
| ------------- | ------------- | ------------- | -------- |
| Deepseek-7B | Pass@1 | 83.9 | 38.4 |
| Deepseek-7B | Majority Voting@1024 | 89.7 | 57.4  |
| Deepseek-7B | Deepseek-ORM@1024 | 93.4 | 52.4 |
| Deepseek-7B | Deepseek-PRM@1024 | 93.0 | 58.1 |
| Deepseek-7B | Mistral-ORM@1024 (OOD) | 90.3 | 54.9 |
| Deepseek-7B | Mistral-PRM@1024 (OOD) | 91.9 | 56.9 |
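
The numbers above come from best-of-N selection: the generator samples N candidate solutions and the reward model picks one. The sketch below shows one common way to do this with a PRM (aggregate per-step scores with a minimum and keep the highest-scoring candidate); the aggregation rule and function names are illustrative assumptions, not necessarily the exact procedure behind these tables.

```python
# Illustrative best-of-N (BoN) selection with a PRM. The min-aggregation over step
# scores is an assumption for this sketch, not necessarily what produced the tables above.
from typing import Callable, List

def select_best_of_n(
    candidates: List[List[str]],                       # each candidate is a list of reasoning steps
    score_steps: Callable[[List[str]], List[float]],   # PRM: steps -> per-step scores in [0, 1]
) -> int:
    """Return the index of the candidate whose weakest step scores highest."""
    best_idx, best_score = -1, float("-inf")
    for i, steps in enumerate(candidates):
        agg = min(score_steps(steps))  # a solution is only as good as its weakest step
        if agg > best_score:
            best_idx, best_score = i, agg
    return best_idx
```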

## Visualization


![image/png](https://cdn-uploads.huggingface.co/production/uploads/643e59806db6ba8c5ee123f3/i622m76fvKv8drLmwl8Q3.png)

## Usage 

See https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/main/math-rm for detailed examples. 
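
As a quick orientation, the snippet below is a minimal sketch of step-wise scoring in the style of the math-rm examples: each reasoning step is appended as a user turn, the assistant answers "+", and the step score is the probability the model assigns to "+" over "-" at that position. The repository id, the example problem, and the `-3` logit index are assumptions here; follow the linked repository for the exact prompt format used with this checkpoint.

```python
# Minimal sketch of step-wise PRM scoring, assuming the RLHFlow math-rm convention of
# reading the probability of a "+" token after each step. See the linked repo for the
# exact prompt format; the repo id and logit index below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLHFlow/Llama3.1-8B-PRM-Deepseek-Data"  # assumed id; replace with this model's repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

plus_id = tokenizer.encode("+", add_special_tokens=False)[-1]
minus_id = tokenizer.encode("-", add_special_tokens=False)[-1]

question = "Janet has 3 apples and buys 2 more. How many apples does she have?"
steps = ["Janet starts with 3 apples.", "3 + 2 = 5, so she has 5 apples."]

conversation, step_scores = [], []
for i, step in enumerate(steps):
    content = f"{question} {step}" if i == 0 else step
    conversation.append({"role": "user", "content": content})
    conversation.append({"role": "assistant", "content": "+"})
    input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt")
    with torch.no_grad():
        # Position of the "+"/"-" judgment token; -3 accounts for the template's
        # trailing special tokens (an assumption borrowed from the math-rm examples).
        logits = model(input_ids).logits[0, -3]
        prob_plus = logits[[plus_id, minus_id]].softmax(dim=-1)[0].item()
    step_scores.append(prob_plus)

print(step_scores)  # one score in [0, 1] per reasoning step
```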

## Citation

The automatic annotation was proposed in the Math-shepherd paper:

```
@inproceedings{wang2024math,
  title={Math-shepherd: Verify and reinforce llms step-by-step without human annotations},
  author={Wang, Peiyi and Li, Lei and Shao, Zhihong and Xu, Runxin and Dai, Damai and Li, Yifei and Chen, Deli and Wu, Yu and Sui, Zhifang},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={9426--9439},
  year={2024}
}

```

If you find the training recipe useful, please consider citing it as follows.

```
@misc{xiong2024rlhflowmath,
  author = {Wei Xiong and Hanning Zhang and Nan Jiang and Tong Zhang},
  title = {An Implementation of Generative PRM},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/RLHFlow/RLHF-Reward-Modeling}}
}
```