<h1 align="center">
  <u>C</u>ritique-out-<u>Loud</u> Reward Models (CLoud)
</h1>
<p align="center">
  <img src="assets/CLoud.gif" alt="CLoud"/>
</p>


<p align="center">
| <a href="https://arxiv.org/abs/2408.11791"><b>Paper</b></a> | <a href="https://x.com/ZackAnkner/status/1826607200376336478"> <b>Tweet</b> </a> |
</p>

---
## Introduction

<u>C</u>ritique-out-<u>Loud</u> reward models reason explicitly about the quality of a response by producing Chain-of-Thought-like critiques before predicting a reward.
In classic reward model training, the reward model is a reward head initialized on top of a base LLM.
Lacking generation capabilities, classic reward models act as encoders and must predict rewards within a single forward pass, so any reasoning must happen implicitly.
In contrast, CLoud reward models are trained both to produce explicit critiques and to score responses conditioned on those critique reasoning traces.
CLoud reward models yield large gains in pairwise preference modeling accuracy on RewardBench, and large gains in win rate when used as the scoring model for Best-of-N sampling on ArenaHard.
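To make the contrast concrete, here is a toy sketch of the two scoring interfaces. This is an illustration only, not the repo's implementation: the placeholder "reward heads" and the hard-coded critique are stand-ins for real model computations.

```python
# Toy contrast between classic and CLoud reward scoring.
# The scoring logic below is a deliberately trivial placeholder.

def classic_rm_score(prompt: str, response: str) -> float:
    # Classic RM: a single forward pass maps (prompt, response) directly
    # to a scalar reward; any reasoning is implicit in the pass.
    return float(len(response) > 10)  # placeholder reward head

def cloud_rm_score(prompt: str, response: str):
    # CLoud RM, step 1: generate an explicit critique of the response.
    critique = f"Critique of response to {prompt!r}: checks length only."
    # Step 2: predict the reward conditioned on prompt, response, AND critique.
    reward = float(len(response) > 10) + 0.1 * ("Critique" in critique)
    return critique, reward
```

The key structural difference is that the CLoud interface returns a critique alongside the reward, and the reward depends on that critique.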

## Todo

- [x] Release models and inference examples
- [ ] Post example training run logs
- [ ] Add ArenaHard evaluation code
- [ ] Add VLLM support for inference

## Table of Contents
- [Introduction](#introduction)
- [Todo](#todo)
- [Table of Contents](#table-of-contents)
- [Setup](#setup)
- [Model Weights](#model-weights)
- [Inference](#inference)
- [Dataset](#dataset)
- [Training](#training)
  - [CLoud Training](#cloud-training)
  - [Classic Training](#classic-training)
- [Evaluation](#evaluation)
- [Citation](#citation)

## Setup
```bash
git clone https://github.com/zankner/CLoud
cd CLoud
pip install -e .
```
Optional: base docker image used during development `mosaicml/pytorch:2.3.0_cu121-python3.11-ubuntu20.04`

## Model Weights

| Base Model   |   RM Type      | Hugging Face Repo                                                     |
| ----------  | --------------- |--------------------------------------------------------------------- |
| Llama3-8B   | Classic | [ankner/Llama3-8B-Classic-RM](https://huggingface.co/ankner/Llama3-8B-Classic-RM)   |
| Llama3-8B   |  CLoud | [ankner/Llama3-8B-CLoud-RM](https://huggingface.co/ankner/Llama3-8B-CLoud-RM)   |
| Llama3-70B  |  Classic | [ankner/Llama3-70B-Classic-RM](https://huggingface.co/ankner/Llama3-70B-Classic-RM) |
| Llama3-70B  |  CLoud | [ankner/Llama3-70B-CLoud-RM](https://huggingface.co/ankner/Llama3-70B-CLoud-RM) |

## Inference
We provide a gradio demo, which can be run as follows: `gradio cloud/demo.py`. By default this demos `ankner/Llama3-8B-CLoud-RM`; to use a different model, change the model loaded in the script.

If you want to perform inference on your own data, please refer to the following example:
```python
from cloud.model import CLoudRewardModel
from transformers import AutoTokenizer

model_name = "ankner/Llama3-8B-CLoud-RM" # Replace with RM trained with this repo
model = CLoudRewardModel.from_pretrained(model_name, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

user_prompt = [
  "Write me a story", 
  "What is the capital of the moon?"
]
assistant_response = [
  "No I don't want to do that.", 
  "Since the moon is made out of cheese, the capital is mozzarella."
]

rewards, critiques = model.predict_reward(user_prompt, assistant_response, tokenizer)

for reward, critique in zip(rewards, critiques):
    print("Critique:")
    print(critique)
    print("Reward:")
    print(reward)
    print("=" * 100)
```
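The rewards returned by `predict_reward` can drive Best-of-N sampling as evaluated in the paper: score N candidate responses and keep the highest-rewarded one. A minimal, model-agnostic sketch, where `score` is a hypothetical stand-in for any reward-model call:

```python
def best_of_n(prompt, candidates, score):
    """Return the candidate response with the highest reward under score."""
    return max(candidates, key=lambda response: score(prompt, response))

# Usage with a toy scoring function (longer responses win); with a real RM,
# score would wrap a call like model.predict_reward.
pick = best_of_n(
    "Write me a story",
    ["No.", "Once upon a time, ..."],
    score=lambda prompt, response: len(response),
)
```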

## Dataset

We provide code to reconstruct the datasets used in the paper.
There are two training datasets to build: one with oracle critiques meant to simulate human feedback, and one with self-generated critiques.
To build the oracle critique dataset run:
```bash
python cloud/data/build_official_ultra_llama.py --mode oracle
```
To build the self-generated critique dataset run:
```bash
python cloud/data/build_official_ultra_llama.py --mode self-gen --model-size {model-size}
```
where ```{model-size}``` is the size of the model you are using (e.g. 8b, 70b).

<details>
<summary>Build your own dataset from scratch</summary>

1. <b>Build prompts</b> - You can use any dataset you like as long as it has ```prompt``` and ```id``` columns. If you would like to build prompts from UltraFeedback and UltraInteract as we do in the paper run:
    ```bash
    python cloud/data/build_ultra_prompts.py --save-name {name-to-save-as}
    ```
2. <b>Build chosen / rejected responses</b>
    ```bash
    python cloud/data/build_judgements.py --gen-model {model-generating-responses} --judge-model {model-judging-responses} --base-dataset {path-to-prompt-dataset} --save-name {name-to-save-as}
    ```
    The above command requires hosted generation and judge models. To host the models using vllm, run:
    ```bash
    python -m vllm.entrypoints.openai.api_server --model {path-to-gen/judge-model} --dtype bfloat16 --tensor-parallel-size {num-gpus} --port {8000 for gen and 8001 for judge}
    ```
3. <b>Build critiques</b>
    ```bash
    python cloud/data/generate_oracle_critiques.py --judge-model {model-generating-critiques} --base-dataset {path-to-responses-dataset} --save-name {name-to-save-as}
    ```
    Again, this command assumes a hosted critique model, which you can serve with the vllm command above (this time use port 8000 for the judge model).

</details>
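Step 2 above turns judge scores over candidate responses into chosen/rejected preference pairs. The exact selection logic lives in `cloud/data/build_judgements.py`; one common recipe, shown here purely as an assumption, is to pair the highest-scored response with the lowest-scored one:

```python
def make_preference_pair(responses, judge_scores):
    # Rank responses by judge score; pair the top-scored response (chosen)
    # with the bottom-scored one (rejected). This selection rule is an
    # illustrative assumption, not necessarily the repo's exact logic.
    ranked = sorted(zip(judge_scores, responses), key=lambda pair: pair[0])
    return {"chosen": ranked[-1][1], "rejected": ranked[0][1]}
```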



## Training


Before training, you must run the [setup script](#setup) and build the [datasets](#dataset).
The training configs are located in the ```cloud/train/configs/``` folder.
We have already set the optimal hyperparameters that we found for each model as reported in the paper.
The only parameter you need to set is ```variables.micro_batch_size```; choose it according to your GPU memory.

If you want to log the training runs, uncomment the ```loggers``` section in the config and fill in your wandb settings.

Checkpoints are saved throughout training to the directory given by the ```save_folder``` parameter, which is ```ckpts/${variables.run_name}``` by default. The final checkpoint contains an ```hf``` folder with the model saved in Hugging Face format.


> **Warning**: The training scripts below for both CLoud and Classic default to the datasets we release. To train on your own dataset, build it following the directions in the [dataset section](#dataset) and change the ```variables.dataset_path``` parameter in the training configs.



### CLoud Training

1. The first step is to finetune the base model to produce critiques:

    ```bash
    composer -n {num_gpus} cloud/train/train.py cloud/train/configs/{model_size}_critique_sft.yaml
    ```
    Replace ```{model_size}``` with the size of the model you are training (e.g. 8b, 70b).

2. (Optional: skip this step if you use the self-generated data we release.) After the critique SFT model is trained, regenerate the dataset with its critiques.
    You first need to serve the critique SFT model; to do so locally using vllm, run:
    ```bash
    python -m vllm.entrypoints.openai.api_server --model {path-to-critique-sft-model} --dtype bfloat16 --tensor-parallel-size {num-gpus}
    ```
    Then run the data building script:
    ```bash
    python cloud/data/generate_self_critiques.py --model {path-to-critique-sft-model} --base-dataset {path-to-base-dataset} --upload-name {path-to-save-dataset}
    ```

3. After building the self-generated dataset, we can train the CLoud model:
    ```bash
    composer -n {num_gpus} cloud/train/train.py cloud/train/configs/{model_size}_cloud.yaml
    ```

### Classic Training

To train a classic reward model, you can use the following command:
```bash
composer -n {num_gpus} cloud/train/train.py cloud/train/configs/{model_size}_classic.yaml
```

## Evaluation

To run evaluation for a given benchmark, run the following command:
```bash
python cloud/eval/eval.py --model-path {path-to-model} --benchmark {benchmark-name}
```
Currently, we only support the RewardBench benchmark.
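RewardBench-style evaluation reduces to pairwise preference accuracy: the fraction of pairs in which the chosen response receives a higher reward than the rejected one. A minimal sketch of the metric (the evaluation harness in `cloud/eval/eval.py` handles batching and model calls):

```python
def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response out-scores the rejected one."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    wins = sum(chosen > rejected for chosen, rejected in pairs)
    return wins / len(pairs)
```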

## Citation
If you found our work useful, please consider citing it:
```bibtex
@misc{ankner2024critiqueoutloudrewardmodels,
      title={Critique-out-Loud Reward Models}, 
      author={Zachary Ankner and Mansheej Paul and Brandon Cui and Jonathan D. Chang and Prithviraj Ammanabrolu},
      year={2024},
      eprint={2408.11791},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.11791}, 
}
```