model rename
README.md
CHANGED
@@ -14,7 +14,7 @@ tags:
 <img src="https://github.com/InternLM/InternLM/assets/22529082/b9788105-8892-4398-8b47-b513a292378e" width="200"/>
 <div> </div>
 <div align="center">
-<b><font size="5">
+<b><font size="5">InternLM2-7B-Reward</font></b>
 </div>

@@ -29,22 +29,22 @@ tags:

 ## Introduction

-**
+**InternLM2-7B-Reward** is a reward model trained on the foundation of InternLM2-Chat-7B-SFT. This model has been trained using over 2.4 million preference samples, both human-annotated and AI-synthesized, achieving outstanding performance while ensuring a balance between helpful and harmless.

 ### Key Features:
-- **Variety of Sizes Available**: Our open-sourced reward models are available in sizes of **1.8B, 7B, and 20B**, each demonstrating exceptional performance across various metrics.
+- **Variety of Sizes Available**: Our open-sourced reward models are available in sizes of **1.8B, 7B, and 20B**, each demonstrating exceptional performance across various metrics. We aim for these different-sized models to facilitate research on the scaling laws of reward models, providing valuable insights to the community.
 - **Comprehensive Coverage of Preference**: Trained with **2.4 million** preference pairs derived from both human annotations and AI synthesis, covering diverse areas such as dialogue, writing, poetry, summarization, coding, mathematics, etc. It also maintains a balance between helpful and harmless.
-- **Multilingual Support**:
+- **Multilingual Support**: InternLM2-Reward was trained on high-quality **English and Chinese** preference data, delivering robust performance in both languages.

-This model was applied to the
+This model was applied to the RLHF training process of InternLM2-Chat. The reward model training techniques from the [InternLM2 Technical Report](https://arxiv.org/abs/2403.17297) have been open-sourced in XTuner, try it out [here](https://github.com/InternLM/xtuner)!

 ## Performance Evaluation on RewardBench

 | Models | Score | Chat | Chat Hard | Safety | Reasoning |
 | --- | --- | --- | --- | --- | --- |
-|
-|
-|
+| InternLM2-20B-Reward | 89.5 | 98.6 | 74.1 | 89.4 | 95.7 |
+| InternLM2-7B-Reward | 86.6 | 98.6 | 66.7 | 88.3 | 92.8 |
+| InternLM2-1.8B-Reward | 80.6 | 95.0 | 58.1 | 81.8 | 87.4 |

 - The evaluation is conducted on the [RewardBench](https://github.com/allenai/reward-bench) dataset.
 - For a fair comparison, conditional system prompts proposed in our technical report were not included during testing.
@@ -60,12 +60,12 @@ import torch
 from transformers import AutoModel, AutoTokenizer

 model = AutoModel.from_pretrained(
-    "internlm/
+    "internlm/internlm2-7b-reward",
     device_map="cuda",
     torch_dtype=torch.float16,
     trust_remote_code=True,
 )
-tokenizer = AutoTokenizer.from_pretrained("internlm/
+tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b-reward", trust_remote_code=True)

 chat_1 = [
     {"role": "user", "content": "Hello! What's your name?"},
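The hunk above ends before the scoring calls that follow in the full README, so the diff never shows the renamed checkpoint actually being queried. As a minimal sketch of that usage, assuming the `get_score` helper provided by the model's `trust_remote_code` implementation (an assumption here; those lines are not part of this diff), one might write:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the renamed reward checkpoint; trust_remote_code pulls in its helper methods.
model = AutoModel.from_pretrained(
    "internlm/internlm2-7b-reward",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b-reward", trust_remote_code=True)

chat_1 = [
    {"role": "user", "content": "Hello! What's your name?"},
    # Illustrative assistant turn; the reply used in the original example is not shown in this diff.
    {"role": "assistant", "content": "Hello! I am an AI assistant. How can I help you today?"},
]

# Assumed helper from the remote code: returns a scalar reward for one chat.
score = model.get_score(tokenizer, chat_1)
print("score:", score)
```

A batched variant such as `get_scores(tokenizer, [chat_1, chat_2])` (also assumed) would score several chats in one call.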
@@ -125,12 +125,12 @@ llm_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trus

 # prepare the reward model and tokenizer
 reward = AutoModel.from_pretrained(
-    "internlm/
+    "internlm/internlm2-7b-reward",
     device_map="cuda",
     torch_dtype=torch.float16,
     trust_remote_code=True,
 )
-reward_tokenizer = AutoTokenizer.from_pretrained("internlm/
+reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b-reward", trust_remote_code=True)

 # prepare the chat prompt
 prompt = "Write an article about the artificial intelligence revolution."
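This hunk only renames the checkpoint inside the Best-of-N example; the candidate generation and selection steps lie outside the diff. As a hedged sketch of the selection step, assuming candidate replies have already been sampled from `internlm/internlm2-chat-7b` and that the reward model's remote code exposes a batched `get_scores` helper (an assumption, since those lines are not shown here):

```python
import torch
from transformers import AutoModel, AutoTokenizer

reward = AutoModel.from_pretrained(
    "internlm/internlm2-7b-reward",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b-reward", trust_remote_code=True)

prompt = "Write an article about the artificial intelligence revolution."

# Placeholder candidates; in the full example these are sampled from internlm2-chat-7b.
candidate_responses = [
    "The artificial intelligence revolution is reshaping how we work and live ...",
    "From research labs to everyday tools, AI has become a general-purpose technology ...",
]

# Wrap each candidate as a (user, assistant) chat and score it with the reward model.
candidate_chats = [
    [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}]
    for response in candidate_responses
]
scores = reward.get_scores(reward_tokenizer, candidate_chats)  # assumed batched helper

# Best-of-N: keep the candidate the reward model scores highest.
best_idx = max(range(len(scores)), key=lambda i: scores[i])
print("best response:", candidate_responses[best_idx])
```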
@@ -191,12 +191,12 @@ The code is licensed under Apache-2.0, while model weights are fully open for ac
 ```
 ## 简介

-**
+**InternLM2-7B-Reward** 是基于 **InternLM2-Chat-7B-SFT** 训练的奖励模型。该模型使用超过 240 万条人工标注和 AI 合成的偏好样本,覆盖了包括对话、写作、诗歌、总结、编码和数学等多个领域。在取得了出色性能的同时也兼顾了实用性和安全性偏好的平衡。

-###
+### InternLM2-Reward 的主要特点:
-- **多种尺寸可供选择**:我们开源的奖励模型有 1.8B、7B 和 20B
+- **多种尺寸可供选择**:我们开源的奖励模型有 **1.8B、7B 和 20B** 三种尺寸,每种尺寸都展示出了卓越的性能。我们希望这些不同大小的模型能够促进社区关于 Reward Model 缩放定律的研究。
-- **全面覆盖偏好**:模型训练了 240
+- **全面覆盖偏好**:模型训练了 **240 万**条来自人工标注和AI合成的偏好样本,涉及对话、写作、诗歌、总结、编码和数学等多个领域,同时确保了实用性和安全性偏好的平衡。
-- **多语言支持**:
+- **多语言支持**:InternLM2-Reward 在高质量的**英文和中文**偏好数据上进行训练,确保了在这两种语言上都有稳健的表现。

 该模型运用在了 InternLM2-Chat 的 PPO 训练过程中。我们的[技术报告](https://arxiv.org/abs/2403.17297)中提出的 Reward Model 训练技巧已在 XTuner 中公开。欢迎点击[链接](https://github.com/InternLM/xtuner)进行尝试!

@@ -204,9 +204,9 @@ The code is licensed under Apache-2.0, while model weights are fully open for ac

 | Models | Score | Chat | Chat Hard | Safety | Reasoning |
 | --- | --- | --- | --- | --- | --- |
-|
-|
-|
+| InternLM2-20B-Reward | 89.5 | 98.6 | 74.1 | 89.4 | 95.7 |
+| InternLM2-7B-Reward | 86.6 | 98.6 | 66.7 | 88.3 | 92.8 |
+| InternLM2-1.8B-Reward | 80.6 | 95.0 | 58.1 | 81.8 | 87.4 |

 - 评估使用了 [RewardBench](https://github.com/allenai/reward-bench) 数据集进行。
 - 为了公平比较,测试期间没有使用我们技术报告中提出的"条件系统提示"。
@@ -215,19 +215,19 @@ The code is licensed under Apache-2.0, while model weights are fully open for ac

 ### 基本用法

-我们为您提供了一些用户友好的 API 以便使用该模型。以下是一些示例,展示如何使用
+我们为您提供了一些用户友好的 API 以便使用该模型。以下是一些示例,展示如何使用 InternLM2-Reward 获取对话的奖励分数、比较两组对话或对多个对话进行排名。

 ```python
 import torch
 from transformers import AutoModel, AutoTokenizer

 model = AutoModel.from_pretrained(
-    "internlm/
+    "internlm/internlm2-7b-reward",
     device_map="cuda",
     torch_dtype=torch.float16,
     trust_remote_code=True,
 )
-tokenizer = AutoTokenizer.from_pretrained("internlm/
+tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b-reward", trust_remote_code=True)

 chat_1 = [
     {"role": "user", "content": "Hello! What's your name?"},
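The next hunk's context line prints a `rank_res` variable from the comparison and ranking part of this example, which the diff itself never shows. A hedged sketch of that part, assuming `compare` and `rank` helpers exposed by the model's remote code (assumed names, inferred from the surrounding prose about comparing two chats and ranking several):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "internlm/internlm2-7b-reward",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b-reward", trust_remote_code=True)

# Two illustrative chats; the actual chat_1 / chat_2 contents are not shown in this diff.
chat_1 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "Hello! I am an AI assistant. How can I help you today?"},
]
chat_2 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "Leave me alone."},
]

# Assumed helpers from the remote code:
compare_res = model.compare(tokenizer, chat_1, chat_2)  # whether chat_1 scores higher than chat_2
rank_res = model.rank(tokenizer, [chat_1, chat_2])      # rank indices; a lower index means a higher score
print("compare_res:", compare_res)
print("rank_res:", rank_res)
```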
@@ -269,7 +269,7 @@ print("rank_res: ", rank_res) # 排名序号越低表示分数越高

 ### Best of N 采样

-以下是如何使用
+以下是如何使用 InternLM2-Reward 执行Best of N 采样的示例。
 以下代码演示了如何从语言模型生成的候选回答中选择最佳回答。

 ```python
@@ -287,12 +287,12 @@ llm_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trus

 # 准备奖励模型和分词器
 reward = AutoModel.from_pretrained(
-    "internlm/
+    "internlm/internlm2-7b-reward",
     device_map="cuda",
     torch_dtype=torch.float16,
     trust_remote_code=True,
 )
-reward_tokenizer = AutoTokenizer.from_pretrained("internlm/
+reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b-reward", trust_remote_code=True)

 # 准备提示词
 prompt = "Write an article about the artificial intelligence revolution."
reward_bench_results/eval-set/{internlm-reward-7b.json → internlm2-7b-reward.json}
RENAMED
@@ -16,7 +16,7 @@
 "llmbar-adver-neighbor": 0.5522388059701493,
 "llmbar-natural": 0.89,
 "math-prm": 0.9194630872483222,
-"model": "internlm/
+"model": "internlm/internlm2-7b-reward",
 "model_type": "Seq. Classifier",
 "mt-bench-easy": 1.0,
 "mt-bench-hard": 0.7567567567567568,
reward_bench_results/pref-sets/{internlm-reward-7b.json → internlm2-7b-reward.json}
RENAMED
@@ -3,7 +3,7 @@
 "anthropic_helpful": 0.6971899224806202,
 "anthropic_hhh": 0.8552036199095022,
 "chat_template": "tokenizer",
-"model": "internlm/
+"model": "internlm/internlm2-7b-reward",
 "model_type": "Seq. Classifier",
 "mtbench_gpt4": 0.89125,
 "mtbench_human": 0.7323397913561848,