---
license: bigscience-bloom-rail-1.0
datasets:
- OpenAssistant/oasst1
- RyokoAI/ShareGPT52K
- Dahoas/full-hh-rlhf
- liswei/rm-static-m2m100-zh
- fnlp/moss-002-sft-data
language:
- zh
- en
---

This model is an attempt to replicate the RLHF (Reinforcement Learning from Human Feedback) pipeline.

### Base Model

  We used [bloomz-7b1-mt](https://huggingface.co/bigscience/bloomz-7b1-mt) as the base model because of its less restrictive license and its multilingual ability.

### Supervised Fine-tuning

  For SFT we used a combination of multiple datasets (a minimal data-formatting sketch follows the list), including:
  - [RyokoAI/ShareGPT52K](https://huggingface.co/datasets/RyokoAI/ShareGPT52K)
  - [GPTeacher](https://github.com/teknium1/GPTeacher)
  - [Alpaca-GPT4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) en & zh
  - A filtered subset of the ShareGPT dataset machine-translated into Chinese
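
  The exact preprocessing scripts are not included here. Below is a minimal sketch, assuming a ShareGPT-style JSON dump (a list of samples whose `conversations` field holds `{"from", "value"}` turns); the file name and field names are illustrative assumptions, and each source needs its own mapping.

  ```python
  import json

  # Hypothetical local dump of one ShareGPT-style source; the schema is assumed for illustration.
  with open("sharegpt_subset.json") as f:
      samples = json.load(f)

  def to_training_text(sample):
      # Flatten one multi-turn conversation into a single training string.
      return "\n".join(f'{turn["from"]}: {turn["value"]}' for turn in sample["conversations"])

  sft_texts = [to_training_text(s) for s in samples]
  ```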

### Reward Model

  For the RM we used the code from the [reward-modeling](https://github.com/Dahoas/reward-modeling) repo and the datasets below (a sketch of the pairwise ranking objective is given after the list):
  - [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)
  - [Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf)
  - [liswei/rm-static-m2m100-zh](https://huggingface.co/datasets/liswei/rm-static-m2m100-zh)
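
  The exact objective from that repo is not reproduced here; the sketch below shows the standard pairwise ranking loss on chosen/rejected comparison data (the format used by full-hh-rlhf), with a scalar head on top of the base model. Class and variable names are illustrative, not taken from the repo.

  ```python
  import torch
  import torch.nn.functional as F
  from transformers import AutoModel

  class RewardModel(torch.nn.Module):
      def __init__(self, base_name="bigscience/bloomz-7b1-mt"):
          super().__init__()
          self.backbone = AutoModel.from_pretrained(base_name)
          # Scalar head that maps a hidden state to a reward score.
          self.score_head = torch.nn.Linear(self.backbone.config.hidden_size, 1)

      def forward(self, input_ids, attention_mask):
          hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
          # Score each sequence at its last non-padding token.
          last_index = attention_mask.sum(dim=1) - 1
          pooled = hidden[torch.arange(hidden.size(0)), last_index]
          return self.score_head(pooled).squeeze(-1)

  def pairwise_loss(reward_chosen, reward_rejected):
      # Encourage r(chosen) > r(rejected): -log sigmoid(r_chosen - r_rejected).
      return -F.logsigmoid(reward_chosen - reward_rejected).mean()
  ```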

### Reinforcement Learning

  For RL we used the code of [trlx](https://github.com/CarperAI/trlx) with slight modifications.

  Instead of building the value network on top of the policy network with a single linear layer, we add another hydra head on top of the reference network's frozen bottom layers to serve as the value network.
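
  A conceptual sketch of that value branch is given below. It uses generic layer modules rather than the actual BLOOM blocks (which also take attention masks and alibi tensors) and only illustrates the parameter sharing: frozen bottom layers shared with the reference model, a trainable copy of the top layers as the extra hydra head, and a scalar value head.

  ```python
  import copy
  import torch
  import torch.nn as nn

  class HydraValueNetwork(nn.Module):
      def __init__(self, bottom_layers: nn.ModuleList, top_layers: nn.ModuleList, hidden_size: int):
          super().__init__()
          # Bottom layers are shared with the reference model and kept frozen.
          self.bottom = bottom_layers
          for p in self.bottom.parameters():
              p.requires_grad = False
          # The value branch gets its own trainable copy of the top layers (the "hydra head").
          self.value_top = copy.deepcopy(top_layers)
          self.value_head = nn.Linear(hidden_size, 1)

      def forward(self, hidden_states):
          with torch.no_grad():
              for layer in self.bottom:
                  hidden_states = layer(hidden_states)
          for layer in self.value_top:
              hidden_states = layer(hidden_states)
          # One value estimate per token position.
          return self.value_head(hidden_states).squeeze(-1)
  ```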

### Example

  We used the Vicuna v1.1 template during training, so the same format should be used at inference time:

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  checkpoint = "keyfan/bloomz-rlhf"

  tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  model = AutoModelForCausalLM.from_pretrained(checkpoint).cuda()

  template = ("A chat between a curious human and an artificial intelligence assistant. "
              "The assistant gives helpful, detailed, and polite answers to the human's questions. "
              "USER: {}\nASSISTANT:")
  question = template.format("Who was the president of the United States in 1955?")
  inputs = tokenizer.encode(question, return_tensors="pt").cuda()
  outputs = model.generate(inputs, do_sample=True, top_p=0.8, max_new_tokens=512)
  print(tokenizer.decode(outputs[0]))
  ```

### Evaluation

Results on the Chinese [BELLE eval set](https://github.com/LianjiaTech/BELLE/tree/main/eval):

| others | rewrite | classification | generation | summarization | extract | open qa | brainstorming | closed qa | macro ave | macro ave w/o others |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0.619 | 0.873 | 0.706 | 0.934 | 0.755 | 0.619 | 0.527 | 0.908 | 0.615 | 0.728 | 0.742 |

* We found that in GPT-4 evaluation the order in which the responses are presented has a non-negligible effect on the final score, even with the carefully designed Vicuna prompt, so we removed the score on the Vicuna eval set.
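
One common way to reduce this kind of position bias (not applied to the table above) is to query the judge twice with the answer order swapped and average the two scores. The `judge` function below is a hypothetical stand-in for the GPT-4 evaluation call.

```python
# `judge(prompt, answer_a, answer_b)` is a hypothetical evaluation call that returns
# (score_a, score_b) for the two answers in the order they were presented.
def debiased_scores(judge, prompt, answer_1, answer_2):
    s1_first, s2_second = judge(prompt, answer_1, answer_2)  # answer_1 shown first
    s2_first, s1_second = judge(prompt, answer_2, answer_1)  # answer_2 shown first
    return (s1_first + s1_second) / 2, (s2_first + s2_second) / 2
```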