keyfan committed
Commit 1dc5a0a
1 Parent(s): 358c051

Update evaluation result

README.md CHANGED
@@ -14,7 +14,7 @@ language:
 This is an attempt to replicate the RLHF pipeline
 
 ### Base Model
-
+
 We used [bloomz-7b1-mt](https://huggingface.co/bigscience/bloomz-7b1-mt) because of its less restrictive license and multilingual ability.
 
 ### Supervised Finetune
@@ -34,8 +34,9 @@ This is an attempt to replicate the RLHF pipeline
 
 ### Reinforcement Learning
 
-For RL we used the code of [trlx](https://github.com/CarperAI/trlx) and prompts from
-- [fnlp/moss-002-sft-data](https://huggingface.co/datasets/fnlp/moss-002-sft-data/tree/main)
+For RL we used the code of [trlx](https://github.com/CarperAI/trlx) with a slight modification.
+
+Instead of building the value network from a single linear layer on top of the policy network, we add another hydra head on top of the reference network's frozen bottom layers to serve as the value network.
 
 ### Example
 
@@ -60,18 +61,10 @@ This is an attempt to replicate the RLHF pipeline
 
 ### Evaluations
 
-Result on the English [Vicuna eval set](https://github.com/lm-sys/FastChat/tree/main/fastchat/eval)
-
-ChatGPT score: 662.5; Bloomz score: 535.0 (81%)
-
-| category | generic | knowledge | roleplay | common-sense | fermi | counterfactual | coding | math | writing |
-| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| chatgpt avg score | 8.05 | 8.15 | 8.30 | 8.10 | 8.30 | 8.10 | 8.29 | 10.0 | 8.45 |
-| bloomz avg score | 7.95 | 8.05 | 6.80 | 6.95 | 4.20 | 6.95 | 6.14 | 3.33 | 7.30 |
-* We don't have access to the GPT-4 API, so the result comes from the GPT-4 web interface, which may not be exactly the same.
-
 Result on the Chinese [BELLE eval set](https://github.com/LianjiaTech/BELLE/tree/main/eval)
 
 | others | rewrite | classification | generation | summarization | extract | open qa | brainstorming | closed qa | macro ave | macro ave w/o others |
 | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| 0.617 | 0.900 | 0.715 | 0.932 | 0.733 | 0.597 | 0.537 | 0.899 | 0.552 | 0.720 | 0.733 |
+| 0.619 | 0.873 | 0.706 | 0.934 | 0.755 | 0.619 | 0.527 | 0.908 | 0.615 | 0.728 | 0.742 |
+
+* We found that in GPT-4 evaluation the order in which the responses are presented has a non-negligible effect on the final score, even with the well-designed Vicuna prompt, so we removed the scores on the Vicuna eval set.
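
For readers unfamiliar with the hydra layout, here is a minimal PyTorch sketch of the value-network change described above: instead of a single linear head on the policy, the value network is its own branch of top transformer layers over the shared frozen trunk, ending in a scalar projection. The class name, layer counts, and use of `nn.TransformerEncoderLayer` are illustrative assumptions, not the actual trlx modification.

```python
import copy

import torch
import torch.nn as nn


class HydraValueBranch(nn.Module):
    """Illustrative value 'head' that is itself a stack of transformer
    layers branching off the shared frozen trunk, plus a scalar head."""

    def __init__(self, top_layers: nn.ModuleList, hidden_size: int):
        super().__init__()
        # Clone the top layers for the value branch, so value gradients
        # never flow into the policy's own top layers.
        self.layers = copy.deepcopy(top_layers)
        self.v_head = nn.Linear(hidden_size, 1)

    def forward(self, trunk_hidden: torch.Tensor) -> torch.Tensor:
        h = trunk_hidden
        for layer in self.layers:
            h = layer(h)
        return self.v_head(h).squeeze(-1)  # (batch, seq_len) token values


if __name__ == "__main__":
    hidden, n_bottom, n_top = 64, 4, 2
    # Shared bottom layers ("trunk"), frozen as in the hydra reference model.
    trunk = nn.ModuleList(
        nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        for _ in range(n_bottom)
    )
    for p in trunk.parameters():
        p.requires_grad = False
    # The policy keeps its own trainable top layers; the value branch
    # starts from a copy of them.
    policy_top = nn.ModuleList(
        nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        for _ in range(n_top)
    )
    value_net = HydraValueBranch(policy_top, hidden)

    x = torch.randn(2, 10, hidden)  # stand-in for token embeddings
    with torch.no_grad():
        for layer in trunk:
            x = layer(x)  # frozen trunk forward pass, shared by all heads
    print(value_net(x).shape)  # torch.Size([2, 10])
```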
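
The order-sensitivity note refers to position bias in LLM-as-judge scoring. One common way to quantify or cancel it, sketched below, is to query the judge in both presentation orders and average the scores; the `judge` callable is a hypothetical stand-in for a GPT-4 call with the Vicuna reviewer prompt, not real API code, and this mitigation is not something the commit itself applies.

```python
from typing import Callable, Tuple

# A judge takes (question, first_answer, second_answer) and returns
# (score_for_first, score_for_second), e.g. parsed from the reviewer
# model's reply to the Vicuna evaluation prompt.
Judge = Callable[[str, str, str], Tuple[float, float]]


def score_both_orders(
    question: str, answer_a: str, answer_b: str, judge: Judge
) -> Tuple[float, float]:
    """Average each answer's score over both presentation orders."""
    a_first, b_second = judge(question, answer_a, answer_b)
    b_first, a_second = judge(question, answer_b, answer_a)
    # With an order-insensitive judge, a_first == a_second; any gap is
    # position bias, which this averaging cancels to first order.
    return (a_first + a_second) / 2, (b_first + b_second) / 2
```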
pytorch_model-00001-of-00002.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a9bce24cc1e1b2bc1d4d7a38399ce5666cb955ecc8fe1a106ea883de58c99f1c
-size 10848957472
+oid sha256:82d67a77ce1b7f68d40c5a97de78feaa151e094be4938290d26e6ccc1e46ec1c
+size 18542818872
pytorch_model-00002-of-00002.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5fb6f7ea9c92823246b6eaedc807a60539caf1a43010d6d07a6e4a47c01dbe34
-size 6284244953
+oid sha256:4d331cabaef2b2e58e833cf587ad12d4a1bb8085de7b27c08764ce1a21144ce8
+size 11561532465