keyfan committed commit 358c051 (parent: b250fa3)

Add evaluation result
Files changed (1):
  1. README.md +18 -0
README.md CHANGED
@@ -57,3 +57,21 @@ This is an attempt to replicate the RLHF pipeline
57   outputs = model.generate(inputs, do_sample=True, top_p=0.8, max_new_tokens=512)
58   print(tokenizer.decode(outputs[0]))
59   ```
60 +
61 + ### Evaluations
62 +
63 + Results on the English [Vicuna eval set](https://github.com/lm-sys/FastChat/tree/main/fastchat/eval)
64 +
65 + ChatGPT score: 662.5; Bloomz score: 535.0 (81%)
66 +
67 + | category | generic | knowledge | roleplay | common-sense | fermi | counterfactual | coding | math | writing |
68 + | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
69 + | chatgpt avg score | 8.05 | 8.15 | 8.30 | 8.10 | 8.30 | 8.10 | 8.29 | 10.0 | 8.45 |
70 + | bloomz avg score | 7.95 | 8.05 | 6.80 | 6.95 | 4.20 | 6.95 | 6.14 | 3.33 | 7.30 |
71 + * We don't have access to the GPT-4 API, so these results come from the GPT-4 web interface and may not match the API exactly.
72 +
73 + Results on the Chinese [BELLE eval set](https://github.com/LianjiaTech/BELLE/tree/main/eval)
74 +
75 + | others | rewrite | classification | generation | summarization | extract | open qa | brainstorming | closed qa | macro avg | macro avg w/o others |
76 + | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
77 + | 0.617 | 0.900 | 0.715 | 0.932 | 0.733 | 0.597 | 0.537 | 0.899 | 0.552 | 0.720 | 0.733 |
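
The aggregate numbers in the added section are internally consistent: 535.0 / 662.5 ≈ 81%, and the BELLE macro averages equal the per-category means from the last table. A minimal sanity-check sketch (variable names are illustrative, not from the repo):

```python
# Verify the headline numbers from the evaluation tables above.
chatgpt_total = 662.5   # total ChatGPT score on the Vicuna eval set
bloomz_total = 535.0    # total Bloomz score

# Relative score: 535.0 / 662.5 ≈ 0.8075, i.e. the reported 81%.
print(f"relative score: {bloomz_total / chatgpt_total:.0%}")  # → 81%

# BELLE per-category scores, in table order ("others" first).
belle = [0.617, 0.900, 0.715, 0.932, 0.733, 0.597, 0.537, 0.899, 0.552]
print(f"macro avg:            {sum(belle) / len(belle):.3f}")          # → 0.720
print(f"macro avg w/o others: {sum(belle[1:]) / len(belle[1:]):.3f}")  # → 0.733
```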