Add evaluation result
README.md
CHANGED
@@ -57,3 +57,21 @@ This is an attempt to replicate the RLHF pipeline
 outputs = model.generate(inputs, do_sample=True, top_p=0.8, max_new_tokens=512)
 print(tokenizer.decode(outputs[0]))
 ```
+
+### Evaluations
+
+Result on the English [Vicuna eval set](https://github.com/lm-sys/FastChat/tree/main/fastchat/eval)
+
+ChatGPT score: 662.5; Bloomz score: 535.0 (81%)
+
+| category | generic | knowledge | roleplay | common-sense | fermi | counterfactual | coding | math | writing |
+| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+| chatgpt avg score | 8.05 | 8.15 | 8.30 | 8.10 | 8.30 | 8.10 | 8.29 | 10.0 | 8.45 |
+| bloomz avg score | 7.95 | 8.05 | 6.80 | 6.95 | 4.20 | 6.95 | 6.14 | 3.33 | 7.30 |
+* We don't have access to the GPT-4 API, so these results come from the GPT-4 web interface and may not be exactly the same.
+
+Result on the Chinese [BELLE eval set](https://github.com/LianjiaTech/BELLE/tree/main/eval)
+
+| others | rewrite | classification | generation | summarization | extract | open qa | brainstorming | closed qa | macro ave | macro ave w/o others |
+| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+| 0.617 | 0.900 | 0.715 | 0.932 | 0.733 | 0.597 | 0.537 | 0.899 | 0.552 | 0.720 | 0.733 |
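As a rough check on the arithmetic behind these tables: the headline Vicuna-eval numbers are consistent with summing one GPT-4 rating (1 to 10) per question per model (535.0 / 662.5 ≈ 81%), and the BELLE "macro ave" columns are consistent with unweighted means of the per-category scores. The sketch below is only a sanity check reproducing those aggregates from the table values; the per-category question counts are the standard Vicuna-eval ones and are an assumption here, not taken from this repository's evaluation script.

```python
# Sanity check of the aggregation implied by the evaluation tables.
# Assumption (not from this repo's eval code): each Vicuna-eval question gets one
# GPT-4 rating per model, the headline score is the sum of those ratings, and the
# per-category question counts are the standard Vicuna-eval ones (80 questions total).

vicuna_counts = {"generic": 10, "knowledge": 10, "roleplay": 10, "common-sense": 10,
                 "fermi": 10, "counterfactual": 10, "coding": 7, "math": 3, "writing": 10}
chatgpt_avg = {"generic": 8.05, "knowledge": 8.15, "roleplay": 8.30, "common-sense": 8.10,
               "fermi": 8.30, "counterfactual": 8.10, "coding": 8.29, "math": 10.0, "writing": 8.45}
bloomz_avg = {"generic": 7.95, "knowledge": 8.05, "roleplay": 6.80, "common-sense": 6.95,
              "fermi": 4.20, "counterfactual": 6.95, "coding": 6.14, "math": 3.33, "writing": 7.30}

chatgpt_total = sum(chatgpt_avg[c] * n for c, n in vicuna_counts.items())  # ~662.5
bloomz_total = sum(bloomz_avg[c] * n for c, n in vicuna_counts.items())    # ~535.0
print(f"{chatgpt_total:.1f} vs {bloomz_total:.1f} -> {bloomz_total / chatgpt_total:.0%}")  # ~81%

# BELLE eval: "macro ave" read as the unweighted mean of the per-category scores.
belle = {"others": 0.617, "rewrite": 0.900, "classification": 0.715, "generation": 0.932,
         "summarization": 0.733, "extract": 0.597, "open qa": 0.537,
         "brainstorming": 0.899, "closed qa": 0.552}
macro = sum(belle.values()) / len(belle)                                                # ~0.720
macro_wo_others = sum(v for k, v in belle.items() if k != "others") / (len(belle) - 1)  # ~0.733
print(f"macro ave: {macro:.3f}, w/o others: {macro_wo_others:.3f}")
```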