Add evaluation result
README.md
CHANGED
@@ -57,3 +57,21 @@ This is an attempt to replicate the RLHF pipeline
 outputs = model.generate(inputs, do_sample=True, top_p=0.8, max_new_tokens=512)
 print(tokenizer.decode(outputs[0]))
 ```
+
+### Evaluations
+
+Result on the English [Vicuna eval set](https://github.com/lm-sys/FastChat/tree/main/fastchat/eval)
+
+ChatGPT score: 662.5; Bloomz score: 535.0 (81%)
+
+| category | generic | knowledge | roleplay | common-sense | fermi | counterfactual | coding | math | writing |
+| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+| chatgpt avg score | 8.05 | 8.15 | 8.30 | 8.10 | 8.30 | 8.10 | 8.29 | 10.0 | 8.45 |
+| bloomz avg score | 7.95 | 8.05 | 6.80 | 6.95 | 4.20 | 6.95 | 6.14 | 3.33 | 7.30 |
+* We don't have access to the GPT-4 API, so these results come from the GPT-4 web interface and may not be exactly the same.
+
+Result on the Chinese [BELLE eval set](https://github.com/LianjiaTech/BELLE/tree/main/eval)
+
+| others | rewrite | classification | generation | summarization | extract | open qa | brainstorming | closed qa | macro ave | macro ave w/o others |
+| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+| 0.617 | 0.900 | 0.715 | 0.932 | 0.733 | 0.597 | 0.537 | 0.899 | 0.552 | 0.720 | 0.733 |
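As a rough check on the arithmetic behind these tables: the headline Vicuna-eval numbers are consistent with summing one GPT-4 rating (1 to 10) per question per model (535.0 / 662.5 ≈ 81%), and the BELLE "macro ave" columns are consistent with unweighted means of the per-category scores. The sketch below is only a sanity check reproducing those aggregates from the table values; the per-category question counts are the standard Vicuna-eval ones and are an assumption here, not taken from this repository's evaluation script.

```python
# Sanity check of the aggregation implied by the evaluation tables.
# Assumption (not from this repo's eval code): each Vicuna-eval question gets one
# GPT-4 rating per model, the headline score is the sum of those ratings, and the
# per-category question counts are the standard Vicuna-eval ones (80 questions total).

vicuna_counts = {"generic": 10, "knowledge": 10, "roleplay": 10, "common-sense": 10,
                 "fermi": 10, "counterfactual": 10, "coding": 7, "math": 3, "writing": 10}
chatgpt_avg = {"generic": 8.05, "knowledge": 8.15, "roleplay": 8.30, "common-sense": 8.10,
               "fermi": 8.30, "counterfactual": 8.10, "coding": 8.29, "math": 10.0, "writing": 8.45}
bloomz_avg = {"generic": 7.95, "knowledge": 8.05, "roleplay": 6.80, "common-sense": 6.95,
              "fermi": 4.20, "counterfactual": 6.95, "coding": 6.14, "math": 3.33, "writing": 7.30}

chatgpt_total = sum(chatgpt_avg[c] * n for c, n in vicuna_counts.items())  # ~662.5
bloomz_total = sum(bloomz_avg[c] * n for c, n in vicuna_counts.items())    # ~535.0
print(f"{chatgpt_total:.1f} vs {bloomz_total:.1f} -> {bloomz_total / chatgpt_total:.0%}")  # ~81%

# BELLE eval: "macro ave" read as the unweighted mean of the per-category scores.
belle = {"others": 0.617, "rewrite": 0.900, "classification": 0.715, "generation": 0.932,
         "summarization": 0.733, "extract": 0.597, "open qa": 0.537,
         "brainstorming": 0.899, "closed qa": 0.552}
macro = sum(belle.values()) / len(belle)                                                # ~0.720
macro_wo_others = sum(v for k, v in belle.items() if k != "others") / (len(belle) - 1)  # ~0.733
print(f"macro ave: {macro:.3f}, w/o others: {macro_wo_others:.3f}")
```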