Update README.md
README.md
@@ -144,6 +144,7 @@ assert tokens == [1, 420, 6316, 28781, 3198, 3123, 1247, 28747, 22557, 32000, 42

<div align="center">
<h2> (Experimental) Evaluator / Feedback Capabilities </h2>
</div>

We've included evaluator capabilities in this release to advance open-source models as evaluators. You can use `Default Mode (GPT4 Correct)` with the following prompt (same as [Prometheus](https://huggingface.co/datasets/kaist-ai/Feedback-Collection)) to evaluate a response.

```
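For reference, here is a minimal sketch (not from this repository) of sending a filled-in evaluation prompt to the model, assuming it is served behind an OpenAI-compatible endpoint such as the one provided by the OpenChat serving code; the base URL, API key, and model name below are placeholders:

```python
# Minimal sketch (assumptions, not project code): query the model as an evaluator
# through an OpenAI-compatible endpoint. base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18888/v1", api_key="not-needed")

# Paste the rubric template above, filled in with your instruction, the response
# to evaluate, a reference answer, and the score descriptions.
evaluation_prompt = "<filled-in evaluation prompt goes here>"

completion = client.chat.completions.create(
    model="openchat_3.5",  # placeholder model name
    messages=[{"role": "user", "content": evaluation_prompt}],
    temperature=0.0,
)
print(completion.choices[0].message.content)
```

The reply is expected to contain written feedback followed by a score on the rubric's 1-5 scale.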
@@ -191,6 +192,7 @@ Score 5: {orig_score5_description}

<details>
<summary>Evaluation Details (click to expand)</summary>

*: ChatGPT (March) results are from [GPT-4 Technical Report](https://arxiv.org/abs/2303.08774), [Chain-of-Thought Hub](https://github.com/FranxYao/chain-of-thought-hub), and our evaluation. Please note that ChatGPT is not a fixed baseline and evolves rapidly over time.

^: Zephyr-β often fails to follow few-shot CoT instructions, likely because it was aligned with only chat data but not trained on few-shot data.
@@ -198,6 +200,7 @@ Score 5: {orig_score5_description}

**: Mistral and Open-source SOTA results are taken from reported results in instruction-tuned model papers and official repositories.

All models are evaluated in chat mode (e.g. with the respective conversation template applied). All zero-shot benchmarks follow the same setting as in the AGIEval and Orca papers. CoT tasks use the same configuration as Chain-of-Thought Hub, HumanEval is evaluated with EvalPlus, and MT-bench is run using FastChat. To reproduce our results, follow the instructions in [our repository](https://github.com/imoneoi/openchat/#benchmarks).
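As a small illustration of what "chat mode" means here (this is not part of the benchmark harness), the snippet below applies a model's conversation template before generation using Hugging Face transformers; the model id is a placeholder:

```python
# Illustration only: "chat mode" means the conversation template is applied to
# every benchmark question before the model sees it. Model id is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openchat/openchat_3.5")
messages = [{"role": "user", "content": "What is 17 * 24?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # the exact string the model is scored on, template tokens included
```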

</details>
<div>
<h3>HumanEval+</h3>