Update README.md
README.md CHANGED

@@ -18,7 +18,7 @@ This model doesn't dramatically improve on the base model's general task performance
 
 # Evaluations
 
-We've only done
+We've only done limited testing as yet. The [epoch 3.5 checkpoint](https://huggingface.co/Open-Orca/oo-phi-1_5/commit/f7754d8b8b4c3e0748eaf47be4cf5aac1f80a401) scores above 5.1 on MT-Bench (better than Alpaca-13B, worse than Llama2-7b-chat), while preliminary benchmarks suggest peak average performance was achieved roughly at epoch 4.
 
 ## HuggingFaceH4 Open LLM Leaderboard Performance
 
@@ -29,6 +29,9 @@ The only significant improvement was with TruthfulQA.
 
 ## MT-bench Performance
 
+
+![MT-bench Score](https://huggingface.co/Open-Orca/oo-phi-1_5/resolve/main/Images/oo-phi-1_5-mtbench.png)
+
 | Epoch | Average | Turn 1 | Turn 2 |
 |:----------|:----------|:----------|:----------|
 | 3 | 4.85 | 5.69 | 4.01 |
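For readers who want to evaluate the linked epoch 3.5 checkpoint themselves, here is a minimal sketch of loading it by pinning that commit hash as the `revision`. This assumes the Hugging Face `transformers` library and that the repo may require `trust_remote_code=True` (common for phi-based models); the exact loading flags are not specified in this commit.

```python
# Minimal sketch (assumption: standard transformers loading works for this repo).
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "Open-Orca/oo-phi-1_5"
# Epoch 3.5 commit hash taken from the link in the README change above.
REVISION = "f7754d8b8b4c3e0748eaf47be4cf5aac1f80a401"

tokenizer = AutoTokenizer.from_pretrained(REPO, revision=REVISION, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(REPO, revision=REVISION, trust_remote_code=True)

# Quick smoke test: generate a short completion from the pinned checkpoint.
prompt = "Explain what MT-Bench measures in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```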