Romain-Cosentino committed
Commit b7e48ae • 1 Parent(s): 88d0360
readme update

README.md CHANGED
@@ -82,43 +82,9 @@ MT-Bench is a benchmark made up of 80 high-quality multi-turn questions. These q

![hexplot.png](assets/hexplot.png)

-### Comparison with additional Open LLM LeaderBoard models
-
-| Model | First Turn | Second Turn | Average |
-| --- | --- | --- | --- |
-| TenyxChat-8x7B-v1 | 8.45000 | 7.756250 | 8.103125 |
-| SamirGPT-v1 | 8.05000 | 7.612500 | 7.831250 |
-| FernandoGPT-v1 | 8.08125 | 7.256250 | 7.668750 |
-| Go-Bruins-v2 | 8.13750 | 7.150000 | 7.643750 |
-| mistral_tv-neural-marconroni | 7.76875 | 6.987500 | 7.378125 |
-| neuronovo-7B-v0.2 | 7.73750 | 6.662500 | 7.200000 |
-| neural-chat-7b-v3-3 | 7.39375 | 5.881250 | 6.637500 |
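The Average column in this table is simply the mean of the first-turn and second-turn scores. As a quick check, here is an illustrative helper (not code from this repository):

```python
# Illustrative only: the MT-Bench "Average" column above is the mean of the
# first-turn and second-turn scores for each model.
def mt_bench_average(first_turn: float, second_turn: float) -> float:
    return (first_turn + second_turn) / 2

# TenyxChat-8x7B-v1: (8.45000 + 7.756250) / 2 = 8.103125
print(mt_bench_average(8.45000, 7.756250))
```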
-
-## LM Evaluation - Open LLM Leaderboard
-
-We assess models on 6 benchmarks using the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness). This setup is based on the one used for the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
-
-- [AI2 Reasoning Challenge](https://arxiv.org/abs/1803.05457) (25-shot) - grade-school science questions.
-- [HellaSwag](https://arxiv.org/abs/1905.07830) (10-shot) - commonsense inference test.
-- [MMLU](https://arxiv.org/abs/2009.03300) (5-shot) - multitask accuracy test covering 57 tasks.
-- [TruthfulQA](https://arxiv.org/abs/2109.07958) (0-shot) - test measuring a model's propensity to reproduce online falsehoods.
-- [Winogrande](https://arxiv.org/abs/1907.10641) (5-shot) - Winograd benchmark for commonsense reasoning.
-- [GSM8k](https://arxiv.org/abs/2110.14168) (5-shot) - grade-school math word problems test.
-
-These benchmarks test reasoning and knowledge across a variety of tasks in few-shot settings (higher scores are better).
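For reference, a minimal sketch of running these benchmarks with the harness's Python API is shown below. It assumes lm-evaluation-harness v0.4+ (`lm_eval.simple_evaluate`), present-day harness task names, and the `tenyx/TenyxChat-8x7B-v1` model id; the Open LLM Leaderboard pins its own harness revision and task configurations, so scores obtained this way may not match the table exactly.

```python
# Sketch only: evaluate with EleutherAI's lm-evaluation-harness (v0.4+ Python API).
# The task names and model id below are assumptions, not taken from this README.
import lm_eval

# (harness task name, few-shot setting) pairs mirroring the list above.
TASKS = [
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),
    ("winogrande", 5),
    ("gsm8k", 5),
]

for task, n_shot in TASKS:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=tenyx/TenyxChat-8x7B-v1,dtype=bfloat16",
        tasks=[task],
        num_fewshot=n_shot,
    )
    # results["results"][task] holds the per-task metrics (e.g. acc, acc_norm).
    print(task, results["results"][task])
```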
-
-| Model | MMLU | Winogrande | GSM8k | ARC | HellaSwag | TruthfulQA | Average |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| TenyxChat-8x7B-v1 | 63.6 | 72.3 | 69.0 | 62.7 | 66.6 | 46.7 | 63.48 |
-| Starling-7B-alpha | 63.5 | 72.1 | 67.9 | 61.1 | 66.1 | 42.1 | 62.13 |
-| OpenChat-3.5 | 63.6 | 72.1 | 68.2 | 61.3 | 65.2 | 41.8 | 62.03 |
-| Mistral-7B | 62.4 | 74.0 | 38.1 | 57.2 | 62.8 | 37.8 | 55.38 |
-| OpenLLM Leader-7B | 64.3 | 78.7 | 73.3 | 66.6 | 68.4 | 58.5 | 68.3 |
-
-**Note:** While the Open LLM Leaderboard indicates that these chat models perform less effectively than the leading 7B model, that leading model struggles in the multi-turn chat setting of MT-Bench (as demonstrated in our evaluation [above](#comparison-with-additional-open-llm-leaderboard-models)). In contrast, TenyxChat-8x7B-v1 is robust to common fine-tuning failure modes such as *catastrophic forgetting*, which lets it perform well not only on chat benchmarks like MT-Bench but also across the wider range of general reasoning benchmarks on the Open LLM Leaderboard.

# Limitations

-TenyxChat-8x7B-v1, like other
+TenyxChat-8x7B-v1, like other medium-sized language models, has its own set of limitations. We haven't fine-tuned the model explicitly to align with **human** safety preferences. Therefore, it is capable of producing undesirable outputs, particularly when adversarially prompted. From our observations, the model still tends to struggle with tasks that involve reasoning and math. In some instances, it might generate verbose or extraneous content.

# License
@@ -126,7 +92,7 @@ TenyxChat-8x7B-v1, similar to Mixtral-8x7B-Instruct-v0.1, is distributed under
# Citation

-If you use TenyxChat-
+If you use TenyxChat-8x7B-v1 for your research, cite us as

```
@misc{tenyxchat2024,