Romain-Cosentino committed on
Commit b7e48ae
1 Parent(s): 88d0360

readme update

Files changed (1)
  1. README.md +2 -36
README.md CHANGED
@@ -82,43 +82,9 @@ MT-Bench is a benchmark made up of 80 high-quality multi-turn questions. These q
 
 ![hexplot.png](assets/hexplot.png)
 
- ### Comparison with additional Open LLM LeaderBoard models
- | Model | First Turn | Second Turn | Average |
- | --- | --- | --- | --- |
- | TenyxChat-8x7B-v1 | 8.45000 | 7.756250 | 8.103125 |
- | SamirGPT-v1 | 8.05000 | 7.612500 | 7.831250 |
- | FernandoGPT-v1 | 8.08125 | 7.256250 | 7.668750 |
- | Go-Bruins-v2 | 8.13750 | 7.150000 | 7.643750 |
- | mistral_tv-neural-marconroni | 7.76875 | 6.987500 | 7.378125 |
- | neuronovo-7B-v0.2 | 7.73750 | 6.662500 | 7.200000 |
- | neural-chat-7b-v3-3 | 7.39375 | 5.881250 | 6.637500 |
-
- ## LM Evaluation - Open LLM Leaderboard
-
- We assess models on 7 benchmarks using the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness). This setup is based of that used for [Open LLM Leaderboard.](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
-
- - [AI2 Reasoning Challenge](https://arxiv.org/abs/1803.05457) (25-shot) - grade-school science questions.
- - [HellaSwag](https://arxiv.org/abs/1905.07830) (10-shot) - commonsense inference test.
- - [MMLU](https://arxiv.org/abs/2009.03300) (5-shot) - multitask accuracy test covering 57 tasks.
- - [TruthfulQA](https://arxiv.org/abs/2109.07958) (0-shot) - test measuring model's propensity to reproduce online falsehoods.
- - [Winogrande](https://arxiv.org/abs/1907.10641) (5-shot) - Winograd benchmark for commonsense reasoning.
- - [GSM8k](https://arxiv.org/abs/2110.14168) (5-shot) - grade school math word problems test.
-
- These benchmarks test reasoning and knowledge in various tasks in few-shot settings (higher scores are better).
-
- | Model | MMLU | Winogrande | GSM8k | ARC | HellaSwag | TruthfulQA | Average |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | TenyxChat-8x7B-v1 | 63.6 | 72.3 | 69.0 | 62.7 | 66.6 | 46.7 | 63.48 |
- | Starling-7B-alpha | 63.5 | 72.1 | 67.9 | 61.1 | 66.1 | 42.1 | 62.13 |
- | OpenChat-3.5 | 63.6 | 72.1 | 68.2 | 61.3 | 65.2 | 41.8 | 62.03 |
- | Mistral-7B | 62.4 | 74.0 | 38.1 | 57.2 | 62.8 | 37.8 | 55.38 |
- | OpenLLM Leader-7B | 64.3 | 78.7 | 73.3 | 66.6 | 68.4 | 58.5 | 68.3 |
-
- **Note:** While the Open LLM Leaderboard indicates that these chat models perform less effectively compared to the leading 7B model, it's important to note that the leading model struggles in the multi-turn chat setting of MT-Bench (as demonstrated in our evaluation [above](#comparison-with-additional-open-llm-leaderboard-models)). In contrast, TenyxChat-8x7B-v1 demonstrates robustness against common fine-tuning challenges, such as *catastrophic forgetting*. This unique feature enables TenyxChat-8x7B-v1 to excel not only in chat benchmarks like MT-Bench, but also in a wider range of general reasoning benchmarks on the Open LLM Leaderboard.
-
 # Limitations
 
- TenyxChat-8x7B-v1, like other small-sized language models, has its own set of limitations. We haven’t fine-tuned the model explicitly to align with **human** safety preferences. Therefore, it is capable of producing undesirable outputs, particularly when adversarially prompted. From our observation, the model still tends to struggle with tasks that involve reasoning and math questions. In some instances, it might generate verbose or extraneous content.
+ TenyxChat-8x7B-v1, like other medium-sized language models, has its own set of limitations. We haven’t fine-tuned the model explicitly to align with **human** safety preferences. Therefore, it is capable of producing undesirable outputs, particularly when adversarially prompted. From our observation, the model still tends to struggle with tasks that involve reasoning and math questions. In some instances, it might generate verbose or extraneous content.
 
 # License
 
@@ -126,7 +92,7 @@ TenyxChat-8x7B-v1, similar to Mixtral-8x7B-Instruct-v0.1 , is distributed under
 
 # Citation
 
- If you use TenyxChat-7B for your research, cite us as
+ If you use TenyxChat-8x7B-v1 for your research, cite us as
 
 ```
 @misc{tenyxchat2024,
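
The removed section above describes scoring the model on the Open LLM Leaderboard benchmarks with the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness). Below is a minimal sketch of how such a run could be scripted with the harness's Python API; the harness version (v0.4.x `simple_evaluate`), the task names, and the Hugging Face model id `tenyx/TenyxChat-8x7B-v1` are assumptions and may differ from the exact leaderboard configuration.

```python
# Sketch only: per-benchmark few-shot evaluation in the style described above,
# using EleutherAI's lm-evaluation-harness (pip install lm-eval, v0.4.x assumed).
# The model id and task names are assumptions and may need adjusting.
import lm_eval

# (task, few-shot count) pairs mirroring the shot settings listed in the removed section
TASKS = [
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),
    ("winogrande", 5),
    ("gsm8k", 5),
]

MODEL_ARGS = "pretrained=tenyx/TenyxChat-8x7B-v1,dtype=bfloat16"  # assumed HF repo id

for task, shots in TASKS:
    out = lm_eval.simple_evaluate(
        model="hf",              # Hugging Face transformers backend
        model_args=MODEL_ARGS,
        tasks=[task],
        num_fewshot=shots,
        batch_size=4,
    )
    print(task, out["results"].get(task))
```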