mlabonne committed
Commit 3ce28fc
1 Parent(s): 83c8851

Update README.md

Files changed (1)
  1. README.md +83 -0
README.md CHANGED
@@ -14,6 +14,89 @@ This model is a Mixture of Experts (MoE) made with [mergekit](https://github.com/
  * [maywell/PiVoT-0.1-Starling-LM-RP](https://huggingface.co/maywell/PiVoT-0.1-Starling-LM-RP)
  * [WizardLM/WizardMath-7B-V1.1](https://huggingface.co/WizardLM/WizardMath-7B-V1.1)

+ ## 🏆 Evaluation
+
+ | Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
+ |--------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
+ |[Beyonder-4x7B-v2](https://huggingface.co/shadowml/Beyonder-4x7B-v2)| 45.29| 75.95| 60.86| 46.4| 57.13|
+ |[NeuralHermes-2.5-Mistral-7B](https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B)| 43.67| 73.24| 55.37| 41.76| 53.51|
+ |[OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B)| 42.75| 72.99| 52.99| 40.94| 52.42|
+
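For reference, each model's Average column is the plain mean of its four benchmark scores. A minimal check in plain Python, with the values copied from the table above (the table rounds half up, e.g. 57.125 → 57.13):

```python
# Sanity check: each "Average" is the mean of the four benchmark scores.
scores = {
    "Beyonder-4x7B-v2":            [45.29, 75.95, 60.86, 46.40],
    "NeuralHermes-2.5-Mistral-7B": [43.67, 73.24, 55.37, 41.76],
    "OpenHermes-2.5-Mistral-7B":   [42.75, 72.99, 52.99, 40.94],
}
for name, vals in scores.items():
    print(f"{name}: {sum(vals) / len(vals):.4f}")
# Beyonder-4x7B-v2: 57.1250             (reported as 57.13)
# NeuralHermes-2.5-Mistral-7B: 53.5100  (reported as 53.51)
# OpenHermes-2.5-Mistral-7B: 52.4175    (reported as 52.42)
```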
+ ### AGIEval
+ | Task |Version| Metric |Value| |Stderr|
+ |------------------------------|------:|--------|----:|---|-----:|
+ |agieval_aqua_rat | 0|acc |23.62|± | 2.67|
+ | | |acc_norm|23.62|± | 2.67|
+ |agieval_logiqa_en | 0|acc |41.47|± | 1.93|
+ | | |acc_norm|43.01|± | 1.94|
+ |agieval_lsat_ar | 0|acc |23.04|± | 2.78|
+ | | |acc_norm|23.48|± | 2.80|
+ |agieval_lsat_lr | 0|acc |51.57|± | 2.22|
+ | | |acc_norm|52.94|± | 2.21|
+ |agieval_lsat_rc | 0|acc |64.31|± | 2.93|
+ | | |acc_norm|64.68|± | 2.92|
+ |agieval_sat_en | 0|acc |79.13|± | 2.84|
+ | | |acc_norm|79.13|± | 2.84|
+ |agieval_sat_en_without_passage| 0|acc |43.20|± | 3.46|
+ | | |acc_norm|43.20|± | 3.46|
+ |agieval_sat_math | 0|acc |34.55|± | 3.21|
+ | | |acc_norm|32.27|± | 3.16|
+
+ Average: 45.29%
+
+ ### GPT4All
+ | Task |Version| Metric |Value| |Stderr|
+ |-------------|------:|--------|----:|---|-----:|
+ |arc_challenge| 0|acc |61.86|± | 1.42|
+ | | |acc_norm|64.51|± | 1.40|
+ |arc_easy | 0|acc |85.06|± | 0.73|
+ | | |acc_norm|82.45|± | 0.78|
+ |boolq | 1|acc |88.35|± | 0.56|
+ |hellaswag | 0|acc |68.04|± | 0.47|
+ | | |acc_norm|85.12|± | 0.36|
+ |openbookqa | 0|acc |37.80|± | 2.17|
+ | | |acc_norm|48.60|± | 2.24|
+ |piqa | 0|acc |83.08|± | 0.87|
+ | | |acc_norm|83.95|± | 0.86|
+ |winogrande | 0|acc |78.69|± | 1.15|
+
+ Average: 75.95%
+
+ ### TruthfulQA
+ | Task |Version|Metric|Value| |Stderr|
+ |-------------|------:|------|----:|---|-----:|
+ |truthfulqa_mc| 1|mc1 |44.55|± | 1.74|
+ | | |mc2 |60.86|± | 1.57|
+
+ Average: 60.86%
+
+ ### Bigbench
+ | Task |Version| Metric |Value| |Stderr|
+ |------------------------------------------------|------:|---------------------|----:|---|-----:|
+ |bigbench_causal_judgement | 0|multiple_choice_grade|58.95|± | 3.58|
+ |bigbench_date_understanding | 0|multiple_choice_grade|66.40|± | 2.46|
+ |bigbench_disambiguation_qa | 0|multiple_choice_grade|48.84|± | 3.12|
+ |bigbench_geometric_shapes | 0|multiple_choice_grade|22.56|± | 2.21|
+ | | |exact_str_match |13.37|± | 1.80|
+ |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|30.40|± | 2.06|
+ |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|20.57|± | 1.53|
+ |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|52.00|± | 2.89|
+ |bigbench_movie_recommendation | 0|multiple_choice_grade|44.40|± | 2.22|
+ |bigbench_navigate | 0|multiple_choice_grade|52.10|± | 1.58|
+ |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|69.75|± | 1.03|
+ |bigbench_ruin_names | 0|multiple_choice_grade|55.36|± | 2.35|
+ |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|23.65|± | 1.35|
+ |bigbench_snarks | 0|multiple_choice_grade|77.35|± | 3.12|
+ |bigbench_sports_understanding | 0|multiple_choice_grade|73.02|± | 1.41|
+ |bigbench_temporal_sequences | 0|multiple_choice_grade|46.80|± | 1.58|
+ |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|22.08|± | 1.17|
+ |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|19.03|± | 0.94|
+ |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|52.00|± | 2.89|
+
+ Average: 46.4%
+
+ Average score: 57.13%
+
+
100
  ## 🧩 Configuration
101
 
102
  ```yaml