macadeliccc committed
Commit 0829a4c
1 Parent(s): 46408d1

Update README.md

Files changed (1)
  1. README.md +80 -16
README.md CHANGED
@@ -33,19 +33,83 @@ Time taken: 25.5 mins
 ---------------
 
 
- **Evaluated in 4bit**
-
- | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
- |-------------|-------|------|-----:|--------|-----:|---|-----:|
- |arc_challenge|Yaml |none | 0|acc |0.5853|± |0.0144|
- | | |none | 0|acc_norm|0.6126|± |0.0142|
- |arc_easy |Yaml |none | 0|acc |0.8077|± |0.0081|
- | | |none | 0|acc_norm|0.7715|± |0.0086|
- |boolq |Yaml |none | 0|acc |0.8630|± |0.0060|
- |hellaswag |Yaml |none | 0|acc |0.6653|± |0.0047|
- | | |none | 0|acc_norm|0.8498|± |0.0036|
- |openbookqa |Yaml |none | 0|acc |0.3460|± |0.0213|
- | | |none | 0|acc_norm|0.4660|± |0.0223|
- |piqa |Yaml |none | 0|acc |0.7835|± |0.0096|
- | | |none | 0|acc_norm|0.7851|± |0.0096|
- |winogrande |Yaml |none | 0|acc |0.7277|± |0.0125|
+ | Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
+ |-----------------------------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
+ |[SOLAR-10.7b-Instruct-truthy-dpo](https://huggingface.co/macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo)| 48.69| 73.82| 76.81| 45.71| 61.26|
+
+ ### AGIEval
+ | Task |Version| Metric |Value| |Stderr|
+ |------------------------------|------:|--------|----:|---|-----:|
+ |agieval_aqua_rat | 0|acc |27.95|± | 2.82|
+ | | |acc_norm|27.95|± | 2.82|
+ |agieval_logiqa_en | 0|acc |42.40|± | 1.94|
+ | | |acc_norm|42.24|± | 1.94|
+ |agieval_lsat_ar | 0|acc |25.65|± | 2.89|
+ | | |acc_norm|23.91|± | 2.82|
+ |agieval_lsat_lr | 0|acc |54.12|± | 2.21|
+ | | |acc_norm|54.51|± | 2.21|
+ |agieval_lsat_rc | 0|acc |69.89|± | 2.80|
+ | | |acc_norm|69.89|± | 2.80|
+ |agieval_sat_en | 0|acc |80.10|± | 2.79|
+ | | |acc_norm|80.10|± | 2.79|
+ |agieval_sat_en_without_passage| 0|acc |50.00|± | 3.49|
+ | | |acc_norm|49.51|± | 3.49|
+ |agieval_sat_math | 0|acc |42.27|± | 3.34|
+ | | |acc_norm|41.36|± | 3.33|
+
+ Average: 48.69%
+
+ ### GPT4All
+ | Task |Version| Metric |Value| |Stderr|
+ |-------------|------:|--------|----:|---|-----:|
+ |arc_challenge| 0|acc |59.90|± | 1.43|
+ | | |acc_norm|63.91|± | 1.40|
+ |arc_easy | 0|acc |80.85|± | 0.81|
+ | | |acc_norm|78.16|± | 0.85|
+ |boolq | 1|acc |88.20|± | 0.56|
+ |hellaswag | 0|acc |68.34|± | 0.46|
+ | | |acc_norm|86.39|± | 0.34|
+ |openbookqa | 0|acc |37.60|± | 2.17|
+ | | |acc_norm|46.80|± | 2.23|
+ |piqa | 0|acc |78.84|± | 0.95|
+ | | |acc_norm|78.78|± | 0.95|
+ |winogrande | 0|acc |74.51|± | 1.22|
+
+ Average: 73.82%
+
+ ### TruthfulQA
+ | Task |Version|Metric|Value| |Stderr|
+ |-------------|------:|------|----:|---|-----:|
+ |truthfulqa_mc| 1|mc1 |61.81|± | 1.70|
+ | | |mc2 |76.81|± | 1.42|
+
+ Average: 76.81%
+
+ ### Bigbench
+ | Task |Version| Metric |Value| |Stderr|
+ |------------------------------------------------|------:|---------------------|----:|---|-----:|
+ |bigbench_causal_judgement | 0|multiple_choice_grade|50.53|± | 3.64|
+ |bigbench_date_understanding | 0|multiple_choice_grade|63.14|± | 2.51|
+ |bigbench_disambiguation_qa | 0|multiple_choice_grade|47.67|± | 3.12|
+ |bigbench_geometric_shapes | 0|multiple_choice_grade|26.18|± | 2.32|
+ | | |exact_str_match | 0.00|± | 0.00|
+ |bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|28.60|± | 2.02|
+ |bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|21.29|± | 1.55|
+ |bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|47.33|± | 2.89|
+ |bigbench_movie_recommendation | 0|multiple_choice_grade|39.80|± | 2.19|
+ |bigbench_navigate | 0|multiple_choice_grade|63.80|± | 1.52|
+ |bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|59.05|± | 1.10|
+ |bigbench_ruin_names | 0|multiple_choice_grade|40.18|± | 2.32|
+ |bigbench_salient_translation_error_detection | 0|multiple_choice_grade|46.69|± | 1.58|
+ |bigbench_snarks | 0|multiple_choice_grade|65.19|± | 3.55|
+ |bigbench_sports_understanding | 0|multiple_choice_grade|72.41|± | 1.42|
+ |bigbench_temporal_sequences | 0|multiple_choice_grade|60.30|± | 1.55|
+ |bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|25.76|± | 1.24|
+ |bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|17.43|± | 0.91|
+ |bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|47.33|± | 2.89|
+
+ Average: 45.71%
+
+ Average score: 61.26%
+
+ Elapsed time: 02:16:03
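
The final "Average score" in the diff above is the unweighted mean of the four per-benchmark averages (AGIEval, GPT4All, TruthfulQA, Bigbench). A minimal sketch of that roll-up, with the values copied from the summary table (the dictionary and variable names are illustrative, not part of the original eval script):

```python
# Benchmark averages as reported in the README's summary table.
benchmark_averages = {
    "AGIEval": 48.69,
    "GPT4All": 73.82,
    "TruthfulQA": 76.81,
    "Bigbench": 45.71,
}

# Overall score = unweighted mean of the four benchmark averages.
overall = sum(benchmark_averages.values()) / len(benchmark_averages)
print(f"Average score: {overall:.2f}%")  # Average score: 61.26%
```

Note that each per-benchmark average is itself a mean over that suite's task scores (e.g. the GPT4All figure matches the mean of the per-task values, preferring `acc_norm` where reported), so small rounding differences against hand computation are expected.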