gagan3012 committed on
Commit 9fc026b
1 Parent(s): 06308e5

Update README.md

Files changed (1)
  1. README.md +407 -0
README.md CHANGED
@@ -30,4 +30,411 @@ parameters:
  value: [1, 0.5, 0.7, 0.3, 0]
  - value: 0.5
  dtype: bfloat16
  ```
+
+ # Dataset Card for Evaluation run of gagan3012/MetaModel
+
+ <!-- Provide a quick summary of the dataset. -->
+
+ Dataset automatically created during the evaluation run of model [gagan3012/MetaModel](https://huggingface.co/gagan3012/MetaModel) on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+
+ The dataset is composed of 63 configurations, each one corresponding to one of the evaluated tasks.
+
+ The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results.
+
+ An additional configuration "results" stores all the aggregated results of the run (and is used to compute and display the aggregated metrics on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)).
+
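Before picking a configuration, it can help to enumerate what the repo exposes. This is a minimal sketch using the standard `datasets` helpers; the config name `harness_winogrande_5` is just the example used later in this card, and the exact split names depend on which runs have been uploaded.

```python
from datasets import get_dataset_config_names, get_dataset_split_names

repo = "open-llm-leaderboard/details_gagan3012__MetaModel"

# One configuration per evaluated task (63 in total, per the card).
configs = get_dataset_config_names(repo)
print(len(configs))

# Splits of a single task config: one per run timestamp, with "train"
# pointing at the latest results (per the description above).
print(get_dataset_split_names(repo, "harness_winogrande_5"))
```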
+ To load the details from a run, you can, for instance, do the following:
+ ```python
+ from datasets import load_dataset
+ data = load_dataset("open-llm-leaderboard/details_gagan3012__MetaModel",
+                     "harness_winogrande_5",
+                     split="train")
+ ```
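The object returned above is a regular `datasets.Dataset`; its exact columns vary with the task and harness version, so a quick inspection (a sketch, not part of the original card) is a reasonable first step:

```python
# Row count and schema differ per task, so look before indexing into the data.
print(len(data), data.column_names)
print(data[0])  # first evaluated example for this task
```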
+
+ ## Latest results
+
+ These are the [latest results from run 2024-01-04T14:09:43.780941](https://huggingface.co/datasets/open-llm-leaderboard/details_gagan3012__MetaModel/blob/main/results_2024-01-04T14-09-43.780941.json) (note that there might be results for other tasks in the repo if successive evals didn't cover the same tasks; you can find each of them in the results and the "latest" split for each eval):
+
+ ```python
+ {
+     "all": {
+         "acc": 0.6664380298886512,
+         "acc_stderr": 0.031642195230944255,
+         "acc_norm": 0.6671639222858992,
+         "acc_norm_stderr": 0.03228745343467652,
+         "mc1": 0.5691554467564259,
+         "mc1_stderr": 0.01733527247533237,
+         "mc2": 0.7184177934834866,
+         "mc2_stderr": 0.014995634120330182
+     },
+     "harness|arc:challenge|25": {
+         "acc": 0.6843003412969283,
+         "acc_stderr": 0.013582571095815291,
+         "acc_norm": 0.7107508532423208,
+         "acc_norm_stderr": 0.01325001257939344
+     },
+     "harness|hellaswag|10": {
+         "acc": 0.7132045409281019,
+         "acc_stderr": 0.004513409114983828,
+         "acc_norm": 0.8844851623182632,
+         "acc_norm_stderr": 0.0031898897894046684
+     },
+     "harness|hendrycksTest-abstract_algebra|5": {
+         "acc": 0.43,
+         "acc_stderr": 0.049756985195624284,
+         "acc_norm": 0.43,
+         "acc_norm_stderr": 0.049756985195624284
+     },
+     "harness|hendrycksTest-anatomy|5": {
+         "acc": 0.6148148148148148,
+         "acc_stderr": 0.04203921040156279,
+         "acc_norm": 0.6148148148148148,
+         "acc_norm_stderr": 0.04203921040156279
+     },
+     "harness|hendrycksTest-astronomy|5": {
+         "acc": 0.743421052631579,
+         "acc_stderr": 0.0355418036802569,
+         "acc_norm": 0.743421052631579,
+         "acc_norm_stderr": 0.0355418036802569
+     },
+     "harness|hendrycksTest-business_ethics|5": {
+         "acc": 0.75,
+         "acc_stderr": 0.04351941398892446,
+         "acc_norm": 0.75,
+         "acc_norm_stderr": 0.04351941398892446
+     },
+     "harness|hendrycksTest-clinical_knowledge|5": {
+         "acc": 0.6830188679245283,
+         "acc_stderr": 0.02863723563980089,
+         "acc_norm": 0.6830188679245283,
+         "acc_norm_stderr": 0.02863723563980089
+     },
+     "harness|hendrycksTest-college_biology|5": {
+         "acc": 0.7638888888888888,
+         "acc_stderr": 0.03551446610810826,
+         "acc_norm": 0.7638888888888888,
+         "acc_norm_stderr": 0.03551446610810826
+     },
+     "harness|hendrycksTest-college_chemistry|5": {
+         "acc": 0.47,
+         "acc_stderr": 0.050161355804659205,
+         "acc_norm": 0.47,
+         "acc_norm_stderr": 0.050161355804659205
+     },
+     "harness|hendrycksTest-college_computer_science|5": {
+         "acc": 0.48,
+         "acc_stderr": 0.05021167315686781,
+         "acc_norm": 0.48,
+         "acc_norm_stderr": 0.05021167315686781
+     },
+     "harness|hendrycksTest-college_mathematics|5": {
+         "acc": 0.32,
+         "acc_stderr": 0.046882617226215034,
+         "acc_norm": 0.32,
+         "acc_norm_stderr": 0.046882617226215034
+     },
+     "harness|hendrycksTest-college_medicine|5": {
+         "acc": 0.6647398843930635,
+         "acc_stderr": 0.03599586301247077,
+         "acc_norm": 0.6647398843930635,
+         "acc_norm_stderr": 0.03599586301247077
+     },
+     "harness|hendrycksTest-college_physics|5": {
+         "acc": 0.38235294117647056,
+         "acc_stderr": 0.04835503696107223,
+         "acc_norm": 0.38235294117647056,
+         "acc_norm_stderr": 0.04835503696107223
+     },
+     "harness|hendrycksTest-computer_security|5": {
+         "acc": 0.75,
+         "acc_stderr": 0.04351941398892446,
+         "acc_norm": 0.75,
+         "acc_norm_stderr": 0.04351941398892446
+     },
+     "harness|hendrycksTest-conceptual_physics|5": {
+         "acc": 0.625531914893617,
+         "acc_stderr": 0.03163910665367291,
+         "acc_norm": 0.625531914893617,
+         "acc_norm_stderr": 0.03163910665367291
+     },
+     "harness|hendrycksTest-econometrics|5": {
+         "acc": 0.4824561403508772,
+         "acc_stderr": 0.04700708033551038,
+         "acc_norm": 0.4824561403508772,
+         "acc_norm_stderr": 0.04700708033551038
+     },
+     "harness|hendrycksTest-electrical_engineering|5": {
+         "acc": 0.6413793103448275,
+         "acc_stderr": 0.039966295748767186,
+         "acc_norm": 0.6413793103448275,
+         "acc_norm_stderr": 0.039966295748767186
+     },
+     "harness|hendrycksTest-elementary_mathematics|5": {
+         "acc": 0.5,
+         "acc_stderr": 0.025751310131230234,
+         "acc_norm": 0.5,
+         "acc_norm_stderr": 0.025751310131230234
+     },
+     "harness|hendrycksTest-formal_logic|5": {
+         "acc": 0.42857142857142855,
+         "acc_stderr": 0.0442626668137991,
+         "acc_norm": 0.42857142857142855,
+         "acc_norm_stderr": 0.0442626668137991
+     },
+     "harness|hendrycksTest-global_facts|5": {
+         "acc": 0.35,
+         "acc_stderr": 0.047937248544110196,
+         "acc_norm": 0.35,
+         "acc_norm_stderr": 0.047937248544110196
+     },
+     "harness|hendrycksTest-high_school_biology|5": {
+         "acc": 0.8129032258064516,
+         "acc_stderr": 0.022185710092252252,
+         "acc_norm": 0.8129032258064516,
+         "acc_norm_stderr": 0.022185710092252252
+     },
+     "harness|hendrycksTest-high_school_chemistry|5": {
+         "acc": 0.5073891625615764,
+         "acc_stderr": 0.035176035403610105,
+         "acc_norm": 0.5073891625615764,
+         "acc_norm_stderr": 0.035176035403610105
+     },
+     "harness|hendrycksTest-high_school_computer_science|5": {
+         "acc": 0.72,
+         "acc_stderr": 0.04512608598542128,
+         "acc_norm": 0.72,
+         "acc_norm_stderr": 0.04512608598542128
+     },
+     "harness|hendrycksTest-high_school_european_history|5": {
+         "acc": 0.8121212121212121,
+         "acc_stderr": 0.03050193405942914,
+         "acc_norm": 0.8121212121212121,
+         "acc_norm_stderr": 0.03050193405942914
+     },
+     "harness|hendrycksTest-high_school_geography|5": {
+         "acc": 0.8636363636363636,
+         "acc_stderr": 0.024450155973189835,
+         "acc_norm": 0.8636363636363636,
+         "acc_norm_stderr": 0.024450155973189835
+     },
+     "harness|hendrycksTest-high_school_government_and_politics|5": {
+         "acc": 0.8963730569948186,
+         "acc_stderr": 0.021995311963644244,
+         "acc_norm": 0.8963730569948186,
+         "acc_norm_stderr": 0.021995311963644244
+     },
+     "harness|hendrycksTest-high_school_macroeconomics|5": {
+         "acc": 0.6692307692307692,
+         "acc_stderr": 0.02385479568097114,
+         "acc_norm": 0.6692307692307692,
+         "acc_norm_stderr": 0.02385479568097114
+     },
+     "harness|hendrycksTest-high_school_mathematics|5": {
+         "acc": 0.37037037037037035,
+         "acc_stderr": 0.02944316932303154,
+         "acc_norm": 0.37037037037037035,
+         "acc_norm_stderr": 0.02944316932303154
+     },
+     "harness|hendrycksTest-high_school_microeconomics|5": {
+         "acc": 0.7142857142857143,
+         "acc_stderr": 0.029344572500634332,
+         "acc_norm": 0.7142857142857143,
+         "acc_norm_stderr": 0.029344572500634332
+     },
+     "harness|hendrycksTest-high_school_physics|5": {
+         "acc": 0.3708609271523179,
+         "acc_stderr": 0.03943966699183629,
+         "acc_norm": 0.3708609271523179,
+         "acc_norm_stderr": 0.03943966699183629
+     },
+     "harness|hendrycksTest-high_school_psychology|5": {
+         "acc": 0.8422018348623853,
+         "acc_stderr": 0.01563002297009246,
+         "acc_norm": 0.8422018348623853,
+         "acc_norm_stderr": 0.01563002297009246
+     },
+     "harness|hendrycksTest-high_school_statistics|5": {
+         "acc": 0.5740740740740741,
+         "acc_stderr": 0.03372343271653062,
+         "acc_norm": 0.5740740740740741,
+         "acc_norm_stderr": 0.03372343271653062
+     },
+     "harness|hendrycksTest-high_school_us_history|5": {
+         "acc": 0.8578431372549019,
+         "acc_stderr": 0.02450980392156862,
+         "acc_norm": 0.8578431372549019,
+         "acc_norm_stderr": 0.02450980392156862
+     },
+     "harness|hendrycksTest-high_school_world_history|5": {
+         "acc": 0.8565400843881856,
+         "acc_stderr": 0.022818291821017012,
+         "acc_norm": 0.8565400843881856,
+         "acc_norm_stderr": 0.022818291821017012
+     },
+     "harness|hendrycksTest-human_aging|5": {
+         "acc": 0.672645739910314,
+         "acc_stderr": 0.03149384670994131,
+         "acc_norm": 0.672645739910314,
+         "acc_norm_stderr": 0.03149384670994131
+     },
+     "harness|hendrycksTest-human_sexuality|5": {
+         "acc": 0.7557251908396947,
+         "acc_stderr": 0.03768335959728743,
+         "acc_norm": 0.7557251908396947,
+         "acc_norm_stderr": 0.03768335959728743
+     },
+     "harness|hendrycksTest-international_law|5": {
+         "acc": 0.7851239669421488,
+         "acc_stderr": 0.037494924487096966,
+         "acc_norm": 0.7851239669421488,
+         "acc_norm_stderr": 0.037494924487096966
+     },
+     "harness|hendrycksTest-jurisprudence|5": {
+         "acc": 0.8055555555555556,
+         "acc_stderr": 0.038260763248848646,
+         "acc_norm": 0.8055555555555556,
+         "acc_norm_stderr": 0.038260763248848646
+     },
+     "harness|hendrycksTest-logical_fallacies|5": {
+         "acc": 0.754601226993865,
+         "acc_stderr": 0.03380939813943354,
+         "acc_norm": 0.754601226993865,
+         "acc_norm_stderr": 0.03380939813943354
+     },
+     "harness|hendrycksTest-machine_learning|5": {
+         "acc": 0.4732142857142857,
+         "acc_stderr": 0.047389751192741546,
+         "acc_norm": 0.4732142857142857,
+         "acc_norm_stderr": 0.047389751192741546
+     },
+     "harness|hendrycksTest-management|5": {
+         "acc": 0.8446601941747572,
+         "acc_stderr": 0.035865947385739734,
+         "acc_norm": 0.8446601941747572,
+         "acc_norm_stderr": 0.035865947385739734
+     },
+     "harness|hendrycksTest-marketing|5": {
+         "acc": 0.8589743589743589,
+         "acc_stderr": 0.02280138253459753,
+         "acc_norm": 0.8589743589743589,
+         "acc_norm_stderr": 0.02280138253459753
+     },
+     "harness|hendrycksTest-medical_genetics|5": {
+         "acc": 0.7,
+         "acc_stderr": 0.046056618647183814,
+         "acc_norm": 0.7,
+         "acc_norm_stderr": 0.046056618647183814
+     },
+     "harness|hendrycksTest-miscellaneous|5": {
+         "acc": 0.8084291187739464,
+         "acc_stderr": 0.014072859310451949,
+         "acc_norm": 0.8084291187739464,
+         "acc_norm_stderr": 0.014072859310451949
+     },
+     "harness|hendrycksTest-moral_disputes|5": {
+         "acc": 0.7572254335260116,
+         "acc_stderr": 0.023083658586984204,
+         "acc_norm": 0.7572254335260116,
+         "acc_norm_stderr": 0.023083658586984204
+     },
+     "harness|hendrycksTest-moral_scenarios|5": {
+         "acc": 0.39664804469273746,
+         "acc_stderr": 0.016361354769822468,
+         "acc_norm": 0.39664804469273746,
+         "acc_norm_stderr": 0.016361354769822468
+     },
+     "harness|hendrycksTest-nutrition|5": {
+         "acc": 0.7581699346405228,
+         "acc_stderr": 0.024518195641879334,
+         "acc_norm": 0.7581699346405228,
+         "acc_norm_stderr": 0.024518195641879334
+     },
+     "harness|hendrycksTest-philosophy|5": {
+         "acc": 0.7202572347266881,
+         "acc_stderr": 0.025494259350694905,
+         "acc_norm": 0.7202572347266881,
+         "acc_norm_stderr": 0.025494259350694905
+     },
+     "harness|hendrycksTest-prehistory|5": {
+         "acc": 0.7777777777777778,
+         "acc_stderr": 0.02313237623454333,
+         "acc_norm": 0.7777777777777778,
+         "acc_norm_stderr": 0.02313237623454333
+     },
+     "harness|hendrycksTest-professional_accounting|5": {
+         "acc": 0.5035460992907801,
+         "acc_stderr": 0.02982674915328092,
+         "acc_norm": 0.5035460992907801,
+         "acc_norm_stderr": 0.02982674915328092
+     },
+     "harness|hendrycksTest-professional_law|5": {
+         "acc": 0.49478487614080835,
+         "acc_stderr": 0.012769541449652547,
+         "acc_norm": 0.49478487614080835,
+         "acc_norm_stderr": 0.012769541449652547
+     },
+     "harness|hendrycksTest-professional_medicine|5": {
+         "acc": 0.75,
+         "acc_stderr": 0.026303648393696036,
+         "acc_norm": 0.75,
+         "acc_norm_stderr": 0.026303648393696036
+     },
+     "harness|hendrycksTest-professional_psychology|5": {
+         "acc": 0.6813725490196079,
+         "acc_stderr": 0.018850084696468712,
+         "acc_norm": 0.6813725490196079,
+         "acc_norm_stderr": 0.018850084696468712
+     },
+     "harness|hendrycksTest-public_relations|5": {
+         "acc": 0.6818181818181818,
+         "acc_stderr": 0.04461272175910509,
+         "acc_norm": 0.6818181818181818,
+         "acc_norm_stderr": 0.04461272175910509
+     },
+     "harness|hendrycksTest-security_studies|5": {
+         "acc": 0.746938775510204,
+         "acc_stderr": 0.027833023871399677,
+         "acc_norm": 0.746938775510204,
+         "acc_norm_stderr": 0.027833023871399677
+     },
+     "harness|hendrycksTest-sociology|5": {
+         "acc": 0.8258706467661692,
+         "acc_stderr": 0.026814951200421603,
+         "acc_norm": 0.8258706467661692,
+         "acc_norm_stderr": 0.026814951200421603
+     },
+     "harness|hendrycksTest-us_foreign_policy|5": {
+         "acc": 0.91,
+         "acc_stderr": 0.028762349126466125,
+         "acc_norm": 0.91,
+         "acc_norm_stderr": 0.028762349126466125
+     },
+     "harness|hendrycksTest-virology|5": {
+         "acc": 0.5783132530120482,
+         "acc_stderr": 0.038444531817709175,
+         "acc_norm": 0.5783132530120482,
+         "acc_norm_stderr": 0.038444531817709175
+     },
+     "harness|hendrycksTest-world_religions|5": {
+         "acc": 0.7777777777777778,
+         "acc_stderr": 0.03188578017686398,
+         "acc_norm": 0.7777777777777778,
+         "acc_norm_stderr": 0.03188578017686398
+     },
+     "harness|truthfulqa:mc|0": {
+         "mc1": 0.5691554467564259,
+         "mc1_stderr": 0.01733527247533237,
+         "mc2": 0.7184177934834866,
+         "mc2_stderr": 0.014995634120330182
+     },
+     "harness|winogrande|5": {
+         "acc": 0.8342541436464088,
+         "acc_stderr": 0.010450899545370632
+     },
+     "harness|gsm8k|5": {
+         "acc": 0.6535253980288097,
+         "acc_stderr": 0.013107179054313398
+     }
+ }
+ ```
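For programmatic access to these aggregated numbers, the JSON file linked above can be pulled straight from the dataset repo; the repo id and filename below are taken from that link. This is a minimal sketch: whether the per-task metrics sit at the top level as shown in the snippet or under a "results" key can vary with the harness version, so the code falls back accordingly.

```python
import json

from huggingface_hub import hf_hub_download

# Download the raw results file referenced in the "Latest results" link above.
path = hf_hub_download(
    repo_id="open-llm-leaderboard/details_gagan3012__MetaModel",
    filename="results_2024-01-04T14-09-43.780941.json",
    repo_type="dataset",
)
with open(path) as f:
    raw = json.load(f)

# Metrics may be nested under a top-level "results" key depending on the
# harness version; otherwise assume the flat layout shown in the snippet.
metrics = raw.get("results", raw)

print("average acc: ", metrics["all"]["acc"])
print("ARC acc_norm:", metrics["harness|arc:challenge|25"]["acc_norm"])
print("GSM8K acc:   ", metrics["harness|gsm8k|5"]["acc"])
```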