Average column values
Hi,
How is the Average column being calculated? When I calculate it manually, I get a slightly different result from the values in that column...
We average the full normalized scores - but since we only display a couple of decimals, I expect you could get differences due to rounding errors.
You can check contents for the full results
I already tried averaging the full normalized scores, but the result is fairly far off.
Taking "vicgalle/Roleplay-Llama-3-8B" for example:
from statistics import mean

# One leaderboard row; fields not relevant to the average are omitted.
eval = {'eval_name': 'vicgalle_Roleplay-Llama-3-8B_float16',
        'Average ⬆️': 24.3287788759506, 'IFEval': 73.20221456845613, 'BBH': 28.554603909240623,
        'MATH Lvl 5': 8.685800604229607, 'GPQA': 1.4541387024608499, 'MUSR': 1.6773437499999992, 'MMLU-PRO': 30.093823877068555,
        "Maintainer's Highlight": False}
# Mean of the six normalized benchmark scores.
mean(
    [
        eval["IFEval"],
        eval["BBH"],
        eval["MATH Lvl 5"],
        eval["GPQA"],
        eval["MUSR"],
        eval["MMLU-PRO"],
    ]
)
I get: 23.944654235242627
The average column reads: 24.33 (24.3287788759506)
What am I missing?
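As a quick check on the rounding explanation, the sketch below compares the mean of the full-precision scores with the mean of the same scores rounded to two decimals, as displayed. The variable names are illustrative; the numbers are the ones quoted in this post. Rounding each score to two decimals can move the mean by at most 0.005, so it cannot account for a gap of about 0.38.

```python
from statistics import mean

# Full-precision normalized scores quoted in the post above.
scores = [
    73.20221456845613,   # IFEval
    28.554603909240623,  # BBH
    8.685800604229607,   # MATH Lvl 5
    1.4541387024608499,  # GPQA
    1.6773437499999992,  # MUSR
    30.093823877068555,  # MMLU-PRO
]
reported_average = 24.3287788759506  # value shown in the Average column

full_mean = mean(scores)
# Mean of the scores as displayed on the leaderboard (two decimals).
displayed_mean = mean([round(s, 2) for s in scores])

rounding_gap = abs(full_mean - displayed_mean)
reported_gap = abs(reported_average - full_mean)

print(f"mean of full-precision scores: {full_mean:.6f}")
print(f"mean of two-decimal scores:    {displayed_mean:.6f}")
print(f"gap explained by rounding:     {rounding_gap:.6f}")
print(f"gap to the Average column:     {reported_gap:.6f}")
```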
This is extremely weird indeed. Tagging @alozowski for reference - we're investigating asap.
Thanks for reporting!
Hi @Stark2008,
Thanks a lot for noticing this! We accidentally calculated the average from the sum of all values, including the raw ones. I've fixed this, so all the scores are now correct.
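The numbers in the thread are consistent with this explanation. The sketch below is an inference, not the leaderboard's actual code: it assumes the buggy average was the sum of all normalized and raw scores divided by the number of benchmarks (6), and that raw scores lie on a 0 to 1 scale. Under those assumptions, the raw scores implied by the reported average sum to roughly 2.3, which is a plausible total for six values between 0 and 1.

```python
# Full-precision normalized scores for vicgalle/Roleplay-Llama-3-8B,
# as quoted earlier in the thread.
normalized = {
    "IFEval": 73.20221456845613,
    "BBH": 28.554603909240623,
    "MATH Lvl 5": 8.685800604229607,
    "GPQA": 1.4541387024608499,
    "MUSR": 1.6773437499999992,
    "MMLU-PRO": 30.093823877068555,
}
reported_average = 24.3287788759506  # Average column before the fix

n_benchmarks = len(normalized)
normalized_sum = sum(normalized.values())

# Correct average: mean of the normalized scores only.
correct_average = normalized_sum / n_benchmarks

# ASSUMPTION: buggy average = (normalized_sum + raw_sum) / n_benchmarks.
# Solving for raw_sum gives the total the raw scores would need to have.
implied_raw_sum = reported_average * n_benchmarks - normalized_sum

print(f"correct average:       {correct_average:.4f}")
print(f"implied raw-score sum: {implied_raw_sum:.4f}")
```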
Happy to help, @alozowski :)
Funny and ironic how that happened after explicitly declaring that the average would not be calculated using raw output scores anymore 😅