acc or acc_norm? #106
opened by paopao0226
Hello, when testing on the ARC dataset there are two scores (acc and acc_norm), so which one does the leaderboard use?
In load_results.py in the auto_leaderboard files, everything but MMLU uses acc_norm:
# clone / pull the lmeh eval data
METRICS = ["acc_norm", "acc_norm", "acc_norm", "mc2"]
BENCHMARKS = ["arc_challenge", "hellaswag", "hendrycks", "truthfulqa_mc"]
BENCH_TO_NAME = {
    "arc_challenge": AutoEvalColumn.arc.name,
    "hellaswag": AutoEvalColumn.hellaswag.name,
    "hendrycks": AutoEvalColumn.mmlu.name,
    "truthfulqa_mc": AutoEvalColumn.truthfulqa.name,
}
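For illustration only (the leaderboard code may organize this differently): METRICS and BENCHMARKS are parallel lists, so pairing them positionally makes the benchmark-to-metric mapping explicit.

# Hypothetical sketch: zip the parallel lists into a lookup table.
BENCHMARK_TO_METRIC = dict(zip(BENCHMARKS, METRICS))
# -> {"arc_challenge": "acc_norm", "hellaswag": "acc_norm",
#     "hendrycks": "acc_norm", "truthfulqa_mc": "mc2"}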
@lilloukas OK, thanks! I think that answers it.
Hello @paopao0226, note that we just changed the metric we use for MMLU. The file now reads:
# clone / pull the lmeh eval data
METRICS = ["acc_norm", "acc_norm", "acc", "mc2"]
BENCHMARKS = ["arc:challenge", "hellaswag", "hendrycksTest", "truthfulqa:mc"]
BENCH_TO_NAME = {
    "arc:challenge": AutoEvalColumn.arc.name,
    "hellaswag": AutoEvalColumn.hellaswag.name,
    "hendrycksTest": AutoEvalColumn.mmlu.name,
    "truthfulqa:mc": AutoEvalColumn.truthfulqa.name,
}
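A minimal sketch of how these lists could be applied when parsing a results file; the get_score helper and the assumed results[benchmark][metric] layout are illustrative assumptions, not the leaderboard's actual parsing code.

# Hypothetical helper (assumes results are keyed as results[benchmark][metric]).
def get_score(results: dict, benchmark: str) -> float:
    # MMLU (hendrycksTest) now reads plain "acc"; ARC and HellaSwag
    # keep "acc_norm", and TruthfulQA uses "mc2".
    metric = dict(zip(BENCHMARKS, METRICS))[benchmark]
    return results[benchmark][metric]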
@SaylorTwift OK, so is this the latest version of what the leaderboard uses?
@paopao0226 Yes! :) You can find the details in the About tab if you need them.
clefourrier changed discussion status to closed