Are benchmark scores normalised to a baseline?

#2
by Abulaphia - opened

In the documentation, I see a reference to a baseline model for GSM8K. Are the scores for models on the archived leaderboard raw scores, or are they normalised in some way, for example relative to a baseline? If the latter, is there somewhere I can find details on the methodology?

Open LLM Leaderboard Archive org

Hi! The scores here are all raw; we only added normalisation in v2 :)
The baseline scores (the row named "baseline") were taken, for each benchmark, from the paper that introduced it.
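
For readers wondering what the difference looks like in practice, here is a minimal sketch of one common way to normalise a raw score against a baseline: rescale it so the baseline maps to 0 and the maximum maps to 100, and floor anything below the baseline at 0. The function name `normalise_to_baseline` and the example numbers are hypothetical, and this is an assumption for illustration, not necessarily the exact procedure used in v2.

```python
def normalise_to_baseline(raw: float, baseline: float, maximum: float = 100.0) -> float:
    """Rescale a raw score so that `baseline` maps to 0 and `maximum` maps to 100.

    Hypothetical helper for illustration only; not taken from the leaderboard code.
    """
    if maximum <= baseline:
        raise ValueError("maximum must be greater than baseline")
    # Scores at or below the baseline carry no signal, so floor them at 0.
    if raw <= baseline:
        return 0.0
    return (raw - baseline) / (maximum - baseline) * 100.0


# Made-up example: a benchmark whose baseline score is 25 out of 100.
raw_score = 62.5
print(normalise_to_baseline(raw_score, baseline=25.0))  # 50.0
```

A raw score, by contrast, is just the reported metric with no rescaling, which is what the archived leaderboard shows.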

clefourrier changed discussion status to closed
