NEW! Open LLM Leaderboard 2023 fall update
We spent A YEAR of GPU time on the biggest update of the Open LLM Leaderboard yet!
With @SaylorTwift, we added 3 new benchmark metrics from the great EleutherAI harness and re-ran 2000+ models on them!
Why?
Our initial evaluations were multiple-choice Q/A datasets:
- MMLU, knowledge across many domains
- ARC, grade-school science questions
- HellaSwag, choosing the plausible next step in a list of actions
- TruthfulQA, logical fallacies and knowledge biases
So... mostly knowledge and some reasoning.
But we wanted:
- model creators to get more information on their models' capabilities
- model users to select models on metrics relevant to them
- leaderboard rankings to be fairer
How?
We added 3 harder evaluations covering new capabilities!
DROP
Questions on Wikipedia paragraphs. It requires both 1) reading comprehension to extract the relevant information and 2) reasoning steps (subtractions, additions, comparisons, counting or sorting, ...) to solve the questions. Many models struggle with it!
Contrary to previous evals, it is generative: the model is not just looking at suggested choices, but must actually generate its own answers. That makes it more relevant for studying the actual reasoning capabilities of models in unconstrained setups.

GSM8K
Diverse grade-school math problems. Math was a highly expected and requested new capability to study, with reason: current models have a lot of room to improve on math, and it's a very exciting research direction!

WinoGrande
Multiple-choice adversarial Winograd completion dataset.
Each example contains a blank that must be filled with one of two words; the model must select the most relevant one, and the opposite word drastically changes the meaning of the sentence.
It's a development of the historically significant Winograd Schema Challenge, for a long time one of the most difficult benchmarks ever!
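If you want to reproduce these scores locally, here is a minimal sketch of how the three new tasks can be run with the EleutherAI harness Python API. The model name, shot count, and batch size below are placeholders, and the exact `simple_evaluate` signature varies a bit between harness versions, so treat this as an illustration rather than our exact leaderboard configuration:

```python
# Minimal sketch: running the three new tasks with the EleutherAI
# lm-evaluation-harness. Assumes a harness version exposing
# evaluator.simple_evaluate; model name and shot count are placeholders.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                                   # HF transformers backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",   # example model (placeholder)
    tasks=["drop", "gsm8k", "winogrande"],               # the three new evals
    num_fewshot=5,                                       # the leaderboard uses task-specific shot counts
    batch_size=8,
)

# Print per-task metrics (F1/exact match for DROP, accuracy for GSM8K and WinoGrande).
for task, metrics in results["results"].items():
    print(task, metrics)
```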
What about the rankings?
- Pretrained model rankings almost did not change! Good models stay good, no matter the evals.
- Fine-tuned models saw many rerankings for 13B+, while IFT/RL models did not change that much, apart from the Hermes and Beluga/Trurl model families. We hope this will help show which fine-tuning strategies are best across tasks!
Diving very deep into these benchmarks
We've found interesting implementation questions (reminiscent of our blog post on MMLU: https://huggingface.co/blog/evaluating-mmlu-leaderboard).
Feel free to read more on it and join the discussion at https://github.com/EleutherAI/lm-evaluation-harness/issues/978 or here!
That's it!
We hope you find these new evals interesting, and learn more about your (favorite) models along the way!
Thank you very much for following the leaderboard. We'll keep upgrading it so it stays a useful resource for the community and further helps model progress.
Many special thanks to:
Awesome ❤️
Thank you all for doing this, and for keeping us clued in on model performance.
Is there at least one evaluation that checks the model's language proficiency across various languages?
I think it's very important for those of us in different countries.
Thanks for this improved leaderboard!
@Ostixe360
Hi!
We are planning to work on multilingual leaderboards with some partners in the coming months, but this is only at very early stages.
Being French myself, I 100% agree that we need to evaluate models on more than "just English".
However, in the meantime, you can look at the Upstage leaderboard for Korean capabilities, and the Open Compass one for Chinese capabilities.
Regardless of evaluation results, it is currently pretty hard to find models in your language or based on specific criteria like maximum VRAM size. I personally also tried Hugging Face's full-text model search, but it seems quite inefficient, unfortunately. While the original Hugging Face leaderboard does not allow you to filter by language, you can filter by it on this website: https://llm.extractum.io/list. Just left-click on the language column. It also queries the Hugging Face leaderboard's average model score for most models. Of course, those scores might be skewed since they are based on the English evaluations.
Thank you both for your responses.
Does the eval use the custom instruction format best suited to each model? For some models, such as Yi, using their custom instruction format usually produces much better results.
I notice some TruthfulQA scores are missing.
Just sort by worst scores to show them.
@HenkPoley Thank you for reporting! It was a display problem, should be fixed! :)
Mistral Dolphin 7B seems to be missing (the 2.0, 2.1, etc. versions), as if it was deleted or made private or something. Who makes those kinds of decisions? I sometimes notice such models reappear again soon after. Are they just being re-tested, or was there a possible flaw in the benchmark? Thanks.
@Goldenblood56 They should still appear when you select the "Show gated/deleted/..." checkbox - I'm investigating why this happened. If they don't, please open a dedicated issue so we can keep track.
If the ability of a model to be used as an agent can be judged through a metric, please add that metric as well: how good models are at choosing the correct tool for a task, parsing their own output, and adjusting their output to be sent as input to the next step.
Great idea!
@clefourrier
I've noticed an issue in the DROP implementation by EleutherAI at commit hash b281b0921b636bc36ad05c0b0b0763bd6dd43463. By default, all models continue generating text until the first "." (see this line), so without any filtering, the F1 metrics are computed on overly lengthy generated texts. For example, for the first dataset example, Mistral 7B generates `10\n\nPassage: The 2006-07 season was the 10th season for the New Orleans Hornets in the National Basketball Association`. Considering typical LLM behavior, we should filter the answer on `Passage:` and calculate the scores using `10` (which is the correct answer, by the way) instead of the entire generated text. Please see a similar filter used for GSM8K by EleutherAI here.
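To make the suggestion concrete, here is a rough sketch of the kind of post-processing filter I mean. This is my own illustration, not the harness's code; the helper name and the extra blank-line cutoff are assumptions:

```python
import re

def truncate_drop_answer(generation: str) -> str:
    """Keep only the short answer at the start of the generation:
    cut everything from the first 'Passage:' onward, also stop at the
    first blank line, then strip surrounding whitespace."""
    answer = re.split(r"\bPassage:", generation, maxsplit=1)[0]
    answer = answer.split("\n\n", 1)[0]
    return answer.strip()

# The Mistral 7B example from above: the filter recovers "10",
# which is what the F1 metric should be computed on.
generation = (
    "10\n\nPassage: The 2006-07 season was the 10th season "
    "for the New Orleans Hornets in the National Basketball Association"
)
print(truncate_drop_answer(generation))  # -> 10
```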
@binhtang
(and cc @Phil337 since you've been concerned about this too)
Thank you for your report!
We've spent the last few weeks investigating the DROP scores in more detail, and found concerning things in the scoring: what you just highlighted is not the only problem in the DROP metrics.
We'll publish a blog post about it very soon and update the leaderboard accordingly.
If you visit https://llm.extractum.io/list/?lbonly, you'll find a comprehensive list of our top models, along with a whole host of other parameters and metrics. Clicking on a specific model allows you to delve deeper into its internals and parameters, including its performance on other benchmarks.
@gregzem
Nice visualization! What do the icons next to the model names correspond to?