Every time a new model is released topping 10+ leaderboards on 50+ benchmarks...
My brain goes... I'll wait for the LMSYS Chatbot Arena results!
User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and slow.
Now we have MixEval, a new open benchmark with a 96% correlation with LMSYS Chatbot Arena and human preferences.
It comes with MixEval (4k samples) and MixEval-Hard (1k samples).
You can use GPT-3.5-Turbo or an open-source model as the parser/judge (see the sketch below).
It takes less than 6% of the time and cost of MMLU.
As expected:
In open models: Qwen2 72B >> Llama 3 70B >> Mixtral 8x7B
In closed models: GPT-4o >> Claude 3 Opus >> Gemini Pro
Leaderboard: https://mixeval.github.io/
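For context on the parser/judge step, here is a minimal Python sketch of the idea: a cheap judge model grades a free-form answer against the reference answer. This is not MixEval's actual implementation; the prompt wording, scoring rule, and function name are my own assumptions.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(question: str, gold_answer: str, model_answer: str) -> bool:
    # Ask the judge model whether the free-form answer matches the reference.
    prompt = (
        "You are grading a benchmark response.\n"
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"Model answer: {model_answer}\n"
        "Reply with exactly 'CORRECT' or 'INCORRECT'."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # or an open model behind an OpenAI-compatible endpoint
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")

print(judge_answer("What is the capital of France?", "Paris", "It's Paris."))  # -> True

Because the judge only has to parse and match short answers, a small, cheap model is enough for this step.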