singhsidhukuldeep posted an update Jun 7
Every time a new model is released that tops 10+ leaderboards across 50+ benchmarks... 🚀

My brain goes... I will wait for the LMSYS Chatbot Arena results! πŸ€”

User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and slow. 🐒

Now we have MixEval, a new open benchmark with a 96% correlation to LMSYS Chatbot Arena rankings, i.e., human preferences. 🎯
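For intuition on what "96% correlation" means here, a minimal sketch of how a benchmark's agreement with Arena rankings can be measured. All model names and numbers below are made up for illustration, and Spearman rank correlation is just one plausible choice of metric:

```python
# Sketch: measuring how well a benchmark's scores track Arena Elo ratings.
# The data below is hypothetical, NOT real MixEval or Chatbot Arena numbers.
from scipy.stats import spearmanr

arena_elo = {       # hypothetical Arena Elo ratings
    "model_a": 1287,
    "model_b": 1248,
    "model_c": 1213,
    "model_d": 1114,
}
benchmark_score = { # hypothetical benchmark accuracies (%)
    "model_a": 88.1,
    "model_b": 84.0,
    "model_c": 79.5,
    "model_d": 70.2,
}

models = list(arena_elo)
rho, p = spearmanr([arena_elo[m] for m in models],
                   [benchmark_score[m] for m in models])
print(f"rank correlation = {rho:.2f}")  # 1.00 for this toy data
```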

It comes in two versions: MixEval (4k samples) and MixEval-Hard (1k samples) 📊

It can use GPT-3.5-Turbo or open-source models as the parser/judge 🤖
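A minimal sketch of the LLM-as-parser/judge idea: a cheap judge model decides whether a free-form answer matches the ground truth. The prompt and YES/NO grading scheme here are my own simplification, not MixEval's actual pipeline:

```python
# Sketch: using a cheap LLM to grade free-form answers against a reference.
# The prompt is a simplified stand-in for MixEval's real grading setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, reference: str, model_answer: str) -> bool:
    """Ask a small judge model whether the answer matches the reference."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {model_answer}\n"
        "Does the model answer match the reference? Reply with exactly YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # a capable open-source judge works too
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

print(judge("What is the capital of France?", "Paris", "It's Paris."))  # True
```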

It takes less than 6% of the time and cost of MMLU πŸ’Έ

As expected:
In open models: Qwen2 72B >> Llama 3 70B >> Mixtral 8x7B πŸ”
In closed models: GPT-4o >> Claude 3 Opus >> Gemini Pro 🔒

Leaderboard: https://mixeval.github.io/ πŸ“ˆ