Every time a new model is released topping 10+ leaderboards on 50+ benchmarks...
My brain goes... I'll wait for the LMSYS Chatbot Arena results!
User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and slow.
Now we have MixEval, a new open benchmark with a 96% correlation with LMSYS Chatbot Arena and human preferences.
It comes with MixEval (4k samples) and MixEval-Hard (1k samples).
You can use GPT-3.5-Turbo or an open-source model as the parser/judge (see the sketch below).
It takes less than 6% of the time and cost of MMLU.
As expected:
In open models: Qwen2 72B >> Llama 3 70B >> Mixtral 8x7B
In closed models: GPT-4o >> Claude 3 Opus >> Gemini Pro
Leaderboard: https://mixeval.github.io/
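For context on the parser/judge step, here is a minimal Python sketch of the idea: a cheap judge model grades a free-form answer against the reference answer. This is not MixEval's actual implementation; the prompt wording, scoring rule, and function name are my own assumptions.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(question: str, gold_answer: str, model_answer: str) -> bool:
    # Ask the judge model whether the free-form answer matches the reference.
    prompt = (
        "You are grading a benchmark response.\n"
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"Model answer: {model_answer}\n"
        "Reply with exactly 'CORRECT' or 'INCORRECT'."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # or an open model behind an OpenAI-compatible endpoint
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")

print(judge_answer("What is the capital of France?", "Paris", "It's Paris."))  # -> True

Because the judge only has to parse and match short answers, a small, cheap model is enough for this step.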