A bit of reconciliation
Hi @clefourrier,
I am tagging you for two major things IMO:
- It would be great if there could be some kind of reconciliation between what is reported in the Leaderboard and in papers for "mainstream" models like Mistral 7B, Phi-2, etc. For instance, the Mistral 7B paper reports 60.1 vs. HF's 64.16 for MMLU.
- It would be great if we could make sure to include the mainstream foundation models. I can't find the original Llama2 models in the Leaderboard
Cheers and thanks again for the great work.
Hi @vince62s,
Thanks for the issue!
For 1, our evaluation results are completely reproducible (see our About tab), using the specific setup of the Harness, contrary to a lot of results reported in technical reports, which do not explain how exactly the models were evaluated. Differences in scores mostly reflect differences in prompting/evaluation setups. Since our goal is to provide one single, reproducible way to evaluate models, this is not something we want to change. If you want to check a number yourself, see the sketch at the end of this message.
For 2, llama-2 is actually in the leaderboard; it's just listed under "deleted" because of an access token problem (it's a gated model). I'm fixing it, sorry about that.
Does that answer your points?
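As a rough pointer, here is a minimal sketch of a local reproduction using the harness's Python API (assuming a recent `lm-eval` install; the model name, task label, and batch size are illustrative, and the leaderboard pins its own harness commit and settings, so the exact score may differ):

```python
# Minimal sketch: reproduce a 5-shot MMLU score locally with
# EleutherAI's lm-evaluation-harness (pip install lm-eval).
# The leaderboard pins a specific harness commit and prompt setup,
# so treat this as an approximation rather than the official pipeline.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                         # Hugging Face causal-LM backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",  # model to compare against its paper
    tasks=["mmlu"],                                     # task name in recent harness releases
    num_fewshot=5,                                      # the leaderboard evaluates MMLU 5-shot
    batch_size=8,
)

# Aggregated per-task metrics; compare these with the technical report's numbers.
print(results["results"])
```

Differences between this kind of output and a paper's table usually come down to prompt formatting, few-shot selection, and answer-extraction choices rather than the model itself.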
Thanks for the quick response. Since you guys are in close contact with those folks, wouldn't it make sense to reach out to them and understand the methodology that leads to a 4-point drop in MMLU? It's a big gap.
I'll check llama2 when it's up.
btw I don't see mistral-instruct-v0.2 or mixtral (the legacy ones) either.
Hi @vince62s,
Have you taken a look at the FAQ? It's in the About tab, and I feel like it could answer a number of your questions (regarding duplicates, for example).
The Mixtral models are flagged at the moment because of incorrect metadata. If you want to be sure you are displaying all the models, don't forget to select all the available checkboxes :)