Nice Leaderboard :)
Nice to see some more leaderboards. I feel like you can't really trust the Open LLM Leaderboard at this point, and they don't add any Phi-2 models except the Microsoft one because of the remote code requirement.
Could you add the following models?
Phi-2-dpo
Openchat v1+v2
Mistral v1+v2
Starling Alpha
I would be really interested in how Phi-2-dpo stacks up against dolphin-phi, and the other models would be great reference points since they have been very popular.
Thanks! Don't forget the new phixtral models.
Oh, and btw, gobruins 2.1.1 was flagged as contaminated on the Open LLM Leaderboard because it contains data from TruthfulQA:
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474#657e6a221e3e9c41a4a8ae23
I'll run these:
Openchat-1210 ("v2")
Mistral v1+v2
Yes, I noticed that. I think @gblazex wanted to compare its performance on the Open LLM Leaderboard vs. the Nous benchmark suite. I'll probably remove it.
Yes, TruthfulQA is part of the Nous suite. I wanted to see how it does on the rest.
No need for it to be on this leaderboard.
(This was the only flagged one that interested me, because a relatively large number of people liked the model card. It actually does well on other benchmarks like AGIEval.)
Cool leaderboard!
I'm glad to see dolphin-2_6-phi-2 up here; it feels capable, and it's cool to see it compared to phi-2. 3B models are pretty underrepresented on the Open LLM Leaderboard, with a big gap between phi-2 and MiniChat-2, so I'd also like to request stabilityai/stablelm-zephyr-3b. It isn't on the other board due to remote code; it doesn't feel as good as phi-2, but it is decent.
Also, Weyaxi/OpenHermes-2.5-neural-chat-v3-3-Slerp is really good and my go-to 7B, so I'd like to see how it performs here too.
A new OpenChat model just dropped; it would be a great addition ;)
openchat/openchat-3.5-0106
Hey @mlabonne, could you please add Microsoft's Phi-3-instruct?
I did this for you; the results can be found here: https://huggingface.co/spaces/CultriX/Alt_LLM_LeaderBoard?logs=build
Or in more detail: https://gist.github.com/CultriX-Github/63952ac9317c80e241c0337c31e53a13
Thanks :)
Sorry @Venkman42, I forgot to add it. Thanks @CultriX, I'm forking your gist :)
Don't tell anyone, but I think I forked about every single evaluation you ever ran on your leaderboard so I didn't have to run them myself. So all good :p!
Edit: Hey, btw, what happened to your leaderboard? It looks like all your scores are now way lower than they used to be (compare your board with mine). Did you change something? Maybe the tests you run? (Also, you have quite a few models added twice, which is absolutely not a big deal, except it ruins the pretty graphs at the bottom, which for me personally actually is a little bit of a big deal, but that's probably just me lol.)
Yeah, I removed TruthfulQA from the average score to make it more accurate. Where do you see duplicated models? Is it in Automerger's version?
Ah, it's because of the "Other" category. Just fixed it, thanks.
TruthfulQA is an awful benchmark; I believe HF is also thinking about removing it.
I liked your idea of being able to remove certain benchmarks or tests and have the table recalculate a new filtered average. But instead of making the choice for the user, I decided to implement a way so that people can choose which benchmarks they want to use in the table. Take a look and feel free to copy the idea if you like it :)!
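Something like this, as a minimal sketch (assuming the scores live in a pandas DataFrame with one column per benchmark; the column names and values below are just placeholders, not the real leaderboard data):

```python
import pandas as pd

# Placeholder scores; the real leaderboard loads these from its results files.
df = pd.DataFrame({
    "Model": ["model-a", "model-b"],
    "AGIEval": [44.1, 38.2],
    "GPT4All": [73.5, 71.0],
    "TruthfulQA": [55.9, 62.3],
    "Bigbench": [42.0, 39.7],
})

def filtered_average(scores: pd.DataFrame, selected: list[str]) -> pd.DataFrame:
    """Recompute the Average column over the user-selected benchmarks only."""
    out = scores[["Model"] + selected].copy()
    out["Average"] = out[selected].mean(axis=1)
    return out.sort_values("Average", ascending=False)

# e.g. rank the models without TruthfulQA
print(filtered_average(df, ["AGIEval", "GPT4All", "Bigbench"]))
```

Hooking the `selected` list up to a checkbox group in the UI then just reruns this on every change.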
Yeah that's a good idea, might steal it haha
might
edit: nvm I stole your entire leaderboard so I really can't say sh*t lol