Nice Leaderboard :)

#1
by Venkman42 - opened

Nice to see some more leaderboards. I feel like you can't really trust the Open LLM Leaderboard at this point, and they don't add any phi-2 models except the Microsoft one because of remote code.
Could you add the following models?
Phi-2-dpo
Openchat v1+v2
Mistral v1+v2
Starling Alpha

I would be really interested to see how phi-2-dpo stacks up against dolphin phi, and the other models would be great reference points since they have been very popular.

Owner

Thanks! I added Starling Alpha and Openchat thanks to @gblazex. Working on uploading more phi-2 models.

Thanks, don't forget the new phixtral models 😉

Oh and btw, gobruins 2.1.1 was flagged as contaminated on the Open LLM Leaderboard because it contains data for TruthfulQA:
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474#657e6a221e3e9c41a4a8ae23

Owner

Haha phixtral is cooking! Performance won't be that great with this version but it's a start.

Yes, I noticed that. I think @gblazex wanted to compare the performance on the Open LLM Leaderboard vs. Nous benchmark suite. I'll probably remove it.

I'll run these:
Openchat-1210 ("v2")
Mistral v1+v2

Yes, TruthfulQA is part of Nous. I wanted to see how it does on the rest.
No need for it to be on this leaderboard.

(This was the only flagged one that was interesting to me, because a relatively large number of people liked the model card. It actually does well on other benchmarks like AGIEval.)

๐Ÿ‘ Cool leaderboard!

I'm glad to see dolphin-2_6-phi-2 up here. It feels capable, and it's cool to see it compared to phi-2. 3B models are pretty underrepresented on the Open LLM board, with a big gap between phi-2 and MiniChat-2, so I'd also like to request stabilityai/stablelm-zephyr-3b. It isn't on the other board due to remote code. It doesn't feel as good as phi-2, but it is decent.

Also, Weyaxi/OpenHermes-2.5-neural-chat-v3-3-Slerp is really good and my go-to 7B, so I'd like to see how it performs here too.

New Openchat model just dropped, would be a great addition ;)
openchat/openchat-3.5-0106

@Venkman42

@mlabonne Hey, could you please add Microsoft's Phi-3-instruct?

I did this for you, results can be found here: https://huggingface.co/spaces/CultriX/Alt_LLM_LeaderBoard?logs=build

Or in more detail: https://gist.github.com/CultriX-Github/63952ac9317c80e241c0337c31e53a13

Thanks :)

Sorry @Venkman42, I forgot to add it. Thanks @CultriX, I'm forking your gist :)

Don't tell anyone but I think I forked about every single evaluation you ever ran on your leaderboard so I didn't have to run it myself. So all good :p!

Edit: Hey btw, what happened to your leaderboard? It looks like all your scores are now way lower than they used to be (compare your board with mine). Did you change something? The tests you run, maybe? (Also, you have quite a few models added twice, which is absolutely not a big deal except it ruins the pretty graphs at the bottom, which for me personally actually is a little bit of a big deal, but that's probably just me lol.)

Yeah I removed TruthfulQA from the Average score to make it more accurate. Where do you see duplicated models, is it in Automerger's version?

No, on this one. And ahh, check, that explains a lot! Any reason why you took that one out in particular?

Ah it's because of the "Other" category, just fixed it thanks.

TruthfulQA is an awful benchmark. I believe HF is also thinking of removing it.

I liked your idea of being able to remove certain benchmarks or tests and have the table recalculate a new filtered average. But instead of making the choice for the user, I decided to implement a way for people to choose which benchmarks they want to use in the table. Take a look and feel free to copy the idea if you like it :)!

https://huggingface.co/spaces/CultriX/Alt_LLM_LeaderBoard
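
For anyone curious, the "filtered average" idea could look roughly like the sketch below. This is not the actual code of either leaderboard: the function name `filtered_average` is made up for illustration, and the scores are placeholders, not real results for any model. It only shows the arithmetic of averaging the benchmarks a user keeps selected.

```python
from statistics import mean


def filtered_average(scores, selected):
    """Average only the benchmarks the user selected, rounded to 2 decimals.

    `scores` maps benchmark name -> score for one model.
    `selected` is the list of benchmark names the user left enabled.
    (Hypothetical helper, not taken from either leaderboard.)
    """
    chosen = [scores[name] for name in selected if name in scores]
    if not chosen:
        raise ValueError("no selected benchmark has a score")
    return round(mean(chosen), 2)


# Placeholder scores for an imaginary model (NOT real results).
example_scores = {
    "AGIEval": 40.0,
    "GPT4All": 70.0,
    "TruthfulQA": 60.0,
    "Bigbench": 30.0,
}

# Average over everything vs. average with TruthfulQA deselected.
all_four = filtered_average(
    example_scores, ["AGIEval", "GPT4All", "TruthfulQA", "Bigbench"]
)
without_truthfulqa = filtered_average(
    example_scores, ["AGIEval", "GPT4All", "Bigbench"]
)
print(all_four, without_truthfulqa)
```

With these placeholder numbers, dropping one high-scoring benchmark lowers the average, which is the same kind of shift you would see when a benchmark is taken out of an Average column.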

Yeah, that's a good idea, might steal it haha

might

edit: nvm I stole your entire leaderboard so I really can't say sh*t lol
