Benchmarks

#1 by ChuckMcSneed - opened

[image: benchmark results table]
No drop in performance; it performs better than the official 2411. Great job!
What the benchmarks don't show are the changes in style; 2.1 and 2.0 are more of the same family, while 2.2 is clearly different. Did you add a different dataset to the mix?

Wow. I was literally on HF just now looking for your benchmarks. Heard you did some Behemoth testing recently.

Thanks for benchmarking all of them.

I'm confused by all the mixed reviews: some say they hate/love v2.x, some find it worse/better than v1.x, etc. Feedback is all over the place. How was your experience with them during your testing?

What the benchmarks don't show are the changes in style; 2.1 and 2.0 are more of the same family, while 2.2 is clearly different. Did you add a different dataset to the mix?

It's the same 'creative' dataset I added in v1.1/v2.1, but amplified over the 'RP' dataset.
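In case "amplified" is unclear: one common reading is that the creative set is simply sampled more often than the RP set when the training mix is drawn. A tiny Python sketch of that idea (the dataset names and the 2:1 ratio are invented for illustration, not taken from the actual recipe):

```python
import random

# Two stand-in datasets; names and contents are made up for illustration.
creative = ["creative_sample_a", "creative_sample_b"]
rp = ["rp_sample_a", "rp_sample_b"]

# "Amplified over" read as: draw from the creative set more often than
# the RP set when assembling training examples, e.g. at a 2:1 ratio.
def draw_batch(k: int = 8) -> list[str]:
    sources = random.choices([creative, rp], weights=[2, 1], k=k)
    return [random.choice(src) for src in sources]

print(draw_batch())
```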

I'm confused by all the mixed reviews: some say they hate/love v2.x, some find it worse/better than v1.x, etc.

Must be 2411. Compared to 2407, it is

  1. more overconfident
     [image: overconfidence comparison]

  2. more censored
     [image: censorship comparison]

  3. significantly worse at writing poems (P column of my benchmark).

While you partially solved #2 (maybe it's still an issue, ask @DontPlanToEnd to test it) and #3, #1 still remains. Maybe people should loosen minP even more?
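For anyone unfamiliar with minP: it discards every token whose probability is below min_p times the top token's probability, so an overconfident model leaves very few candidates in the pool, and "loosening" means lowering that threshold. A minimal numpy sketch (the logit values are illustrative, not measured from any model):

```python
import numpy as np

def min_p_filter(logits: np.ndarray, min_p: float) -> np.ndarray:
    """Min-p filter: keep tokens whose probability is at least
    min_p times the probability of the most likely token."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    threshold = min_p * probs.max()
    # Zero out tokens below the threshold, then renormalize.
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

# An overconfident model piles probability onto one token, so a high
# min_p leaves almost nothing else to sample; loosening (lowering)
# min_p keeps more candidates alive.
logits = np.array([8.0, 5.0, 4.5, 4.0, 1.0])
print((min_p_filter(logits, 0.10) > 0).sum())  # tighter: 1 survivor
print((min_p_filter(logits, 0.02) > 0).sum())  # looser: 3 survivors
```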

How was your experience with them during your testing?

They were okay during benchmarking, but they really hated different prompt formats: the 2407-based ones had no issues with Alpaca, while these hate it. They wrote inferior poems, just like 2411, very repetitive. I haven't given them proper real-life testing yet to make a definitive judgement. Whatever Mistral did, it did not improve the model in all directions.
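For reference, these are the two prompt formats in question, written out as Python template strings. The Alpaca template is the standard one; the Mistral line follows the usual [INST] convention, though exact special tokens vary by tokenizer version, so treat it as a sketch:

```python
# Alpaca-style instruction template, which 2407-based tunes handled fine.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

# Mistral's native instruct format; 2411-era models are pickier about it.
MISTRAL_TEMPLATE = "<s>[INST] {instruction} [/INST]"

print(ALPACA_TEMPLATE.format(instruction="Write a four-line poem about rain."))
```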

How was your experience with them during your testing?

Less aggressive, and they make more spatial mistakes. Might just be confirmation bias, though.

Interesting stuff, especially the overconfidence metric.

Honestly, even my experience with v2.x doesn't feel as solid as v1.x. It's a shame; Mistral probably half-assed their Largestral update. I hope they don't continue to disappoint.

Thanks @ChuckMcSneed I'm honored to be part of your experiments and benchmarks 🤗

Would it be possible to get a bit more of an explanation of "doesn't feel as solid as v1.x"? I am trying to ascertain to what degree the model behaves worse than its predecessor.

Would it be possible to get a bit more of an explanation of "doesn't feel as solid as v1.x"? I am trying to ascertain to what degree the model behaves worse than its predecessor.

For me, it feels unavoidably sloppy, harder to wrangle, and it has slightly lost its creative magic. I've also seen it make small mistakes that would never happen in 2407 when it comes to following instructions.

For me, it feels unavoidably sloppy, harder to wrangle, and it has slightly lost its creative magic. I've also seen it make small mistakes that would never happen in 2407 when it comes to following instructions.

Strange, since the new Mistral was supposed to be better at instruction following. Or at least that's what I read in their description. I have yet to try it, though. I did try v1 (Q8) for a bit and found it really good at understanding emotions and world immersion (I create my own original worlds and characters, since I detest roleplaying in established literature and other forms of media).

For me, it feels unavoidably sloppy, harder to wrangle, and it has slightly lost its creative magic. I've also seen it make small mistakes that would never happen in 2407 when it comes to following instructions.

Strange, since the new Mistral was supposed to be better at instruction following. Or at least that's what I read in their description. I have yet to try it, though. I did try v1 (Q8) for a bit and found it really good at understanding emotions and world immersion (I create my own original worlds and characters, since I detest roleplaying in established literature and other forms of media).

I do sort of 'Lore on the Fly' RP stuff where there is a basic character and an opening scene. Then we just see where it takes us for 120k+ tokens.
So far I have only tried one finetune of stock 2411, and that's Behemoth 2.2. The creativity is off the charts, but it immediately deviated from the character and scene I created. I thoroughly enjoyed it, but I had to accept this was not the character I made or the opening scenario. I also had to accept there would be consistency issues with keeping track of the finer details.
50 messages in, if I told you what the opening scene was and what the characters were like, you would be very confused. In the same scenario on 2407-based Behemoth 1.1, you would have completely understood how we got to where we were.

Since there is no Behemoth 1.2, I don't know how much of that is 2411 vs. 2407 and how much is the tuning process.

I do sort of 'Lore on the Fly' RP stuff where there is a basic character and an opening scene. Then we just see where it takes us for 120k+ tokens.
So far I have only tried one finetune of stock 2411, and that's Behemoth 2.2. The creativity is off the charts, but it immediately deviated from the character and scene I created. I thoroughly enjoyed it, but I had to accept this was not the character I made or the opening scenario. I also had to accept there would be consistency issues with keeping track of the finer details.
50 messages in, if I told you what the opening scene was and what the characters were like, you would be very confused. In the same scenario on 2407-based Behemoth 1.1, you would have completely understood how we got to where we were.

Thank you for the explanation. I think it's really good to understand what the negative symptoms are. This would be the perfect case to test on both the vanilla model and the finetune, to see whether it's the tuning or whether the stock model behaves the same way.

@BigHuggyD

I had to accept this was not the character I made

Character development is a thing, bro.

@BigHuggyD

I had to accept this was not the character I made

Character development is a thing, bro.

True, true, but not right out of the gate. In this specific example, the assistant is told that they are suspicious of the user, and the user needs to earn their trust. With Monstral, for example, it took 50 messages to convince the assistant to trust the user. With Behemoth 2.2, they were adopting puppies and picking out furniture for their new life together within 5 messages (an exaggeration, but you get my point). I suspect if I were to try 1.1 or 2.1, I would have an experience closer to Monstral.
The 2.2 model is still a triumph in its own right, I think. Its style is quite unique; it just seems to suffer some tradeoffs.

I am very new to model merges and I don't even know how to test. But what about the version before Mistral 2411, the 2409 version? Have you tried finetuning that one to see if it's better?

I am very new to model merges and I don't even know how to test. But what about the version before Mistral 2411, the 2409 version? Have you tried finetuning that one to see if it's better?

Unless I am missing something, there was only a Small version of 2409, correct? No Large 123B?

UGI tests just got released!
[image: UGI leaderboard results]
...and 2.2 is the worst in writing. And at controversial stats. And on average. It is likely the censorship that's hurting it. What's also interesting to note is that the I/10 (intelligence) score went down for 2.x, while it stayed mostly the same for 1.x. Maybe 2411 needs different parameters during finetuning to avoid "lobotomizing" it compared to 2407?
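Purely as an illustration of what "different parameters" might mean in practice: learning rate, epoch count, and adapter size are the usual knobs when a finetune dulls its base model. Every value in this sketch is hypothetical, not a known-good recipe for 2411:

```python
# Hypothetical hyperparameter tweaks one might try when a finetune seems
# to "lobotomize" the newer base; all numbers are guesses for illustration.
config_2407 = {"learning_rate": 2e-5, "epochs": 2, "lora_rank": 64}
config_2411 = {
    "learning_rate": 1e-5,  # gentler updates to preserve base abilities
    "epochs": 1,            # fewer passes over the data
    "lora_rank": 32,        # smaller adapter, lighter touch
}
print(config_2411)
```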

...and 2.2 is the worst in writing.

2.2 did pretty well in Writing Style, but it performed poorly in the Writing column because it was less willing to write certain things.

Drumma dropped 1.2 with a 2407 base! Time to see how that stew tastes.

https://huggingface.co/TheDrummer/Behemoth-123B-v1.2
