Surprise result!
@sthenno, @CultriX, I think you'll want to see this. I made this merge because I felt Lamarck hadn't integrated DeepSeek R1 enough, and a model_stock would make the MUSR pop. That's not what happened. Most scores fell slightly towards the average, but look at the MATH.
It appears that R1 and Qwenvergence v9 (hence DRT) are clashing on MUSR, but a model_stock shows where they are synergistic on MATH.
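If anyone wants to try something similar, a model_stock recipe of this kind looks roughly like the sketch below. This is not my exact config; treat the base model and checkpoint list as illustrative.

```python
# Not my exact config, just the shape of a model_stock recipe in mergekit's
# YAML format; the base model and checkpoint names here are illustrative.
from pathlib import Path

MODEL_STOCK_SKETCH = """\
merge_method: model_stock
base_model: Qwen/Qwen2.5-14B                         # assumed base
models:
  - model: sometimesanotion/Lamarck-14B-v0.7
  - model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
  - model: sometimesanotion/Qwenvergence-14B-v9
dtype: bfloat16
"""

Path("model_stock_sketch.yaml").write_text(MODEL_STOCK_SKETCH)
# With mergekit installed, something like `mergekit-yaml model_stock_sketch.yaml ./merged`
# runs it; check the mergekit docs for the exact flags.
```

Part of why I like model_stock as a first stage: as far as I understand it, it derives its own interpolation weights from each model's offset relative to the base, so there's nothing per-model to hand-tune.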
Amazing! But I got a lot of confusions in MATH. See: https://huggingface.co/bamec66557/Qwen-2.5-14B-MINUS/discussions/1#6792f65509f4f9090f0c62bd
I made this Space to try and make sense of it but it's hard haha: https://huggingface.co/spaces/CultriX/Tiny-LeaderBoard
(Note: the heatmap/plot is useful, but it also lets you easily scrape mergekit configurations from other models if they have them on their model page. What I used to do was feed ChatGPT those configurations together with extensive benchmark results, then keep tweaking its replies and my prompts until it came up with a merge idea I thought could really work. And they did work: that tactic landed me the number one spot, but I've since been overtaken by you guys and I can't seem to catch up with it anymore ;) )
I do believe that DeepSeek might be interesting to play around with, though!
Edit: Even though it doesn't rank the highest here, my actual experience just talking to it makes Hyperionv3 my favorite of my own models. I'm pretty happy with how it performs! :)
Edit2: It's fun to see that Lamarck, the model that currently ranks best, is actually about average on these benchmarks... It figures that the task really matters when judging a model (for example, the benchmarks on the current open-llm-leaderboard tend to be more complex, and it seems to outperform there, while on simpler tasks it's just average. So fascinating!)
Also, a fun little side project: tally-voting to generate datasets for preference finetuning (a sketch of the aggregation step follows the log and links below). Even though Wernickev3 is far from the best on the leaderboard, it did better than the others here:
=== Processing Prompt 100/100 ===
2025-01-27 23:07:39,896 - INFO - Processing Prompt 100/100: Explain the concept of "fake news" in a short overview.
2025-01-27 23:08:03,241 - INFO - Evaluator 'groq-llama3-70b-8192' returned rankings: {'groq-mixtral-8x7b-32768': 95, 'gpt-4o': 92, 'hyperionv3': 80, 'brocav3': 85, 'brocav9': 88, 'wernickev3': 90}
2025-01-27 23:08:04,163 - INFO - Evaluator 'gpt-3.5-turbo-16k' returned rankings: {'groq-mixtral-8x7b-32768': 95, 'gpt-4o': 90, 'hyperionv3': 70, 'brocav3': 85, 'brocav9': 80, 'wernickev3': 90}
2025-01-27 23:08:04,316 - INFO - Evaluator 'gpt-4o-mini' returned rankings: {'groq-mixtral-8x7b-32768': 90, 'gpt-4o': 95, 'hyperionv3': 85, 'brocav3': 88, 'brocav9': 92, 'wernickev3': 91}
2025-01-27 23:08:04,316 - INFO - Average Scores: {'groq-mixtral-8x7b-32768': 93.33333333333333, 'gpt-4o': 92.33333333333333, 'hyperionv3': 78.33333333333333, 'brocav3': 86.0, 'brocav9': 86.66666666666667, 'wernickev3': 90.33333333333333}
2025-01-27 23:08:04,317 - INFO - Best: groq-mixtral-8x7b-32768 -> 93.33333333333333
2025-01-27 23:08:04,317 - INFO - Worst: hyperionv3 -> 78.33333333333333
2025-01-27 23:08:04,317 - INFO - Average across all models for this prompt: 87.83333333333333
2025-01-27 23:08:04,317 - INFO - Total Combined Scores for Prompt 100: {'groq-mixtral-8x7b-32768': 280, 'gpt-4o': 277, 'brocav9': 260, 'wernickev3': 271, 'brocav3': 258, 'hyperionv3': 235}
2025-01-27 23:08:04,318 - INFO - Current Cumulative Leaderboard after this prompt:
2025-01-27 23:08:04,318 - INFO - 1. groq-mixtral-8x7b-32768 -> 27380.00
2025-01-27 23:08:04,318 - INFO - 2. gpt-4o -> 27286.00
2025-01-27 23:08:04,319 - INFO - 3. wernickev3 -> 26859.00
2025-01-27 23:08:04,319 - INFO - 4. hyperionv3 -> 26641.00
2025-01-27 23:08:04,319 - INFO - 5. brocav3 -> 26484.00
2025-01-27 23:08:04,319 - INFO - 6. brocav9 -> 26172.00
2025-01-27 23:08:04,329 - INFO - Intermediate dataset saved to 'dpo_preferences_dataset' after prompt 100.
2025-01-27 23:08:04,429 - INFO - Line chart saved to 'leaderboard_plots\leaderboard_linechart_100.png'.
2025-01-27 23:08:04,439 - INFO - Final dataset saved to 'dpo_preferences_dataset'.
2025-01-27 23:08:04,439 - INFO - DPO dataset generation completed.
The dataset is here: https://huggingface.co/datasets/CultriX/PFT-MME
That also links to the full script I used to do this :)
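In case the aggregation step isn't obvious from the log, here's a stripped-down sketch of it. The real logic is in the script linked above; this version just assumes each evaluator hands back a {model: score} dict and that the best- and worst-scoring answers per prompt become the chosen/rejected pair.

```python
# Stripped-down sketch of the tally-voting aggregation, not the full script
# (see the link above). Assumes evaluators return {model_name: score} and that
# the highest/lowest combined scores become the chosen/rejected DPO pair.
import json
from typing import Callable

Evaluator = Callable[[str, dict[str, str]], dict[str, float]]


def tally_vote(prompt: str, answers: dict[str, str], evaluators: list[Evaluator]) -> dict:
    """Sum each evaluator's scores per model and build one preference record."""
    totals = {name: 0.0 for name in answers}
    for evaluate in evaluators:
        scores = evaluate(prompt, answers)          # e.g. {'wernickev3': 90, ...}
        for name in totals:
            totals[name] += scores.get(name, 0.0)
    best = max(totals, key=totals.get)
    worst = min(totals, key=totals.get)
    return {
        "prompt": prompt,
        "chosen": answers[best],                    # highest combined score
        "rejected": answers[worst],                 # lowest combined score
        "scores": totals,
    }


# Toy run with a dummy judge standing in for the API-backed evaluators:
def length_judge(prompt: str, answers: dict[str, str]) -> dict[str, float]:
    return {name: float(len(text)) for name, text in answers.items()}


record = tally_vote(
    'Explain the concept of "fake news" in a short overview.',
    {"wernickev3": "A longer, more detailed answer...", "hyperionv3": "A short answer."},
    [length_judge],
)
print(json.dumps(record, indent=2))
```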
Interesting! I really have to look into how differently models rank there versus the Open LLM Leaderboard. Output quality isn't always reflected in the raw benchmark numbers; Lamarck v0.4 and v0.6 are where I saw the big jumps in output quality I was hoping for, but you'd never know it from the leaderboard. And Hyperionv3 being your favorite here sounds like the same thing! A good smoke test for new models really matters. If English prose alone were the measure, I'm sure we could merge more heavily with https://huggingface.co/underwoods/medius-erebus-magnum-14b and score lower, but have more fun. Precision is why I've always kept that influence light!
Though I'm happy with recent jumps in MATH and MUSR, I really value GPQA, and you consistently have some of the most important merges for that benchmark! I wonder if your reliance on these tiny benchmarks is why?
Finetunes of models with 40+ averages, like Lamarck, are beginning to appear, and I'm very optimistic that they'll provide a solid base for our next round of merges! I would never have guessed months ago how much our overlapping efforts could raise the bar.
@sthenno, I'm wondering if Qwenvergence v9 performs better re: the MATH confusions you're tracking. I'm starting to think that we need more A/B comparisons of otherwise identical merges, with and without DeepSeek R1.
@CultriX, @sthenno, I've decided to follow parallel merge paths, one with DeepSeek R1, and one without, at least until we get to the bottom of these variances. Here's a DeepSeek-free model_stock merge of the kind I often use as the first stage of a Lamarck merge sequence, but this one attempts to be at the center of our common interests: https://huggingface.co/sometimesanotion/Qwenvergence-14B-v11
It includes https://huggingface.co/sthenno/tempesthenno-ppo-ckpt40 as a merged model and as LoRAs to fill gaps in older models' performance, and includes https://huggingface.co/CultriX/Qwen2.5-14B-Hyperionv4 to round it out.
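If you want to riff on it, the general shape is below. This is not the v11 recipe itself: the base model is assumed, the LoRA repo name is made up, and the `checkpoint+adapter` form is just how I understand mergekit references a LoRA applied on top of a checkpoint.

```python
# Not the exact v11 config: just an illustration of mixing full checkpoints with
# a LoRA-augmented one in a model_stock recipe. The "+adapter" reference is how
# mergekit applies a LoRA onto a checkpoint (as I understand it), and the
# tempesthenno LoRA repo name below is hypothetical.
from pathlib import Path

V11_STYLE_SKETCH = """\
merge_method: model_stock
base_model: Qwen/Qwen2.5-14B                                   # assumed base
models:
  - model: sthenno/tempesthenno-ppo-ckpt40                     # merged directly
  - model: CultriX/Qwen2.5-14B-Hyperionv4
  # hypothetical: an older checkpoint with a tempesthenno-derived LoRA applied
  - model: sometimesanotion/Qwenvergence-14B-v9+sometimesanotion/tempesthenno-lora
dtype: bfloat16
"""

Path("qwenvergence_v11_sketch.yaml").write_text(V11_STYLE_SKETCH)
```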
I also think @jpacifico is on a good track with his finetune of Lamarck v0.7, https://huggingface.co/jpacifico/Chocolatine-2-14B-Instruct-v2.0b3. Though that means DeepSeek R1 is included, I've enjoyed its quality. Perhaps we can still tame these distills!
Good luck! I'm also looking forward to the eval results.
While the leaderboard doesn't have the results yet, the comparator does. Congratulations on https://huggingface.co/sthenno/tempesthenno-nuslerp-0124, @sthenno! We're tightening up, at least for merges including DeepSeek via Qwenvergence 10. But the question now is whether this model shows the same behaviors on MATH that you've seen with Lamarck. Lamarck is doing well at what I need from it, but feedback from other use cases and from similarly merged models helps.
Qwenvergence 11 falters a little on MUSR but has otherwise done acceptably well, and I credit the LoRAs from tempesthenno-ppo-ckpt40 for the decent IFEVAL. That's okay: this is the raw material for the branch/branch -> SLERP strategies we're trying. I seriously think stabilizing this and breaking a 42 average in 14B models is in the cards.
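The branch/branch -> SLERP idea, in rough terms: build two intermediate merges (say, one with DeepSeek and one without), then SLERP them together. A minimal sketch of that second stage follows; the branch paths are placeholders, t=0.5 is an arbitrary midpoint, and the exact slerp schema is worth double-checking against mergekit's docs.

```python
# Rough sketch of the SLERP stage of a branch/branch -> SLERP sequence.
# Branch paths are placeholders and t=0.5 is an arbitrary starting point.
from pathlib import Path

SLERP_STAGE_SKETCH = """\
merge_method: slerp
base_model: ./branch-without-deepseek     # placeholder: first model_stock branch
models:
  - model: ./branch-without-deepseek
  - model: ./branch-with-deepseek         # placeholder: second branch
parameters:
  t: 0.5                                  # interpolation factor; worth sweeping
dtype: bfloat16
"""

Path("slerp_stage_sketch.yaml").write_text(SLERP_STAGE_SKETCH)
```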
Here's an interesting experiment: why not take Vimarckoso back to its roots, and ask what would go into Wernicke's recipe with the models and merging chops we have now, with models emphasizing GPQA? Let's see what happens!
https://huggingface.co/sometimesanotion/Qwen2.5-14B-Vimarckoso-v4-model_stock-DS