Morgan McGuire

morgan

https://www.ntentional.com

Posts 3

Post

Llama 3.1 405B Instruct beats GPT-4o on MixEval-Hard

Just ran MixEval for 405B, Sonnet-3.5 and 4o, and 405B landed right between the other two at 66.19.

The GPT-4o result of 64.7 replicated locally, but Sonnet-3.5 actually scored 70.25/69.45 in my replications 🤔 Either way, Sonnet-3.5 is still well ahead of the other two.

Sample of one of the eval calls here: https://wandb.ai/morgan/MixEval/weave/calls/07b05ae2-2ef5-4525-98a6-c59963b76fe1

Quick auto-logging and tracing for OpenAI-compatible clients and many more here: https://wandb.github.io/weave/quickstart/

Post

Fine-tuning LLMs is rad, but how do you manage all your checkpoints and evals in a production setting?

We partnered with @hamel to ship an Enterprise Model Management course packed full of learnings for those training, evaluating and deploying models at work.

Topics include:
- What webhooks are & how to use them to create integrations with different tools
- How to automate train -> eval runs
- Improving model governance and documentation
- Comparing candidate and baseline models
- Design patterns & recipes
- Lots more...

Would love to hear what you think!

👉 https://www.wandb.courses/courses/enterprise-model-management

models 4

datasets 3

morgan/docvqa-nanochat

Viewer • Updated Dec 30, 2025 • 44.8k • 27

morgan/arithmetic_eval

Viewer • Updated Jun 15, 2025 • 200 • 19

morgan/tortas

Viewer • Updated Dec 23, 2022 • 37 • 11