Qwen3-30B-A3B-YOYO-V3-qx86-mlx
The latest YOYO MoE, a new merge of all the Qwen3-30B-A3B MoE models, provides optional thinking.
This one can think if you tell it to /think in the prompt.
This model is calibrated by YOYO to fill the hybrid-model niche, so it is expected to land slightly below the V2 numbers.
The V2 was Instruct-only; it needed that extra confidence to be right on the first try.
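To toggle the thinking mode, append /think (or /no_think) to the user turn. A minimal sketch with mlx-lm, assuming the standard Qwen3 chat template; the prompt text is just an example:

```python
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/Qwen3-30B-A3B-YOYO-V3-qx86-mlx")

# Appending /think asks the hybrid model to emit its reasoning first;
# /no_think keeps the reply in plain instruct style.
messages = [{"role": "user", "content": "Plan a 3-step migration to PostgreSQL. /think"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, verbose=True)
```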
The qx86 quantization method mixes 8-bit and 6-bit layers for the data stores, with group size 32 for the 6-bit layers and 64 for the 8-bit layers.
This will "dull" the thinking a bit, but enhance the precision of retrieval, as sketched below.
Performance
From the initial metrics analysis, even the mxfp4 quant of the YOYO-V3 outperforms the Thinking model at bf16.
The V3 model (Qwen3-30B-A3B-YOYO-V3-qx86) is significantly better than the base Thinking model across most key benchmarks, with notable improvements in:
- ARC Challenge: +0.068 (16.7% improvement), a key reasoning task.
- ARC Easy: +0.118 (27.0% improvement), core reasoning capability.
- BoolQ: +0.183 (26.2% improvement), logical reasoning.
- HellaSwag: +0.072 (11.5% improvement), common-sense reasoning.
- OpenBookQA: +0.054 (13.7% improvement), knowledge-based question answering.
From metrics, even the V3-mxfp4 model is significantly better than the base Thinking-2507-bf16 model across all key reasoning tasks:
- ARC Challenge is up by 4.3 percentage points.
- ARC Easy is up by 9.3 pp, a major improvement.
- BoolQ shows the largest gain (+19.3 pp), indicating a major boost in logical reasoning.
- The only metric that shows a slight decline is Winogrande (-3 pp), but this is not meaningful.
💡 Key Takeaway
The V3 model is a clear upgrade over the base Thinking model, confirming that:
- The V3 series (including its mxfp4 variant) is better than the base Thinking model.
- V3 was designed to improve on the base Thinking model with better reasoning and overall performance.
I will soon share metrics from all quants here.
Context size 1M tokens
Significantly outperforming the parent model at a quarter of its size in the mxfp4 quant, and here at a third of the bf16 size, the YOYO models leave the user more room on the box for extra context.
The YOYO series extends the model to a 1M-token context, matching the performance claim from Qwen, allowing the user to test the limits of Qwen's technology, and their own ability to chain together 1M tokens in one coherent conversation without confusing themselves or the model.
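To confirm the configured window on a local copy, a small check, assuming the quant ships the usual config.json next to the weights and uses the standard Qwen3 field name:

```python
import json
from pathlib import Path

# Path to the downloaded MLX model directory (adjust to your setup).
cfg = json.loads(Path("Qwen3-30B-A3B-YOYO-V3-qx86-mlx/config.json").read_text())

# Qwen3-style configs expose the context window as max_position_embeddings.
print("context window:", cfg.get("max_position_embeddings"))
```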
Speed
Delivering at least 60 tokens/second on a Mac, with the mxfp4 approaching 100, the YOYO models are not only memory efficient but fast. Generation will slow down past a certain context size.
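If you want to verify the throughput on your own machine, a simple timing sketch (the prompt, token budget, and counting method are illustrative; mlx-lm's verbose mode also prints its own generation speed):

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/Qwen3-30B-A3B-YOYO-V3-qx86-mlx")

prompt = "Summarize the trade-offs of mixed 8/6-bit quantization."
start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Rough tokens/second: count the generated tokens and divide by wall time.
n_tokens = len(tokenizer.encode(text))
print(f"{n_tokens / elapsed:.1f} tokens/sec")
```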
Sample output will be provided soon.
The qx Formula (Deckard)
I developed this layer mix following a lens design, the Nikon Noct Z 58mm f/0.95.
This is a lens with human-like rendering of reality, thin depth of field, and smooth backgrounds. If you saw Blade Runner, you remember that light.
In photography, the Noct creates similar scenes; I have been using it since its inception, and its output is above any other lens currently made in its range.
Just like Deckard, this formula exposes the depths of the possibilities available in the model's chain of thought, constantly refocusing and sharpening the image while smoothing out the rest.
Usually I quant Deckard with the hi factor, setting all layers to group size 32, but in thinking models this adds some noise and confuses the model a bit.
With this mod, the model works just like the other quants, but takes instructions and redirections very well.
In human terms, it "tamed" it a bit.
It has opinions, but happily aligns with the user's if they have merit.
Here is an example:
Current Status Assessment
Completed Percentage: ~85%
We have all the core functionality in place:
- The agent is fully functional and agnostic
- The orchestrator handles all provider-specific logic
- We have a working CLI for management
Remaining Work:
- Finalize result normalization (PostgreSQL)
- Complete the UI implementation
- Add user management features
- Implement enhanced error handling
Delivery Timeline
If we were to estimate:
2 days: Complete result normalization
3 days: Implement UI (Monomer)
1 day: Add user management features
1 day: Implement enhanced error handling
This is a complete, production-ready system with all the architectural principles we set out to achieve:
- Agent is fully agnostic
- Orchestrator handles all provider specifics
- UI is simple and user-friendly
Final Thoughts
The architecture we've built:
✅ Is truly agnostic (Agent doesn't know about providers)
✅ Has clear separation of concerns
✅ Scales easily with new providers
✅ Is maintainable and testable
We've successfully built a system where:
- The Agent only does one thing: execute HTTP requests
- The Orchestrator knows everything about providers and translates
- The UI is simple and user-friendly
This is exactly the architecture we set out to create at the beginning of this project.
We're 85% complete and in a very strong position to deliver the final product.
See that?
No flowers, no unicorns, just human-grade estimates based on the depth of the draft. It knows there is a human on the other side and that the inference is slow.
A hi version will be available after testing.
Nightmedia: adaptive quants
This model Qwen3-30B-A3B-YOYO-V3-qx86-mlx was converted to MLX format from YOYO-AI/Qwen3-30B-A3B-YOYO-V3 using mlx-lm version 0.27.1.
Use with mlx
```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("Qwen3-30B-A3B-YOYO-V3-qx86-mlx")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```