“The doom lies in yourself, not in your name.”

#15
by jukofyork - opened

Continuation of Wur doomed!.

For longer text chunks or stories, https://pastebin.com works great and helps prevent the thread from slowing down!

🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧
🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛🟧
🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧🟧
⬜🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧⬛🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜⬜🟧🟧⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜⬜🟧🟧🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛⬛🟧⬜
⬜🟧⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛🟧⬜
⬜🟧⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛🟧⬜
⬜🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧⬜

jukofyork pinned discussion

The doom is still buried within Command-A for sure.

Only another 38 days to go:

[image]


It's actually going really well, and I'm pretty sure it will be mostly converged within another couple of days:

[image]

🤞

A step 601 preview - all with temperature = 0:

https://pastebin.com/GASKaHTk

https://pastebin.com/CRT81QLb

  • It's still messing up some ends of lines, but I can live with that if it works... It can likely be fixed later using the new class-0 random data if it turns out to be a problem.
  • The Grimdark story was noticeably (much!) better compared to the inverse.
  • The Battlestar Galactica story showed that even though Q8_0, F16 and BF16 all diverge slightly from F32, it doesn't clearly make them any worse (I actually liked the Q8_0 story best!).
Size  Name
287M  command-a-03-2025-lora-Q8_0.gguf
541M  command-a-03-2025-lora-F16.gguf
541M  command-a-03-2025-lora-BF16.gguf
1.1G  command-a-03-2025-lora-F32.gguf

It still has a way to go before it starts to converge, but I would think by step 1000 it will be pretty close:

[image]

566 responses in the previous thread! At this rate we may be the reason for HF staff to implement a multi-page view for discussions.

This was posted on Hacker News today:

https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth?selection=5413dcae-b9f4-4adb-8826-d48e3908de2a#:~:text=Wow%2C%20best%20rendition%20of%20the%20Global%20West%20so%20far

Absolutely fascinating!

That was really cool. Thanks for sharing!

Yeah, and llama-3.1:405b doing so well was quite a surprise too (and it makes you a bit sad that everything seems to be moving away from large dense models).

My bad, I confused the non-Pro and Pro versions. The one I have working is AesSedai/MiMo-V2.5-GGUF.
The Pro version isn't working with ik_llama; someone will have to port it.

It's probably very easy to hack convert_hf_to_gguf.py - all you'd need to do is look at how @AesSedai's original code split them and change modify_tensors to yield the 3 tensors with the names ik_llama expects, e.g.:

I have hacked versions of kimi and glm that retain kv_b_proj for use with ik_llama (so it doesn't dequant/requant to Q8_0) by adding an extra yield here:

https://github.com/ggml-org/llama.cpp/blob/78fbbc2c0788efc8857a2c0dc9802ec689fa12c1/convert_hf_to_gguf.py#L9431

and then copying the contents of ik_llama's set_gguf_parameters to keep the attention parameters as MHA rather than MQA, etc.
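Roughly what that extra yield looks like - just a sketch from memory, not the exact code (the hparams names, map_tensor_name(), and whether the fused attn_kv_b mapping still exists in current mainline are all assumptions):

```python
# Sketch only: inside the DeepSeek-style model class of convert_hf_to_gguf.py
# (torch is already imported there). Where modify_tensors() splits kv_b_proj
# into k_b/v_b, also return the original fused tensor so ik_llama can load it
# directly instead of dequantizing/requantizing it to Q8_0 at load time.
def modify_tensors(self, data_torch, name, bid):
    if name.endswith("kv_b_proj.weight"):
        n_head_kv = self.hparams["num_key_value_heads"]
        v_head_dim = self.hparams["v_head_dim"]
        qk_nope_head_dim = self.hparams["qk_nope_head_dim"]

        # fused weight is (n_head_kv * (qk_nope_head_dim + v_head_dim), kv_lora_rank)
        kv_b = data_torch.view(n_head_kv, qk_nope_head_dim + v_head_dim, data_torch.shape[-1])
        k_b, v_b = torch.split(kv_b, [qk_nope_head_dim, v_head_dim], dim=1)
        k_b = k_b.transpose(1, 2)  # k_b is consumed transposed at inference time

        return [
            # the extra entry: keep the fused kv_b_proj alongside the split pair
            (self.map_tensor_name(name), data_torch),
            (self.map_tensor_name(name.replace("kv_b_proj", "k_b_proj")), k_b),
            (self.map_tensor_name(name.replace("kv_b_proj", "v_b_proj")), v_b),
        ]
    return [(self.map_tensor_name(name), data_torch)]
```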

I also hacked glm-5 to use the deepseek arch instead of glm5-moe to get it to load cleanly into ik_llama (I forget exactly why now, but IIRC it was the 3D tensors for k_b_proj and v_b_proj causing problems).

One slight pain-point: if you ask an LLM to help you, make sure you point out that GGML is row-major and show it some docs about how 2D and 3D tensors are stored in GGUF... Otherwise it'll get completely confused!
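For reference, the gotcha is just that GGML lists dimensions fastest-varying first, so shapes look reversed relative to PyTorch/numpy:

```python
import torch

t = torch.zeros(8, 128, 512)   # PyTorch/numpy shape: (8, 128, 512), rows of 512 contiguous
# The same data in GGML/GGUF is described as ne = [512, 128, 8]:
# ne[0] is the contiguous (row) dimension, i.e. the reverse of the PyTorch shape tuple.
```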

All MiMo Pro quants that I found on HF are sadly merged and I really don't want to run convert again for 48 hours. I hope I'm not missing out on much.

Pro has the sauce, so I'd say you are probably missing out ;(

You can see the commit where I changed from unfused back to fused, and revert this: https://github.com/ggml-org/llama.cpp/pull/22493/changes/1cf092ca5411d764d7abfd0d87f8f2376c44c261#diff-ec77d8003b92ff283179456d36b8b56abf635e7b1232e70daf16676e8920ccf1L9479-L9497

Ah shit, here I go again...

I wish ik supported the SWA memory savings like mainline does.

IK's 2/3 bit quant of Mimo would be perfect for 128GB RAM systems, or maybe Pro in some larger configs, but the hit to KV cache size is considerable.

@Downtown-Case

I wish ik supported the SWA memory savings like mainline does.
IK's 2/3 bit quant of Mimo would be perfect for 128GB RAM systems, or maybe Pro in some larger configs, but the hit to KV cache size is considerable.

Ah, maybe that's why I have to use 4 GPUs for gemma-4 with ik vs 2 for mainline. I haven't looked into it yet.
Gemma-4 is broken with ik_llama anyway. If you give it a long prompt with > 10k tokens in a single turn, the context gets truncated.
It's a shame because -sm graph is much faster than -sm tensor

@jukofyork

https://github.com/ikawrakow/ik_llama.cpp/issues/1769

shit:

In any case, it wouldn't be a big deal to add to ik_llama.cpp the ability to load pre-merged attention tensors. After all, ik_llama.cpp has the ability to merge them on-the-fly when loading the model (in case Q, K and V are of the same quantization type) thus achieving the exact same result as with pre-merged, backwards incompatible models. But llama.cpp developers constantly breaking backwards compatibility for no real reason is just a bit too much.

Well, it looks like we won't be getting this any time soon...

prompt eval time =  499899.61 ms / 16798 tokens (   29.76 ms per token,    33.60 tokens per second)
       eval time =   31262.72 ms /   466 tokens (   67.09 ms per token,    14.91 tokens per second)
      total time =  531162.33 ms / 17264 tokens

That's MiMo Pro IQ2_S on my rig; prompt processing is PCIe bandwidth-bound with mainline, with GPU0 maxing out its lanes.
