Decent PPL with 100% IQ4_KSS

#3
by sokann - opened

I tried a quant with 100% IQ4_KSS tensors, and the PPL is quite good:

Final estimate: PPL over 594 chunks for n_ctx=512 = 3.9098 +/- 0.02107

The size is 58.27 GiB, so about 10 GiB smaller πŸ˜„
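For context, that figure is from the usual llama-perplexity run over 512-token chunks; a minimal invocation (file names are just placeholders) looks roughly like this:

```bash
# Minimal sketch of a perplexity run (placeholder file names).
# Both mainline llama.cpp and ik_llama.cpp ship this tool and print
# the final PPL estimate at the end.
./build/bin/llama-perplexity \
  -m model-IQ4_KSS.gguf \
  -f wiki.test.raw \
  -c 512
```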

Nice!

Yeah, I kept the attn tensors all a little larger at iq6_k, and also ffn_down_exp one size larger at iq4_ks, so the perplexity will be slightly better at the cost of some size. 58 or 68 GB is still a slightly awkward break point in size, as folks will likely have either 48 or 96 GB of VRAM... but with your iq4_kss you can definitely fit more context if needed, and I'm sure it will be slightly faster too!
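For anyone following along, that kind of mix is applied through llama-quantize with per-tensor regex overrides; a rough sketch (regexes, file names, and the --custom-q format here are illustrative assumptions based on ik_llama.cpp's custom-quant workflow, not the exact published recipe) would be:

```bash
# Illustrative sketch only, not the exact published recipe.
# Assumes ik_llama.cpp's llama-quantize with --custom-q per-tensor regex overrides.
custom="
blk\..*\.attn_.*\.weight=iq6_k,
blk\..*\.ffn_down.*\.weight=iq4_ks,
blk\..*\.ffn_(gate|up).*\.weight=iq4_kss
"
# Strip comment lines and newlines so the overrides become one comma-separated argument.
custom=$(echo "$custom" | grep -v '^#' | tr -d '\n' | sed 's/,$//')

./build/bin/llama-quantize \
  --imatrix imatrix.dat \
  --custom-q "$custom" \
  model-bf16.gguf model-IQ4_KSS-mix.gguf IQ4_KSS
```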

Thanks for the report!

I'm not sure what other sizes I'd even like to release here, and I may not release any more unless there are specific requests. Dense model recipes are harder to smash down to ~2 bpw while keeping them smart enough, haha...

Just saw your comment on r/LocalLLaMA about the various quantization types. Very educational πŸ‘

Incidentally, I previously also made an IQ3_XXS / IQ4_XS mix for mainline:

```
## Attention [0-87]
## Keep qkv the same to allow --merge-qkv
blk\..*\.attn_q.*\.weight=iq4_xs
blk\..*\.attn_k.*\.weight=iq4_xs
blk\..*\.attn_v.*\.weight=iq4_xs
blk\..*\.attn_output.*\.weight=iq4_xs

## Dense Layers [0-87]
blk\..*\.ffn_down\.weight=iq3_xxs
blk\..*\.ffn_(gate|up)\.weight=iq3_xxs

## Non-Repeating layers
token_embd\.weight=iq4_xs
output\.weight=iq4_xs
```
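Applying a recipe like that on mainline can be done with llama-quantize's per-tensor overrides; a rough sketch, assuming a recent build that has the --tensor-type flag (older builds would need another route, e.g. ik_llama.cpp's --custom-q), with placeholder file names:

```bash
# Rough mainline sketch of the recipe above (placeholder file names).
# Assumes a recent llama-quantize with --tensor-type overrides; the pattern
# syntax there is simpler than the regexes in the recipe file.
./build/bin/llama-quantize \
  --imatrix imatrix.dat \
  --tensor-type attn_q=iq4_xs \
  --tensor-type attn_k=iq4_xs \
  --tensor-type attn_v=iq4_xs \
  --tensor-type attn_output=iq4_xs \
  --tensor-type ffn_down=iq3_xxs \
  --tensor-type ffn_gate=iq3_xxs \
  --tensor-type ffn_up=iq3_xxs \
  --token-embedding-type iq4_xs \
  --output-tensor-type iq4_xs \
  model-bf16.gguf model-IQ3_XXS-mix.gguf IQ3_XXS
```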

And this has a much worse PPL, roughly 12.6% higher than the IQ4_KSS mix:

Final estimate: PPL over 594 chunks for n_ctx=512 = 4.4030 +/- 0.02604

However, for my eval it somehow performs the closest to the devstral-2512 served from https://api.mistral.ai, compared to the other, bigger quants that I tried. This is really quite bizarre; it might just be a coincidence. I think @AesSedai also previously got some great GLM-4.5/4.6 quants with IQ3_XXS.

Interesting mix, seems reasonable for a mainline quant! I'd only suggest changing the non-repeating layers; the tradition for mainline quants is:

```
## Non-Repeating layers
token_embd\.weight=q4_K
output\.weight=q6_K
```

This won't make the quant much bigger since these tensors are not repeated per layer; typically keeping the output "head" at ~6 bpw and the token embedding at 4-6 bpw is fine. Keep in mind the patterns are case sensitive.
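On mainline you don't even have to touch the regex recipe for these two, since llama-quantize has dedicated flags for them; a minimal sketch (placeholder file names) would be:

```bash
# Same idea as the token_embd/output overrides above, using llama-quantize's
# dedicated flags on mainline llama.cpp (placeholder file names).
./build/bin/llama-quantize \
  --imatrix imatrix.dat \
  --token-embedding-type q4_K \
  --output-tensor-type q6_K \
  model-bf16.gguf model-IQ3_XXS-mix.gguf IQ3_XXS
```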

> However, for my eval it somehow performs the closest to the devstral-2512 served from https://api.mistral.ai

Huh, it could be that the official version being served is a lower quant to help them save on costs, maybe some ~4 bpw vLLM-type quant? Also, what is your "eval"? Yeah, iq3_xxs is one of the last quant types ik did on mainline before the newer stuff in ik_llama.cpp...
