What is the .bin file?


Is that imatrix data? I thought ik_llama.cpp only took .dat files, and it seems quite huge, heh.

Oh, that must be for KLD testing or something.

I would appreciate an imatrix if you have it, heh.

Yes, I'll be performing KLD testing for the quants I upload here, and I uploaded the logits for another fellow to download for their own testing. I have a couple of quants uploading now; my upload speed averages around 4 MB/s, so it'll take a few days for this repo to be populated.

The imatrix was from bartowski, in his repo here: https://huggingface.co/bartowski/zai-org_GLM-4.6-GGUF/tree/main

The quants here will follow a somewhat different schema than the usual llama.cpp quants: most of the model will be kept in Q8, and only the FFN_UP, FFN_GATE, and FFN_DOWN tensors for the conditional experts will be quantized lower. In previous testing this has better preserved the KL divergence against the reference Q8_0 model while staying the same size as, or smaller than, the usual llama.cpp quants. I'll post those results here when testing is concluded.
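
For reference, a rough sketch of how that kind of mix can be expressed with ik_llama.cpp's custom quantization rules (--custom-q); the imatrix filename, the exact UP/GATE/DOWN types, and the thread count below are placeholders rather than the exact recipe for this repo:

# Base type q8_0 keeps everything at Q8; only the routed-expert FFN tensors are overridden.
# The imatrix path is a placeholder for whichever imatrix file you're using.
./llama-quantize \
    --imatrix GLM-4.6.imatrix \
    --custom-q "ffn_down_exps=q5_K,ffn_(gate|up)_exps=q4_K" \
    GLM-4.6-BF16.gguf GLM-4.6-Q8_0-Q4_K-Q4_K-Q5_K.gguf q8_0 48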

Cool, thanks for the info!

What's the approximate command you use for KLD testing? I'm interested in this too, and TBH don't have the hardware for the Q8_0 data, so that's useful to me.

I'm quanting something similar right now, though I'm doing slightly lower quants than Q8_0:

# Attention (GPU)
blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_ks

# First 3 Dense Layers [0-2] (GPU)
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq5_ks

# Shared Expert Layers [3-92] (GPU)
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks

# Routed Experts Layers [3-6] (GPU)
blk\.[3-6]\.ffn_down_exps\.weight=iq3_kt
blk\.[3-6]\.ffn_(gate|up)_exps\.weight=iq3_kt

# Routed Experts Layers [7-19] (CPU)
blk\.[7-9]\.ffn_down_exps\.weight=iq3_ks
blk\.[7-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
blk\.1[0-9]\.ffn_down_exps\.weight=iq3_ks
blk\.1[0-9]\.ffn_(gate|up)_exps\.weight=iq3_ks

# Routed Experts Layers [20-80] (CPU)
blk\.(2[0-9]|[3-7][0-9]|80)\.ffn_down_exps\.weight=iq2_kl
blk\.(2[0-9]|[3-7][0-9]|80)\.ffn_(gate|up)_exps\.weight=iq2_kl

# Routed Experts Layers [81-92] (CPU)
blk\.8[1-9]\.ffn_down_exps\.weight=iq3_ks
blk\.8[1-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
blk\.9[0-2]\.ffn_down_exps\.weight=iq3_ks
blk\.9[0-2]\.ffn_(gate|up)_exps\.weight=iq3_ks

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k

And now I'm wondering if there are 'sweet spots' for the attention and other shared layers, since they do add up to be pretty small.

So, this is how I generated the reference logits to begin with:

./llama-perplexity \
    --n-gpu-layers 999 --threads 48 \
    --override-tensor "blk\.(0|1|2|3|4)\.ffn_.*=CUDA0" \
    --override-tensor "blk\.(5|6|7)\.ffn_.*=CUDA1" \
    --override-tensor "blk\..*_exps\.=CPU" \
    --flash-attn on \
    --file /mnt/srv/host/resources/KLD/ddh0_imat_calibration_data_v2.txt \
    --save-all-logits /mnt/srv/host/resources/GLM-4.6-KLD-ref-logits-Q8_0-ddh0-imat-calibration-data-v2.bin \
    --model /mnt/srv/slush/gguf/GLM-4.6-GGUF/GLM-4.6-Q8_0.gguf

The overrides, GPU layers, and threads are all configurable for your specific setup, but this is what works for mine (768GB of 12-channel DDR5-6000 and two 3090s). That produces a set of reference logits based on the text corpus.

Afterwards, I produce a model quant then test it as follows:

./llama-perplexity \
    --n-gpu-layers 999 --threads 48 \
    --override-tensor "blk\.(0|1|2|3|4)\.ffn_.*=CUDA0" \
    --override-tensor "blk\.(5|6|7)\.ffn_.*=CUDA1" \
    --override-tensor "blk\..*_exps\.=CPU" \
    --flash-attn on \
    --file /mnt/srv/host/resources/KLD/ddh0_imat_calibration_data_v2.txt \
    --kl-divergence --kl-divergence-base /mnt/srv/host/resources/GLM-4.6-KLD-ref-logits-Q8_0-ddh0-imat-calibration-data-v2.bin \
    --model /mnt/srv/host/gguf/GLM-4.6-GGUF/aes_sedai/GLM-4.6-Q8_0-Q4_K-Q4_K-Q5_K.gguf

That spits out a set of statistics at the end for perplexity, KL divergence, and token probabilities. I've got a little automation set up to save the output of each llama-perplexity run into a .md file per quant, which another script then processes into a CSV and a series of plots.
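
For anyone who wants to replicate the CSV step, here's a rough sketch; the results/ layout is arbitrary, and the grep labels follow what the --kl-divergence summary prints from memory ('Mean PPL(Q)', 'Mean KLD', 'Same top p'), so adjust them to your build's exact output:

#!/usr/bin/env bash
# Collect the headline numbers from each saved llama-perplexity log into one CSV.
# The label patterns are approximate; tweak them to match your build's summary block.
echo "quant,mean_ppl_q,mean_kld,same_top_p" > kld_summary.csv
for f in results/*.md; do
    quant=$(basename "$f" .md)
    ppl=$(grep -F -m1 'Mean PPL(Q)' "$f" | grep -oE '[0-9]+\.[0-9]+' | head -n1)
    kld=$(grep -E -m1 'Mean +KLD'   "$f" | grep -oE '[0-9]+\.[0-9]+' | head -n1)
    top=$(grep -F -m1 'Same top p'  "$f" | grep -oE '[0-9]+\.[0-9]+' | head -n1)
    echo "$quant,$ppl,$kld,$top" >> kld_summary.csv
done

From there the CSV feeds straight into whatever plotting tooling you prefer.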

For reference, here are a couple of plots for quants I tested for GLM-4.5. Mean KLD vs File Size first:
[Plot: Mean KLD vs. file size (01_kld_vs_filesize)]

and PPL vs. Mean KLD here:
[Plot: PPL ratio vs. Mean KLD (02_ppl_ratio_vs_kld)]

Most of my quants were Q8 for the default type, with just the FFN_UP, FFN_GATE, and FFN_DOWN tensors quanted to, e.g., Q4 / Q4 / Q5 respectively. This means that the shared experts are all Q8, the attention is Q8 (which means it does take up more VRAM for context), etc. Just quanting the conditional experts down via those FFNs looks really appealing IMO. I'm currently quanting and doing similar testing for GLM-4.6, and I'll upload a few of the more promising mainline llama.cpp and ik_llama.cpp quants over the next few days.

me lurking

"the attention is Q8 (which means it does take up more VRAM for context)"

This is not true btw :3 (the VRAM used for context is the KV cache, whose size depends on the cache type and context length, not on how the attention weight tensors are quantized)
