Mixed Precision GGUF layer quantization of Qwen2.5-Coder-14B-Instruct by Qwen

Original model: https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct

The hybrid quant applies different quantization levels on a per-layer basis to achieve both high performance and small file size. Only K quants are used, to avoid the slow processing of IQ quants on CPUs and older GPUs.

Q4_K_H layer quants are as follows:

Q4_K_L : Q4_K_M + attn_o = q6_k
Q5_K_L : attn_v = q8_0 attn_o = q6_k ffn_d = q6_k

   LAYER_TYPES='[
   [0 ,"Q4_K_M"],[1 ,"Q4_K_S"],[2 ,"Q3_K_L"],[3 ,"Q3_K_M"],[4 ,"Q3_K_L"],[5 ,"Q3_K_M"],[6 ,"Q3_K_L"],[7 ,"Q3_K_M"],
   [8 ,"Q3_K_L"],[9 ,"Q3_K_M"],[10,"Q3_K_L"],[11,"Q3_K_M"],[12,"Q3_K_L"],[13,"Q3_K_L"],[14,"Q4_K_S"],[15,"Q3_K_L"],
   [16,"Q4_K_S"],[17,"Q3_K_L"],[18,"Q4_K_S"],[19,"Q3_K_L"],[20,"Q4_K_S"],[21,"Q3_K_L"],[22,"Q4_K_S"],[23,"Q3_K_L"],
   [24,"Q4_K_S"],[25,"Q4_K_S"],[26,"Q4_K_S"],[27,"Q4_K_S"],[28,"Q4_K_S"],[29,"Q4_K_S"],[30,"Q4_K_S"],[31,"Q4_K_S"],
   [32,"Q4_K_M"],[33,"Q4_K_S"],[34,"Q4_K_M"],[35,"Q4_K_S"],[36,"Q4_K_M"],[37,"Q4_K_S"],[38,"Q4_K_M"],[39,"Q4_K_S"],
   [40,"Q4_K_M"],[41,"Q4_K_M"],[42,"Q4_K_M"],[43,"Q4_K_M"],[44,"Q4_K_L"],[45,"Q4_K_M"],[46,"Q4_K_L"],[47,"Q5_K_L"]
   ]'
   FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K --layer-types-high"
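
The per-layer distribution can be summarized with a short shell sketch. The `tally_layer_types` helper below is hypothetical (it is not part of llama.cpp or of the scripts used to build this quant); it only counts how often each quant name appears in a `LAYER_TYPES`-style string. The example runs on the first four entries of the list above.

```shell
# Hypothetical helper: count layers per quant type in a LAYER_TYPES string.
# Extracts the quoted quant names, then prints "<quant> <count>" per type.
tally_layer_types() {
    printf '%s\n' "$1" |
        grep -oE '"[A-Za-z0-9_]+"' |
        tr -d '"' |
        sort | uniq -c |
        awk '{ print $2, $1 }'
}

# Excerpt: the first four entries of LAYER_TYPES.
EXCERPT='[[0 ,"Q4_K_M"],[1 ,"Q4_K_S"],[2 ,"Q3_K_L"],[3 ,"Q3_K_M"]]'
tally_layer_types "$EXCERPT"
```

Passing the full `LAYER_TYPES` value instead of the excerpt gives the breakdown for all 48 layers.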

This quant was optimized over a small set of curated test prompts for code generation ability and then sanity-checked for good performance on HumanEval.

Comparison:

| Quant | Size (bytes) | PPL | Comment |
| --- | --- | --- | --- |
| IQ4_XS | 8.2e9 | 8.03 | - |
| Q4_K_H | 8.6e9 | 8.06 | Hybrid quant with Q4_K embedding, Q6_K output |

Usage:

The model can be used with speculative decoding, with Qwen 2.5 Coder 0.5B Instruct as the draft model and no vocab translation needed. It is trained at 32k context, which can be extended to 128k using YaRN:

--rope-scaling yarn --yarn-orig-ctx 32768 --rope-scale 4

For a context size other than 128k, set the rope scale to the configured context size divided by 32768.0.
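
The scale arithmetic can be sketched in shell as follows. Both helper names are hypothetical, and the flag spellings follow upstream llama.cpp (`--rope-scaling`, `--yarn-orig-ctx`, `--rope-scale`); treat this as an illustration of the rule above rather than a required setup.

```shell
# Hypothetical helper: rope scale = configured context / native 32k context.
rope_scale() {
    awk -v ctx="$1" 'BEGIN { printf "%g\n", ctx / 32768.0 }'
}

# Hypothetical helper: print the YaRN flags for a given context size in tokens.
yarn_flags() {
    printf '%s %s\n' \
        '--rope-scaling yarn --yarn-orig-ctx 32768 --rope-scale' \
        "$(rope_scale "$1")"
}

yarn_flags 65536    # 64k context gives a rope scale of 2
```

For the 128k case, `yarn_flags 131072` reproduces the scale of 4 given above.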

Approximate performance on a 12G VRAM 4070 with weights and context in VRAM:

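| Quant (Q) | KV cache type (QKV) | Draft tokens (ND) | KV context (NKV) | gen tps | Comment |
| --- | --- | --- | --- | --- | --- |
| Q4_K_H | F16 | 0 | 16k | 48 | No draft |
| Q4_K_H | F16 | 8 | 12.5k | 143 | Spec 8 |
| Q4_K_H | Q8_0 | 0 | 29.5k | 48 | No draft |
| Q4_K_H | Q8_0 | 8 | 22k | 143 | Spec 8 |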

For speculation, a fixed-length ND=8 token draft was used with a custom downstream speculator.
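
The figures above come from a custom speculator, so they will not carry over directly to stock tooling. As a rough approximation only, upstream `llama-server` accepts a draft model, and a comparable configuration might look like the sketch below. The draft file name is hypothetical, `22528` is an assumed reading of "22k", and `--draft-max 8` caps the draft length rather than fixing it.

```shell
MODEL="Qwen2.5-Coder-14B-Instruct.Q4_K_H.gguf"
DRAFT="Qwen2.5-Coder-0.5B-Instruct.Q8_0.gguf"   # hypothetical draft file name

# Print (rather than run) a llama-server command line for drafted generation:
# both models fully offloaded, Q8_0 KV cache, drafts of up to 8 tokens.
spec_cmd() {
    echo llama-server \
        -m "$MODEL" -md "$DRAFT" \
        -ngl 99 -ngld 99 \
        -c 22528 -ctk q8_0 -ctv q8_0 \
        --draft-max 8
}

spec_cmd
```

Depending on the llama.cpp build, a quantized V cache may also require flash attention to be enabled.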

Benchmarks:

A full set of code evals for the quant is given here: https://huggingface.co/spaces/steampunque/benchlm

Download the file from below:

| Link | Type | Size | Notes |
| --- | --- | --- | --- |
| Qwen2.5-Coder-14B-Instruct.Q4_K_H.gguf | Q4_K_H | 8.6e9 B | better code gen performance than IQ4_XS |
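
The file can also be fetched from the command line. The sketch below assumes the repository id `steampunque/Qwen2.5-Coder-14B-Instruct-MP-GGUF` (taken from this model's page) and that the file sits on the `main` branch; the `gguf_url` helper is hypothetical.

```shell
REPO="steampunque/Qwen2.5-Coder-14B-Instruct-MP-GGUF"
FILE="Qwen2.5-Coder-14B-Instruct.Q4_K_H.gguf"

# Print the direct-download URL, assuming the file is on the main branch.
gguf_url() {
    echo "https://huggingface.co/$REPO/resolve/main/$FILE"
}

# Uncomment one of the following to fetch the file:
# curl -L -O "$(gguf_url)"
# huggingface-cli download "$REPO" "$FILE" --local-dir .
gguf_url
```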

A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
