---
license: apache-2.0
base_model: Qwen/Qwen3-VL-2B-Instruct
base_model_relation: quantized
tags:
- Qwen3 VL Instruct 2B
- GGUF
- quantized
- 8-bit
---

## Llama.cpp hybrid layer quantization of Qwen3-VL-2B-Instruct by Qwen

Original model: https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct

The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. This quant was optimized for high performance across a set of test prompts at ~Q8_0 size. The model predominantly exhibits repetition failures (rep fails) across a set of curated test prompts: it falls into infinite repeat loops on most prompts when using greedy sampling. Extensive testing showed there is no way to correct this problem by adjusting the layer quants; the problem is baked into the model by the training process. The model is still usable with simple vision prompts but will often rep fail when asked to solve a prompt with step-by-step reasoning under greedy sampling. The VL 32B Instruct model does not exhibit this failure mode.

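Since the failure appears specifically under greedy sampling, switching to ordinary stochastic sampling is the natural workaround to try. The llama-cli invocation below is an untested sketch of that idea, not a verified fix, and the sampling values are illustrative only:

```
# Untested mitigation sketch: replace greedy decoding with temperature
# sampling plus a mild repeat penalty to break out of repeat loops.
./llama-cli -m Qwen3-VL-2B-Instruct.Q8_0_H.gguf \
    --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 \
    -p "Explain step by step how to ..."
```
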
The quants employed are all K quants, to avoid the slow processing of IQ quants on CPUs and older GPUs. For this file the layer quants are as follows:
```
Q6_K_S : Q6_K
Q6_K_M : attn_v = q8_0 ffn_d = q8_0
Q6_K_L : attn_v = q8_0 attn_o = q8_0 ffn_d = q8_0

LAYER_TYPES='[
[0 ,"Q8_0" ],[1 ,"Q8_0" ],[2 ,"Q8_0" ],[3 ,"Q8_0" ],[4 ,"Q6_K_L"],[5 ,"Q6_K_L"],
[6 ,"Q6_K_L"],[7 ,"Q6_K_L"],[8 ,"Q6_K_L"],[9 ,"Q6_K_L"],[10,"Q6_K_M"],[11,"Q6_K_M"],
[12,"Q6_K_S"],[13,"Q5_K_M"],[14,"Q5_K_M"],[15,"Q6_K_S"],[16,"Q6_K_M"],[17,"Q6_K_M"],
[18,"Q6_K_L"],[19,"Q6_K_L"],[20,"Q6_K_L"],[21,"Q6_K_L"],[22,"Q6_K_L"],[23,"Q6_K_L"],
[24,"Q8_0" ],[25,"Q8_0" ],[26,"Q8_0" ],[27,"Q8_0" ]
]'
FLAGS="--token-embedding-type Q8_0 --output-tensor-type Q8_0 --layer-types-high"
```
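
For reference, a hypothetical sketch of how these settings might be consumed. Upstream llama-quantize does not accept per-layer type lists or `--layer-types-high`; this assumes a build carrying the hybrid layer-quant patch from the discussion linked at the end of this card, and the exact option name that takes LAYER_TYPES is an assumption of the sketch:

```
# Hypothetical sketch only: --layer-types / --layer-types-high come from the
# hybrid layer-quant patch, not upstream llama.cpp. File names are placeholders.
./llama-quantize $FLAGS --layer-types "$LAYER_TYPES" \
    Qwen3-VL-2B-Instruct.BF16.gguf Qwen3-VL-2B-Instruct.Q8_0_H.gguf Q8_0
```
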
Comparison:

| Quant  | Size (B) | PPL  | Comment                                        |
|--------|----------|------|------------------------------------------------|
| Q8_0   | 1.8e9    | 11.9 | Q8_0 with default embedding and output         |
| Q8_0_H | 1.7e9    | 11.9 | hybrid quant with Q8_0 embedding, Q8_0 output  |

Usage:

Qwen3-VL-2B-Instruct is a vision-capable model. Used together with its multimodal projector layers, it can process image and text inputs and generate text outputs. The mmproj file is made available in this repository. To test vision mode, follow the docs in the mtmd README in the tools directory of the llama.cpp source tree: https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/README.md

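A minimal vision run, assuming both files from the download table below are in the working directory, looks something like this (the image name is a placeholder):

```
# Send one image plus a text prompt through the model and its projector.
./llama-mtmd-cli -m Qwen3-VL-2B-Instruct.Q8_0_H.gguf \
    --mmproj Qwen3-VL-2B-Instruct.mmproj.gguf \
    --image test.jpg \
    -p "Describe this image."
```
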
On a 4070, the non-coding generation rate is about 185 t/s.

The minimum llama.cpp version needed to run the Qwen3-VL series is build 6915, with build 6936 or later recommended.

Benchmarks:

A full set of vision benchmarks for the model is given here: https://huggingface.co/spaces/steampunque/benchlm

## Download the files from below:
| Link | Type | Size/e9 B | Notes |
|------|------|-----------|-------|
| [Qwen3-VL-2B-Instruct.Q8_0_H.gguf](https://huggingface.co/steampunque/Qwen3-VL-2B-Instruct-Hybrid-GGUF/resolve/main/Qwen3-VL-2B-Instruct.Q8_0_H.gguf) | Q8_0_H | 1.7e9 B | ~Q8_0 size |
| [Qwen3-VL-2B-Instruct.mmproj.gguf](https://huggingface.co/steampunque/Qwen3-VL-2B-Instruct-Hybrid-GGUF/resolve/main/Qwen3-VL-2B-Instruct.mmproj.gguf) | F16 | 0.82e9 B | multimodal projector |

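For example, both files can be fetched directly with the repository URLs above:

```
# Download the quantized model and its multimodal projector.
curl -L -O https://huggingface.co/steampunque/Qwen3-VL-2B-Instruct-Hybrid-GGUF/resolve/main/Qwen3-VL-2B-Instruct.Q8_0_H.gguf
curl -L -O https://huggingface.co/steampunque/Qwen3-VL-2B-Instruct-Hybrid-GGUF/resolve/main/Qwen3-VL-2B-Instruct.mmproj.gguf
```
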
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040