---
license: apache-2.0
base_model: Qwen/Qwen3-VL-2B-Instruct
base_model_relation: quantized
tags:
- Qwen3 VL Instruct 2B
- GGUF
- quantized
- 8-bit
---

## Llama.cpp hybrid layer quantization of Qwen3-VL-2B-Instruct by Qwen

Original model: https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct

The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. This quant was optimized for high performance across a set of test prompts at ~Q8_0 size. The model predominantly exhibits repetition failures (rep fails) across a set of curated test prompts: it falls into infinite repeat loops on most prompts when using greedy sampling. Extensive testing showed there is no way to correct this problem by adjusting the layer quants; the problem is baked into the model by the training process. The model is still usable with simple vision prompts but will often rep fail when asked to solve a prompt with step-by-step reasoning under greedy sampling. The VL 32B Instruct model does not exhibit this failure mode.

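Since the failure appears specifically under greedy sampling, switching to ordinary stochastic sampling is the natural workaround to try. The llama-cli invocation below is an untested sketch of that idea, not a verified fix, and the sampling values are illustrative only:

```
# Untested mitigation sketch: replace greedy decoding with temperature
# sampling plus a mild repeat penalty to break out of repeat loops.
./llama-cli -m Qwen3-VL-2B-Instruct.Q8_0_H.gguf \
    --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 \
    -p "Explain step by step how to ..."
```
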
The quants employed are all K quants, to avoid the slow processing of IQ quants on CPUs and older GPUs. For this file the layer quants are as follows:
```
Q6_K_S : Q6_K
Q6_K_M : attn_v = q8_0 ffn_d = q8_0
Q6_K_L : attn_v = q8_0 attn_o = q8_0 ffn_d = q8_0

LAYER_TYPES='[
[0 ,"Q8_0" ],[1 ,"Q8_0" ],[2 ,"Q8_0" ],[3 ,"Q8_0" ],[4 ,"Q6_K_L"],[5 ,"Q6_K_L"],
[6 ,"Q6_K_L"],[7 ,"Q6_K_L"],[8 ,"Q6_K_L"],[9 ,"Q6_K_L"],[10,"Q6_K_M"],[11,"Q6_K_M"],
[12,"Q6_K_S"],[13,"Q5_K_M"],[14,"Q5_K_M"],[15,"Q6_K_S"],[16,"Q6_K_M"],[17,"Q6_K_M"],
[18,"Q6_K_L"],[19,"Q6_K_L"],[20,"Q6_K_L"],[21,"Q6_K_L"],[22,"Q6_K_L"],[23,"Q6_K_L"],
[24,"Q8_0" ],[25,"Q8_0" ],[26,"Q8_0" ],[27,"Q8_0" ]
]'
FLAGS="--token-embedding-type Q8_0 --output-tensor-type Q8_0 --layer-types-high"
```
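
For reference, a hypothetical sketch of how these settings might be consumed. Upstream llama-quantize does not accept per-layer type lists or `--layer-types-high`; this assumes a build carrying the hybrid layer-quant patch from the discussion linked at the end of this card, and the exact option name that takes LAYER_TYPES is an assumption of the sketch:

```
# Hypothetical sketch only: --layer-types / --layer-types-high come from the
# hybrid layer-quant patch, not upstream llama.cpp. File names are placeholders.
./llama-quantize $FLAGS --layer-types "$LAYER_TYPES" \
    Qwen3-VL-2B-Instruct.BF16.gguf Qwen3-VL-2B-Instruct.Q8_0_H.gguf Q8_0
```
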
Comparison:

| Quant  | Size (B) | PPL  | Comment                                        |
|--------|----------|------|------------------------------------------------|
| Q8_0   | 1.8e9    | 11.9 | Q8_0 with default embedding and output         |
| Q8_0_H | 1.7e9    | 11.9 | hybrid quant with Q8_0 embedding, Q8_0 output  |

Usage:

Qwen3-VL-2B-Instruct is a vision-capable model. Used together with its multimodal projector layers, it can process image and text inputs and generate text outputs. The mmproj file is made available in this repository. To test vision mode, follow the docs in the mtmd README in the tools directory of the llama.cpp source tree: https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/README.md

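A minimal vision run, assuming both files from the download table below are in the working directory, looks something like this (the image name is a placeholder):

```
# Send one image plus a text prompt through the model and its projector.
./llama-mtmd-cli -m Qwen3-VL-2B-Instruct.Q8_0_H.gguf \
    --mmproj Qwen3-VL-2B-Instruct.mmproj.gguf \
    --image test.jpg \
    -p "Describe this image."
```
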
On a 4070, the non-coding generation rate is about 185 t/s.

The minimum llama.cpp version needed to run the Qwen3-VL series is build 6915, with build 6936 or later recommended.

Benchmarks:

A full set of vision benchmarks for the model is given here: https://huggingface.co/spaces/steampunque/benchlm

## Download the files from below:
| Link | Type | Size/e9 B | Notes |
|------|------|-----------|-------|
| [Qwen3-VL-2B-Instruct.Q8_0_H.gguf](https://huggingface.co/steampunque/Qwen3-VL-2B-Instruct-Hybrid-GGUF/resolve/main/Qwen3-VL-2B-Instruct.Q8_0_H.gguf) | Q8_0_H | 1.7e9 B | ~Q8_0 size |
| [Qwen3-VL-2B-Instruct.mmproj.gguf](https://huggingface.co/steampunque/Qwen3-VL-2B-Instruct-Hybrid-GGUF/resolve/main/Qwen3-VL-2B-Instruct.mmproj.gguf) | F16 | 0.82e9 B | multimodal projector |

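For example, both files can be fetched directly with the repository URLs above:

```
# Download the quantized model and its multimodal projector.
curl -L -O https://huggingface.co/steampunque/Qwen3-VL-2B-Instruct-Hybrid-GGUF/resolve/main/Qwen3-VL-2B-Instruct.Q8_0_H.gguf
curl -L -O https://huggingface.co/steampunque/Qwen3-VL-2B-Instruct-Hybrid-GGUF/resolve/main/Qwen3-VL-2B-Instruct.mmproj.gguf
```
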
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040