---
license: gemma
---

Experimental GGUF quants for https://huggingface.co/google/gemma-2-9b-it, made according to LCPP PR https://github.com/ggerganov/llama.cpp/pull/8836 (based on b3529, and on b3565 for the newer ones).

These experimental quant strategies, which revisit Ikawrakow's work, show a decrease in perplexity, including per BPW (from 10%+ for the lowest quants to a fraction of a percent for the highest ones). This is significant enough to encourage you folks to test them, and to provide feedback if pertinent.
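
To make that claim concrete, here is the comparison applied to two rows of the data block below; the figures are copied from the table, and the script is plain arithmetic, nothing more:

```python
# Relative perplexity change of a PR quant vs. its master counterpart,
# using figures copied from the data block below.
def ppl_change(master: float, pr: float) -> float:
    """Percent change in perplexity; negative means improvement."""
    return (pr - master) / master * 100.0

# IQ2_XXS: master 15.2572 @ 2.38 BPW vs. PR4 12.4244 @ 2.41 BPW
print(f"IQ2_XXS: {ppl_change(15.2572, 12.4244):+.1f}%")  # about -18.6%

# IQ3_M: master 7.9263 @ 3.76 BPW vs. PR 7.8704 @ 3.73 BPW
print(f"IQ3_M:   {ppl_change(7.9263, 7.8704):+.1f}%")    # about -0.7%
```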

The iMatrix I use is based on Group Merged V3, enriched with a bit of French, Serbian, and Croatian.
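
For reference, here is a minimal sketch of how such an iMatrix can be produced with llama.cpp's llama-imatrix tool, assuming a local llama.cpp build; all file names are placeholders, not the actual files used for these quants:

```python
import subprocess

# Minimal sketch, assuming a local llama.cpp build; file names below are
# placeholders (the enriched calibration text itself is not published here).
subprocess.run([
    "./llama-imatrix",
    "-m", "gemma-2-9b-it-F16.gguf",           # unquantized source model
    "-f", "calibration-groups-merged-v3.txt", # calibration text (placeholder)
    "-o", "gemma-2-9b-it.imatrix",            # resulting importance matrix
    "-ngl", "99",                             # offload layers to GPU if available
], check=True)
```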

As usual, the names of the quants are a bit pompous: they are numbered after the type of tensor quant mainly used as a base for the FFN.
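
As a rough illustration of what "a base quant for the FFN plus per-tensor overrides" means in llama-quantize terms: the PR's actual recipes are baked into llama-quantize itself, so the explicit overrides below are illustrative only, not the PR's exact mix:

```python
import subprocess

# Minimal sketch, assuming a local llama.cpp build. The override values
# here only illustrate the idea of deviating from a base type per tensor;
# the real recipes are defined in llama.cpp PR #8836.
subprocess.run([
    "./llama-quantize",
    "--imatrix", "gemma-2-9b-it.imatrix",  # importance matrix from above
    "--token-embedding-type", "iq4_xs",    # example override (cf. IQ3_XXL note below)
    "--output-tensor-type", "q5_k",        # illustrative override
    "gemma-2-9b-it-F16.gguf",              # input model
    "gemma-2-9b-it-IQ2_M.gguf",            # output quant
    "IQ2_M",                               # base quant type
], check=True)
```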

ARC and PPL-512 data (get the latest figures in the main post of the PR thread):

```
IQ1_XS - Unusable on <30B models
PR
1.94 GB (1.93 BPW)
1.81 GiB (1.93 BPW)
PPL over 564 chunks for n_ctx=512 = 40.0024 +/- 0.27710

PR2
1.98 GB (1.97 BPW)
1.84 GiB (1.97 BPW)
PPL over 564 chunks for n_ctx=512 = 33.5198 +/- 0.24187


IQ1_S - Unusable on <30B models
Master
2.01 GB (2.00 BPW)
1.87 GiB (2.00 BPW)
PPL over 564 chunks for n_ctx=512 = 61.2817 +/- 0.41707

PR
2.05 GB (2.04 BPW)
1.91 GiB (2.04 BPW)
PPL over 564 chunks for n_ctx=512 = 25.2524 +/- 0.17651

PR2
2.06 GB (2.05 BPW)
1.91 GiB (2.05 BPW)
PPL over 564 chunks for n_ctx=512 = 24.2661 +/- 0.16923


IQ1_M
Master
2.15 GB (2.15 BPW)
2.01 GiB (2.15 BPW)
PPL over 564 chunks for n_ctx=512 = 26.3761 +/- 0.18200

PR
2.14 GB (2.13 BPW)
1.99 GiB (2.13 BPW)
PPL over 564 chunks for n_ctx=512 = 20.0588 +/- 0.14001

PR2
2.15 GB (2.14 BPW)
2.00 GiB (2.14 BPW)
PPL over 564 chunks for n_ctx=512 = 18.8721 +/- 0.13233

PR3
2.16 GB (2.15 BPW)
2.01 GiB (2.15 BPW)
PPL over 564 chunks for n_ctx=512 = 18.7469 +/- 0.13140


IQ1_XL
PR
2.21 GB (2.21 BPW)
2.06 GiB (2.21 BPW)
PPL over 564 chunks for n_ctx=512 = 18.5500 +/- 0.12753

PR2
2.23 GB (2.22 BPW)
2.08 GiB (2.22 BPW)
PPL over 564 chunks for n_ctx=512 = 17.4537 +/- 0.11995

PR3
2.25 GB (2.25 BPW)
2.10 GiB (2.25 BPW)
PPL over 564 chunks for n_ctx=512 = 17.3669 +/- 0.11928

PR4
2.28 GB (2.27 BPW)
2.12 GiB (2.27 BPW)
PPL over 564 chunks for n_ctx=512 = 15.5944 +/- 0.10615

PR5
2.28 GB (2.27 BPW)
2.12 GiB (2.27 BPW)
PPL over 564 chunks for n_ctx=512 = 15.4713 +/- 0.10524


IQ2_XXS
Master
2.39 GB (2.38 BPW)
2.23 GiB (2.38 BPW)
PPL over 564 chunks for n_ctx=512 = 15.2572 +/- 0.10267

PR
2.38 GB (2.37 BPW)
2.22 GiB (2.37 BPW)
PPL over 564 chunks for n_ctx=512 = 13.8073 +/- 0.09290

PR2
2.40 GB (2.39 BPW)
2.23 GiB (2.39 BPW)
PPL over 564 chunks for n_ctx=512 = 12.9671 +/- 0.08687

PR3
2.42 GB (2.41 BPW)
2.25 GiB (2.41 BPW)
PPL over 564 chunks for n_ctx=512 = 12.5074 +/- 0.08360

PR4
2.42 GB (2.41 BPW)
2.26 GiB (2.41 BPW)
PPL over 564 chunks for n_ctx=512 = 12.4244 +/- 0.08294


IQ2_XS
Master
2.60 GB (2.59 BPW)
2.42 GiB (2.59 BPW)
PPL over 564 chunks for n_ctx=512 = 11.7483 +/- 0.07776

PR
2.52 GB (2.51 BPW)
2.35 GiB (2.51 BPW)
PPL over 564 chunks for n_ctx=512 = 11.6639 +/- 0.07805

PR2
2.53 GB (2.52 BPW)
2.36 GiB (2.52 BPW)
PPL over 564 chunks for n_ctx=512 = 11.5685 +/- 0.07742

PR3
2.58 GB (2.57 BPW)
2.40 GiB (2.57 BPW)
PPL over 564 chunks for n_ctx=512 = 11.3031 +/- 0.07514

PR4
2.59 GB (2.58 BPW)
2.42 GiB (2.58 BPW)
PPL over 564 chunks for n_ctx=512 = 10.9291 +/- 0.07270

PR5
2.60 GB (2.59 BPW)
2.42 GiB (2.59 BPW)
PPL over 564 chunks for n_ctx=512 = 10.8794 +/- 0.07229


IQ2_S
Master
2.75 GB (2.74 BPW)
2.56 GiB (2.74 BPW)
PPL over 564 chunks for n_ctx=512 = 10.5180 +/- 0.06976

PR (fail)
2.71 GB (2.70 BPW)
2.52 GiB (2.70 BPW)
PPL over 564 chunks for n_ctx=512 = 10.7010 +/- 0.07027

PR2
2.75 GB (2.74 BPW)
2.56 GiB (2.74 BPW)
PPL over 564 chunks for n_ctx=512 = 10.3728 +/- 0.06806


IQ2_M
Master
2.94 GB (2.93 BPW)
2.74 GiB (2.93 BPW)
PPL over 564 chunks for n_ctx=512 = 9.5935 +/- 0.06228

PR
2.93 GB (2.92 BPW)
2.73 GiB (2.92 BPW)
PPL over 564 chunks for n_ctx=512 = 9.4125 +/- 0.06039


IQ2_XL
PR
2.99 GB (2.98 BPW)
2.78 GiB (2.98 BPW)
PPL over 564 chunks for n_ctx=512 = 9.3122 +/- 0.05973

PR2
3.11 GB (3.10 BPW)
2.90 GiB (3.10 BPW)
PPL over 564 chunks for n_ctx=512 = 9.0378 +/- 0.05764

PR3
3.14 GB (3.13 BPW)
2.93 GiB (3.13 BPW)
PPL over 564 chunks for n_ctx=512 = 8.8604 +/- 0.05620


IQ3_XXS
Master
Size : 3.04 GiB (3.25 BPW)
PPL 512 wikitext : 8.4985 +/- 0.05402

PR (good)
Size : 3.11 GiB (3.32 BPW)
PPL 512 wikitext : 8.3274 +/- 0.05334

PR2 (so so)
Size : 3.08 GiB (3.29 BPW)
PPL 512 wikitext : 8.3906 +/- 0.05329

Let's keep the first PR.


IQ3_XS
Master
Size : 3.27 GiB (3.50 BPW)
PPL 512 wikitext : 8.2019 +/- 0.05167

PR (ok)
Size : 3.24 GiB (3.47 BPW)
PPL 512 wikitext : 8.1762 +/- 0.05176


IQ3_S
Master
Size : 3.42 GiB (3.66 BPW)
PPL 512 wikitext : 7.9894 +/- 0.05020

PR (good)
Size : 3.41 GiB (3.64 BPW)
PPL 512 wikitext : 7.9067 +/- 0.05022


IQ3_M
Master
Size : 3.52 GiB (3.76 BPW)
PPL 512 wikitext : 7.9263 +/- 0.04943

PR (good)
Size : 3.49 GiB (3.73 BPW)
PPL 512 wikitext : 7.8704 +/- 0.04951


IQ3_XL
PR (good)
Size : 3.71 GiB (3.97 BPW)
PPL 512 wikitext : 7.7225 +/- 0.04946


IQ3_XXL
PR (good; the benefit seems meager, but the token embeddings pushed from IQ3_S to IQ4_XS explain +0.05 BPW of it, and this tensor doesn't run in VRAM but in RAM)
Size : 3.83 GiB (4.09 BPW)
PPL 512 wikitext : 7.6720 +/- 0.04892


IQ3_XXXL
PR (good)
Size : 3.97 GiB (4.24 BPW)
PPL 512 wikitext : 7.5920 +/- 0.04839


IQ4_XS
Master
Size : 4.13 GiB (4.42 BPW)
Arc-C 299 49.16387960
Arc-E 570 72.10526316
PPL 512 wikitext : 7.5226 +/- 0.04820


IQ4_XSR
PR (good)
Size : 4.16 GiB (4.45 BPW)
Arc-C 299
Arc-E 570
PPL 512 wikitext : 7.5072 +/- 0.04814


FP16
MASTER : Gemma 2 9b It F16
Size : 14.96 GiB (16.00 BPW)
Arc-C 299 49.49832776
Arc-E 570 73.85964912
PPL 512 wikitext : 7.3224 +/- 0.04674
```
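
The PPL figures above come from llama.cpp's llama-perplexity tool at n_ctx=512. A minimal reproduction sketch, assuming a local llama.cpp build and a local copy of the standard wikitext-2 test file; paths are placeholders:

```python
import subprocess

# Minimal sketch, assuming a local llama.cpp build and a local copy of
# the wikitext-2-raw test set; paths below are placeholders.
subprocess.run([
    "./llama-perplexity",
    "-m", "gemma-2-9b-it-IQ2_M.gguf",      # quant to evaluate
    "-f", "wikitext-2-raw/wiki.test.raw",  # evaluation text
    "-c", "512",                           # n_ctx=512, as in the data above
    "-ngl", "99",                          # offload layers to GPU if available
], check=True)
```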