---
license: llama3.1
---

Experimental .GGUF quants for https://huggingface.co/google/gemma-2-9b-it, made according to this llama.cpp PR (based on b3529, and now on b3565 for the newer ones): https://github.com/ggerganov/llama.cpp/pull/8836

These experimental quant strategies, which revisit Ikawrakow's work, show a slight decrease in perplexity, including per BPW (from 10%+ for the lowest quants down to a fraction of a percent for the highest ones). This is significant enough to encourage you folks to test them and provide feedback where pertinent.

The iMatrix I use is based on Group Merged V3, enriched with a bit of French, Serbian, and Croatian. (Example commands for reproducing the iMatrix, the quants, and the PPL-512 figures are sketched after the data block below.)

ARC and PPL-512 data (the latest figures are on the main post of the PR thread):

```
IQ1_XS
  PR      1.94 GB (1.93 BPW) / 1.81 GiB (1.93 BPW)
          PPL over 564 chunks for n_ctx=512 = 40.0024 +/- 0.27710

IQ1_S
  Master  2.01 GB (2.00 BPW) / 1.87 GiB (2.00 BPW)
          PPL over 564 chunks for n_ctx=512 = 61.2817 +/- 0.41707
  PR      2.05 GB (2.04 BPW) / 1.91 GiB (2.04 BPW)
          PPL over 564 chunks for n_ctx=512 = 25.2524 +/- 0.17651

IQ1_M
  Master  2.15 GB (2.15 BPW) / 2.01 GiB (2.15 BPW)
          PPL over 564 chunks for n_ctx=512 = 26.3761 +/- 0.18200
  PR      2.14 GB (2.13 BPW) / 1.99 GiB (2.13 BPW)
          PPL over 564 chunks for n_ctx=512 = 20.0588 +/- 0.14001

IQ1_XL
  PR      2.21 GB (2.21 BPW) / 2.06 GiB (2.21 BPW)
          PPL over 564 chunks for n_ctx=512 = 18.5500 +/- 0.12753

IQ2_XXS
  Master  2.39 GB (2.38 BPW) / 2.23 GiB (2.38 BPW)
          PPL over 564 chunks for n_ctx=512 = 15.2572 +/- 0.10267
  PR      2.38 GB (2.37 BPW) / 2.22 GiB (2.37 BPW)
          PPL over 564 chunks for n_ctx=512 = 13.8073 +/- 0.09290

IQ2_XS
  Master  2.60 GB (2.59 BPW) / 2.42 GiB (2.59 BPW)
          PPL over 564 chunks for n_ctx=512 = 11.7483 +/- 0.07776
  PR      2.52 GB (2.51 BPW) / 2.35 GiB (2.51 BPW)
          PPL over 564 chunks for n_ctx=512 = 11.6639 +/- 0.07805

IQ2_S
  Master  2.75 GB (2.74 BPW) / 2.56 GiB (2.74 BPW)
          PPL over 564 chunks for n_ctx=512 = 10.5180 +/- 0.06976
  PR      2.71 GB (2.70 BPW) / 2.52 GiB (2.70 BPW)
          PPL over 564 chunks for n_ctx=512 = 10.7010 +/- 0.07027

IQ2_M
  Master  2.94 GB (2.93 BPW) / 2.74 GiB (2.93 BPW)
          PPL over 564 chunks for n_ctx=512 = 9.5935 +/- 0.06228
  PR      2.93 GB (2.92 BPW) / 2.73 GiB (2.92 BPW)
          PPL over 564 chunks for n_ctx=512 = 9.4125 +/- 0.06039

IQ2_XL
  PR      2.99 GB (2.98 BPW) / 2.78 GiB (2.98 BPW)
          PPL over 564 chunks for n_ctx=512 = 9.3122 +/- 0.05973

IQ3_XXS
  Master       Size : 3.04 GiB (3.25 BPW)
               PPL 512 wikitext : 8.4985 +/- 0.05402
  PR (good)    Size : 3.11 GiB (3.32 BPW)
               PPL 512 wikitext : 8.3274 +/- 0.05334
  PR2 (so-so)  Size : 3.30 GB / 3.08 GiB (3.29 BPW)
               PPL 512 wikitext : 8.3906 +/- 0.05329
  -> Let's keep the first PR.

IQ3_XS
  Master       Size : 3.27 GiB (3.50 BPW)
               PPL 512 wikitext : 8.2019 +/- 0.05167
  PR (ok)      Size : 3.24 GiB (3.47 BPW)
               PPL 512 wikitext : 8.1762 +/- 0.05176

IQ3_S
  Master       Size : 3.42 GiB (3.66 BPW)
               PPL 512 wikitext : 7.9894 +/- 0.05020
  PR (good)    Size : 3.41 GiB (3.64 BPW)
               PPL 512 wikitext : 7.9067 +/- 0.05022

IQ3_M
  Master       Size : 3.52 GiB (3.76 BPW)
               PPL 512 wikitext : 7.9263 +/- 0.04943
  PR (good)    Size : 3.49 GiB (3.73 BPW)
               PPL 512 wikitext : 7.8704 +/- 0.04951

IQ3_XL
  PR (good)    Size : 3.71 GiB (3.97 BPW)
               PPL 512 wikitext : 7.7225 +/- 0.04946

IQ3_XXL
  PR (good; the benefit seems meager, but the token embeddings pushed from IQ3_S to IQ4_XS
  account for +0.05 BPW of it, and that tensor runs in RAM rather than VRAM)
               Size : 3.83 GiB (4.09 BPW)
               PPL 512 wikitext : 7.6720 +/- 0.04892

IQ3_XXL
  PR (good)    Size : 3.97 GiB (4.24 BPW)
               PPL 512 wikitext : 7.5920 +/- 0.04839

IQ4_XS
  Master       Size : 4.13 GiB (4.42 BPW)
               Arc-C 299 : 49.16387960
               Arc-E 570 : 72.10526316
               PPL 512 wikitext : 7.5226 +/- 0.04820

IQ4_XSR
  PR (good)    Size : 4.16 GiB (4.45 BPW)
               Arc-C 299
               Arc-E 570
               PPL 512 wikitext : 7.5072 +/- 0.04814

FP16 Master (Gemma 2 9b It F16)
               Size : 14.96 GiB (16.00 BPW)
               Arc-C 299 : 49.49832776
               Arc-E 570 : 73.85964912
               PPL 512 wikitext : 7.3224 +/- 0.04674
```
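For context, here is a rough sketch of the kind of llama.cpp workflow used to produce such iMatrix quants. It assumes the PR branch is checked out (the revised quant strategies only exist there); the model and calibration file names below are placeholders, not the exact files used for this repo.

```bash
# Build llama.cpp with the quant-strategy changes from PR #8836.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git fetch origin pull/8836/head && git checkout FETCH_HEAD
make -j

# 1) Compute an importance matrix from a calibration text.
#    "calibration.txt" stands in for Group Merged V3 enriched with
#    French/Serbian/Croatian samples; the file name is illustrative.
./llama-imatrix -m gemma-2-9b-it-F16.gguf -f calibration.txt -o gemma-2-9b-it.imatrix

# 2) Quantize the F16 GGUF with that imatrix (IQ2_M taken as an example type).
./llama-quantize --imatrix gemma-2-9b-it.imatrix \
    gemma-2-9b-it-F16.gguf gemma-2-9b-it-IQ2_M.gguf IQ2_M
```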
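The PPL-512 figures are standard llama.cpp perplexity runs over the wikitext-2 test set with a 512-token context. A minimal sketch, with the file names assumed and -ngl only needed for GPU offload:

```bash
# Perplexity with 512-token chunks over the wikitext-2-raw test set,
# matching the "PPL 512 wikitext" numbers in the table above.
./llama-perplexity -m gemma-2-9b-it-IQ2_M.gguf -f wiki.test.raw -c 512 -ngl 99
```

The Arc-C / Arc-E scores come from the same tool's multiple-choice mode (--multiple-choice) run on ARC-Challenge and ARC-Easy task files (299 and 570 questions respectively); see the perplexity example's README in llama.cpp for the expected file format.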