
Experimental .GGUF quants for https://huggingface.co/google/gemma-2-9b-it, made according to the LCPP (llama.cpp) PR https://github.com/ggerganov/llama.cpp/pull/8836 (based on b_3529, and now on b_3565 for the newer ones).

These experimental quant strategies, which revisit Ikawrakow's work, show a slight decrease in perplexity, including per BPW (from 10%+ for the lowest quants to 0.x% for the highest ones). That is significant enough to encourage you to test them and to provide feedback where pertinent.
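The percentages above can be cross-checked against the tables below. Here is a minimal Python sketch that computes the relative perplexity decrease from MASTER to PR. It uses two pairs copied from this card (IQ1_S MASTER vs PR 1, and IQ4_XS MASTER vs PR CURRENT). The helper name is only for illustration.

```python
def relative_ppl_decrease(ppl_master: float, ppl_pr: float) -> float:
    """Return the PPL decrease from MASTER to PR, in percent of the MASTER value."""
    return (ppl_master - ppl_pr) / ppl_master * 100.0

# "PPL 512 wikitext" values copied from the IQ1_S and IQ4_XS sections of this card.
iq1_s = relative_ppl_decrease(15.9317, 14.1578)   # IQ1_S: MASTER vs PR 1
iq4_xs = relative_ppl_decrease(8.8456, 8.8370)    # IQ4_XS: MASTER vs PR CURRENT

print(f"IQ1_S : {iq1_s:.2f} %")
print(f"IQ4_XS: {iq4_xs:.3f} %")
```

This gives roughly 11.1 % and 0.1 %, in line with the "10%+ to 0.x%" range. Note that this plain ratio ignores the small BPW differences between the two files (e.g. 2.05 vs 2.07 BPW for IQ1_S).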

The iMatrix I use is based on Group Merged V3, enriched with a bit of French, Serbian, and Croatian text.
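Roughly speaking, an importance matrix records how strongly each weight column is activated on a calibration text. The quantizer can then spend its precision where it matters most. The Python sketch below illustrates only that idea, under simplifying assumptions:

- importance is taken as the mean squared activation per input column;
- the error is a plain importance-weighted squared rounding error for one fixed scale.

It is not llama.cpp's actual IQ or K-quant search, and all names in it are hypothetical.

```python
from typing import List

def column_importance(activations: List[List[float]]) -> List[float]:
    """Mean of squared activations per input column (one row per calibration token)."""
    n_rows = len(activations)
    n_cols = len(activations[0])
    return [sum(row[j] ** 2 for row in activations) / n_rows for j in range(n_cols)]

def weighted_quant_error(weights: List[float], importance: List[float],
                         scale: float, qmax: int) -> float:
    """Importance-weighted squared error of symmetric round-to-nearest quantization."""
    err = 0.0
    for w, imp in zip(weights, importance):
        q = max(-qmax, min(qmax, round(w / scale)))  # clamp to the integer grid
        err += imp * (w - q * scale) ** 2
    return err

# Toy calibration activations: 2 tokens x 3 input columns (hypothetical data).
acts = [[1.0, 0.0, 2.0],
        [3.0, 0.0, 2.0]]
imp = column_importance(acts)
row = [0.26, 0.9, -0.74]
err = weighted_quant_error(row, imp, scale=0.25, qmax=3)
print(imp, err)
```

In this toy case the middle weight is clipped badly (0.9 becomes 0.75), but it costs nothing because no calibration token activates that column. This is the kind of trade-off a calibration set steers, which is plausibly why its language mix can matter.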

ARC and PPL-512 data (get the latest data from the main post of the PR thread):
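"PPL 512" below is the perplexity on wikitext with chunks of n_ctx=512 tokens; lower is better. Here is a minimal Python sketch of the arithmetic. It assumes you already have the per-token negative log-likelihoods (in nats) of the scored tokens. Perplexity is the exponential of their mean. The "+/-" term is approximated here with the delta method: PPL times the standard error of the mean NLL. This is an illustration, not a copy of llama.cpp's perplexity tool, which also decides which tokens of each chunk get scored.

```python
import math
from typing import List, Tuple

def perplexity(nll: List[float]) -> Tuple[float, float]:
    """Return (PPL, approximate uncertainty) from per-token negative log-likelihoods.

    Needs at least two values, since the sample variance divides by n - 1.
    """
    n = len(nll)
    mean = sum(nll) / n
    ppl = math.exp(mean)
    # Sample variance of the NLLs, then standard error of their mean.
    var = sum((x - mean) ** 2 for x in nll) / (n - 1)
    stderr = math.sqrt(var / n)
    # Delta method: d exp(m) / dm = exp(m), so the error on PPL is PPL * stderr.
    return ppl, ppl * stderr

ppl, err = perplexity([math.log(8.0), math.log(10.0), math.log(12.0)])
print(f"PPL = {ppl:.4f} +/- {err:.4f}")
```

With these toy values the PPL is the geometric mean of 8, 10 and 12, i.e. the cube root of 960.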

IQ1_XS

PR 1 : Gemma 2 9b It IQ1_XS quant made from BF16
Size : 2.15 GiB (2.00 BPW)
Arc-C 299     42.80936455   
Arc-E 570     68.24561404  
PPL 512 wikitext : 15.1105 +/- 0.11363

PR 2 : Gemma 2 9b It IQ1_XS quant made from BF16
Size : 2.16 GiB (2.01 BPW)
PPL 512 wikitext : 14.9768 +/- 0.11234

PR 3 + IK iMat Fix : Gemma 2 9b It IQ1_XS quant made from BF16
Size : 2.16 GiB (2.01 BPW)
PPL 512 wikitext : 14.9768 +/- 0.11234 (identical to PR 2, so the fix apparently made no difference here)

PR 4
Size : 2.18 GiB (2.02 BPW), 2.34 GB
PPL over 569 chunks for n_ctx=512 : 14.5476 +/- 0.10791
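Sizes are given in GiB (2^30 bytes), sometimes alongside GB (10^9 bytes), and BPW means bits per weight. This assumes BPW is simply the total quantized size in bits divided by the parameter count, here the rounded 9.24B listed on this card. Under that assumption, the figures can be cross-checked in Python:

```python
N_PARAMS = 9.24e9  # rounded parameter count of Gemma 2 9b It, as listed on this card

def gib_to_gb(size_gib: float) -> float:
    """Convert binary gibibytes (2**30 bytes) to decimal gigabytes (10**9 bytes)."""
    return size_gib * 2**30 / 1e9

def bits_per_weight(size_gib: float, n_params: float = N_PARAMS) -> float:
    """Average number of bits stored per model weight."""
    return size_gib * 2**30 * 8 / n_params

# IQ1_XS PR 4: 2.18 GiB, reported as 2.34 GB (2.02 BPW)
print(f"{gib_to_gb(2.18):.2f} GB, {bits_per_weight(2.18):.2f} BPW")
# F16 MASTER: 17.22 GiB, reported as 16.00 BPW
print(f"{bits_per_weight(17.22):.2f} BPW")
```

Expect mismatches of about 0.01 BPW in the last digit, since both the GiB sizes and the parameter count are rounded.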


IQ1_S

MASTER : Gemma 2 9b It IQ1_S, quant made from BF16
Size : 2.21 GiB (2.05 BPW)
Arc-C 299     42.47491639
Arc-E 570     66.84210526
PPL 512 wikitext : 15.9317 +/- 0.11979

PR 1 : Gemma 2 9b It IQ1_S quant made from BF16
Size : 2.23 GiB (2.07 BPW)
Arc-C 299     43.14381271
Arc-E 570     68.42105263
PPL 512 wikitext : 14.1578 +/- 0.10530

PR 2 : Gemma 2 9b It IQ1_S quant made from BF16
Size : 2.24 GiB (2.08 BPW)
PPL 512 wikitext : 14.0207 +/- 0.10399

PR 3
Size : 2.26 GiB (2.10 BPW), 2.43 GB
PPL over 569 chunks for n_ctx=512 : 13.6266 +/- 0.10131


IQ1_M

MASTER : Gemma 2 9b It IQ1_M, quant made from BF16
Size : 2.37 GiB (2.20 BPW)
Arc-C 299     45.81939799  
Arc-E 570     73.85964912
PPL 512 wikitext : 13.7215 +/- 0.10231

PR 1 : Gemma 2 9b It IQ1_M quant made from BF16
Size : 2.36 GiB (2.19 BPW)
Arc-C 299     45.81939799
Arc-E 570     74.56140351
PPL 512 wikitext : 12.6773 +/- 0.09336

PR 2
Size : 2.37 GiB (2.21 BPW), 2.55 GB
PPL over 569 chunks for n_ctx=512 : 12.4616 +/- 0.09246


IQ1_XL

PR 1 : Gemma 2 9b It IQ1_XL quant made from BF16
Size : 2.48 GiB (2.30 BPW)
Arc-C 299     47.49163880  
Arc-E 570     73.33333333
PPL 512 wikitext : 11.5001 +/- 0.08487

PR 2 : Gemma 2 9b It IQ1_XL quant made from BF16
Size : 2.47 GiB (2.29 BPW)
PPL 512 wikitext : 11.4824 +/- 0.08451

PR 3
Size : 2.51 GiB (2.33 BPW), 2.69 GB
PPL over 569 chunks for n_ctx=512 : 11.1759 +/- 0.08205


IQ2_XXS

MASTER : Gemma 2 9b It IQ2_XXS, quant made from BF16
Size : 2.63 GiB (2.44 BPW)
Arc-C 299     48.16053512   
Arc-E 570     73.15789474   
PPL 512 wikitext : 11.2527 +/- 0.08307

PR 1 : Gemma 2 9b It IQ2_XXS, quant made from BF16
Size : 2.73 GiB (2.54 BPW)
Arc-C 299     48.82943144
Arc-E 570     74.56140351
PPL 512 wikitext : 10.8439 +/- 0.08026

PR 2 : Gemma 2 9b It IQ2_XXS, quant made from BF16
Size : 2.72 GiB (2.53 BPW)
PPL 512 wikitext : 10.8173 +/- 0.07986

PR 3 : Gemma 2 9b It IQ2_XXS, quant made from BF16
Size : 2.62 GiB (2.43 BPW)
PPL 512 wikitext : 10.8388 +/- 0.08010

PR 4
Size : 2.67 GiB (2.48 BPW), 2.86 GB
PPL over 569 chunks for n_ctx=512 : 10.6688 +/- 0.07861


IQ2_XS

MASTER : Gemma 2 9b It IQ2_XS, quant made from BF16
Size : 2.85 GiB (2.65 BPW)
Arc-C 299     49.49832776
Arc-E 570     78.24561404  
PPL 512 wikitext : 10.5698 +/- 0.07803

PR Init : Gemma 2 9b It IQ2_XS, quant made from BF16
Size : 2.91 GiB (2.70 BPW)
Arc-C 299     49.16387960
Arc-E 570     78.59649123
PPL 512 wikitext : 10.3607 +/- 0.07660

PR 2 : Gemma 2 9b It IQ2_XS, quant made from BF16
Size : 2.77 GiB (2.58 BPW)
PPL 512 wikitext : 10.3922 +/- 0.07672

PR 3
Size : 2.87 GiB (2.67 BPW), 3.09 GB
PPL over 569 chunks for n_ctx=512 : 10.0753 +/- 0.07438
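One way to read "including per bpw" is to check whether a PR quant is both no larger and lower in perplexity than the MASTER quant of the same name. In other words, does it dominate MASTER on the BPW/PPL trade-off? Here is a small Python sketch using the IQ2_XS "PPL 512 wikitext" figures above. The tuple layout and function name are only for illustration.

```python
from typing import NamedTuple

class QuantResult(NamedTuple):
    name: str
    bpw: float   # bits per weight
    ppl: float   # PPL 512 wikitext (lower is better)

def dominates(a: QuantResult, b: QuantResult) -> bool:
    """True if `a` is no larger than `b` and strictly better in perplexity."""
    return a.bpw <= b.bpw and a.ppl < b.ppl

master = QuantResult("IQ2_XS MASTER", 2.65, 10.5698)
candidates = [
    QuantResult("IQ2_XS PR Init", 2.70, 10.3607),
    QuantResult("IQ2_XS PR 2", 2.58, 10.3922),
]
for c in candidates:
    print(f"{c.name}: dominates MASTER = {dominates(c, master)}")
```

PR 2 dominates MASTER: it is both smaller and better. PR Init is better but slightly larger, so judging it fairly needs a per-BPW view, for instance against the neighbouring quant types.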


IQ2_S

MASTER : Gemma 2 9b It IQ2_S (with iMatrix, attn_output and attn_v in IQ3_S), quant made from BF16
Size : 2.99 GiB (2.77 BPW)
Arc-C 299     52.84280936
Arc-E 570     77.54385965
PPL 512 wikitext : 10.3868 +/- 0.07787

PR Init : Gemma 2 9b It IQ2_S (with iMatrix, attn_output in IQ3_XXS, and attn_v in Q4_K), quant made from BF16
Size : 3.00 GiB (2.79 BPW)
Arc-C 299     49.83277592
Arc-E 570     77.71929825
PPL 512 wikitext : 10.1303 +/- 0.07486

PR 2 : Gemma 2 9b It IQ2_S, quant made from BF16
Size : 3.00 GiB (2.79 BPW)
Arc-C 299     52.17391304
Arc-E 570     77.89473684
PPL 512 wikitext : 10.1071 +/- 0.07450

PR 3
Size : 2.99 GiB (2.78 BPW), 3.21 GB
PPL over 569 chunks for n_ctx=512 : 9.8175 +/- 0.07239


IQ2_M

MASTER : Gemma 2 9b It IQ2_M (with iMatrix, attn_output and attn_v in IQ3_S), quant made from BF16
Size : 3.19 GiB (2.97 BPW)
Arc-C 299     56.52173913
Arc-E 570     77.01754386
PPL 512 wikitext : 9.8154 +/- 0.07324

PR Init : Gemma 2 9b It IQ2_M (with iMatrix, attn_output in IQ3_XXS, and attn_v in Q4_K), quant made from BF16
Size : 3.20 GiB (2.98 BPW)
Arc-C 299     54.18060201
Arc-E 570     78.07017544
PPL 512 wikitext :  9.5734 +/- 0.07040

PR 2 : Gemma 2 9b It IQ2_M, quant made from BF16
Size : 3.29 GiB (3.06 BPW)
Arc-C 299     55.85284281
Arc-E 570     78.07017544
PPL 512 wikitext : 9.4128 +/- 0.06881

PR 3
Size : 3.19 GiB (2.96 BPW), 3.42 GB
PPL over 569 chunks for n_ctx=512 : 9.4207 +/- 0.06894


IQ2_XL

PR Init : Gemma 2 9b It IQ2_XL, quant made from BF16
Size : 3.41 GiB (3.17 BPW)
Arc-C 299     56.18729097
Arc-E 570     78.07017544
PPL 512 wikitext : 9.3283 +/- 0.06820

PR 2
Size : 3.38 GiB (3.14 BPW), 3.63 GB
PPL over 569 chunks for n_ctx=512 : 9.2667 +/- 0.06814


Q2_K_L

PR CURRENT : Gemma 2 9b It Q2_K_L, quant made from BF16
Size : 3.70 GiB (3.44 BPW)
Arc-C 299     58.19397993
Arc-E 570     79.29824561
PPL 512 wikitext : around 9.25


IQ3_XXS

MASTER : Gemma 2 9b It IQ3_XXS (with iMatrix, attn_k in IQ2_S, and attn_v in IQ3_XXS), quant made from BF16
Size : 3.53 GiB (3.28 BPW)
Arc-C 299 56.52173913
Arc-E 570 79.12280702
PPL 512 wikitext : 9.4116 +/- 0.06982

PR CURRENT : Gemma 2 9b It IQ3_XXS (with iMatrix, attn_k in IQ3_XXS, and attn_v in Q4_K), quant made from BF16
Size : 3.60 GiB (3.35 BPW)
Arc-C 299 56.18729097
Arc-E 570 78.77192982
PPL 512 wikitext : 9.2026 +/- 0.06781


IQ3_XS

MASTER : Gemma 2 9b It IQ3_XS (with iMatrix), quant made from BF16
Size : 3.85 GiB (3.58 BPW)
Arc-C 299     58.86287625
Arc-E 570     78.94736842
PPL 512 wikitext : 9.2584 +/- 0.06866

PR 1 : Gemma 2 9b It IQ3_XS (with iMatrix), quant made from BF16
Size : 3.82 GiB (3.55 BPW)
Arc-C 299     57.19063545
Arc-E 570     78.07017544
PPL 512 wikitext :  9.0658 +/- 0.06633

PR 2
Size : 3.88 GiB (3.60 BPW), 4.16 GB
PPL over 569 chunks for n_ctx=512 : 8.9976 +/- 0.06585


IQ3_S

MASTER : Gemma 2 9b It IQ3_S (with iMatrix, attn_v in IQ3_S), quant made from BF16
Size : 4.03 GiB (3.75 BPW)
Arc-C 299     57.52508361
Arc-E 570     77.71929825
PPL 512 wikitext : 9.2100 +/- 0.06859

PR : Gemma 2 9b It IQ3_S (with iMatrix, attn_v in Q4_K), quant made from BF16
Size : 4.07 GiB (3.79 BPW)
Arc-C 299     57.19063545
Arc-E 570     78.07017544
PPL 512 wikitext : 9.0082 +/- 0.06633

PR rev 2 : Gemma 2 9b It IQ3_S (with iMatrix), quant made from BF16
Size : 4.07 GiB (3.79 BPW)
Arc-C 299     56.85618729
Arc-E 570     78.42105263
PPL 512 wikitext : 9.0082 +/- 0.06633
(I think the ARC differences are due to the b3565 merge.)
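The Arc-C and Arc-E lines appear to give the number of questions (299 and 570), followed by the accuracy in percent. That reading is an assumption, but the scores shown are consistent with it. Converting scores back to question counts shows how small the differences noted above are. Here is a Python sketch comparing the IQ3_S "PR" and "PR rev 2" results:

```python
def correct_answers(accuracy_pct: float, n_questions: int) -> int:
    """Recover the number of correctly answered questions from an accuracy in percent."""
    return round(accuracy_pct / 100.0 * n_questions)

# IQ3_S "PR" vs "PR rev 2": same PPL, slightly different ARC scores
arc_c = (correct_answers(57.19063545, 299), correct_answers(56.85618729, 299))
arc_e = (correct_answers(78.07017544, 570), correct_answers(78.42105263, 570))
print("Arc-C correct:", arc_c)
print("Arc-E correct:", arc_e)
print(f"One Arc-C question = {100 / 299:.2f} points, one Arc-E question = {100 / 570:.2f} points")
```

Under this reading, PR rev 2 answers one fewer Arc-C question and two more Arc-E questions. These gaps of a few tenths of a point are small enough to plausibly come from incidental changes such as the b3565 merge.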

PR rev 3 - CURRENT : Gemma 2 9b It IQ3_S (with iMatrix), quant made from BF16
Size : 4.05 GiB (3.76 BPW)
Arc-C 299     57.52508361
Arc-E 570     78.42105263
PPL 512 wikitext : 8.9969 +/- 0.06610

PR 4
Size : 4.05 GiB (3.76 BPW), 4.35 GB
PPL over 569 chunks for n_ctx=512 : 8.9734 +/- 0.06584


IQ3_M

MASTER : Gemma 2 9b It IQ3_M (with iMatrix, attn_output in Q4_K), quant made from BF16
Size : 4.18 GiB (3.89 BPW)
Arc-C 299     56.85618729
Arc-E 570     77.71929825
PPL 512 wikitext : 8.9697 +/- 0.06598

PR : Gemma 2 9b It IQ3_M (with iMatrix, attn_output in IQ4_XS), quant made from BF16
Size : 4.16 GiB (3.87 BPW)
Arc-C 299     57.19063545
Arc-E 570     77.71929825
PPL 512 wikitext : 8.9556 +/- 0.06586

PR rev 2 : Gemma 2 9b It IQ3_M (with iMatrix, attn_output in IQ4_XS, attn_v in Q5_K), quant made from BF16
Size : 4.20 GiB (3.90 BPW)
Arc-C 299     58.52842809
Arc-E 570     77.54385965
PPL 512 wikitext : 8.9445 +/- 0.06576

PR rev 3 - CURRENT : Gemma 2 9b It IQ3_M (with iMatrix, attn_output in IQ4_XS, attn_v in Q5_K, attn_k in IQ4_XS), quant made from BF16
Size : 4.23 GiB (3.93 BPW)
Arc-C 299     58.19397993
Arc-E 570     77.19298246
PPL 512 wikitext : 8.9082 +/- 0.06536


IQ3_XL

PR CURRENT : Gemma 2 9b It IQ3_XL (with iMatrix), quant made from BF16
Size : 4.50 GiB (4.18 BPW)
Arc-C 299     56.85618729 
Arc-E 570     78.42105263
PPL 512 wikitext : 8.8843 +/- 0.06558


IQ4_XS

MASTER : Gemma 2 9b It IQ4_XS (with iMatrix), quant made from BF16
Size : 4.87 GiB (4.52 BPW)
Arc-C 299     57.52508361
Arc-E 570     78.24561404
PPL 512 wikitext : 8.8456 +/- 0.06533

PR CURRENT : Gemma 2 9b It IQ4_XS (with iMatrix), quant made from BF16
Size : 4.91 GiB (4.56 BPW)
PPL 512 wikitext : 8.8370 +/- 0.06525


Q4_K_S
Master
Size : 5.13 GiB (4.77 BPW)
PPL 512 wikitext : 8.8367 +/- 0.06523


Q4_K_M
Master
Size : 5.40 GiB (5.02 BPW)
PPL 512 wikitext : 8.8054 +/- 0.06487


Q5_K_S
Master
Size : 6.03 GiB (5.61 BPW)
PPL 512 wikitext : 8.8067 +/- 0.06511


Q5_K_M
Master
Size : 6.19 GiB (5.75 BPW)
PPL 512 wikitext : 8.7973 +/- 0.06502


FP16

MASTER : Gemma 2 9b It F16.
Size : 17.22 GiB (16.00 BPW)
Arc-C 299     59.53177258
Arc-E 570     78.77192982
PPL 512 wikitext : 8.7881 +/- 0.06533
Format : GGUF
Model size : 9.24B params
Architecture : gemma2
