---
license: llama3.1
---

Experimental .GGUF quants for https://huggingface.co/google/gemma-2-9b-it, made according to the llama.cpp (LCPP) PR
(based on b_3529, and on b_3565 for the newer ones): https://github.com/ggerganov/llama.cpp/pull/8836

These experimental quant strategies, which revisit Ikawrakow's work, show a slight decrease in perplexity,
including per bpw (from 10%+ for the lowest quants to 0.x% for the highest ones).
This is significant enough to encourage you folks to test them and provide feedback where pertinent.
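
To make the "10%+ down to 0.x%" range concrete, here is a minimal sketch that computes the relative perplexity decrease of a PR quant versus its Master counterpart. The numbers are copied from the data block in this README (first PR variant only); the function name `ppl_reduction_pct` and the dictionary layout are mine, for illustration, and are not part of llama.cpp.

```python
# Perplexity figures copied from the data block in this README
# (n_ctx=512). Each entry is (Master, PR) for the first PR variant.
PPL = {
    "IQ1_S": (61.2817, 25.2524),
    "IQ1_M": (26.3761, 20.0588),
    "IQ2_XXS": (15.2572, 13.8073),
    "IQ2_M": (9.5935, 9.4125),
    "IQ3_S": (7.9894, 7.9067),
}


def ppl_reduction_pct(master: float, pr: float) -> float:
    """Relative perplexity decrease of PR versus Master, in percent."""
    return (master - pr) / master * 100.0


for name, (master, pr) in PPL.items():
    print(f"{name:8s} {ppl_reduction_pct(master, pr):6.2f}%")
```

This ignores the small size differences between Master and PR quants, so it is only a rough reading of the gain, not a per-bpw comparison.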

The iMatrix I use is based on Group Merged V3, enriched with a bit of French,
a bit of Serbian, and a bit of Croatian.


ARC and PPL-512 DATA (get the latest data from the main post of the PR thread):

```
IQ1_XS - Unusable on <30B models
PR
1.94 GB (1.93 BPW)
1.81 GiB (1.93 BPW)

PPL over 564 chunks for n_ctx=512 = 40.0024 +/- 0.27710


IQ1_S - Unusable on <30B models
Master
2.01 GB (2.00 BPW)
1.87 GiB (2.00 BPW)
PPL over 564 chunks for n_ctx=512 = 61.2817 +/- 0.41707

PR
2.05 GB (2.04 BPW)
1.91 GiB (2.04 BPW)
PPL over 564 chunks for n_ctx=512 = 25.2524 +/- 0.17651


IQ1_M
Master
2.15 GB (2.15 BPW)
2.01 GiB (2.15 BPW)
PPL over 564 chunks for n_ctx=512 = 26.3761 +/- 0.18200

PR
2.14 GB (2.13 BPW)
1.99 GiB (2.13 BPW)
PPL over 564 chunks for n_ctx=512 = 20.0588 +/- 0.14001


IQ1_XL - Unusable on <= 13b models
PR
2.21 GB (2.21 BPW)
2.06 GiB (2.21 BPW)
PPL over 564 chunks for n_ctx=512 = 18.5500 +/- 0.12753

PR2
2.23 GB (2.22 BPW)
2.08 GiB (2.22 BPW)
PPL over 564 chunks for n_ctx=512 = 17.4537 +/- 0.11995

PR3
2.25 GB (2.25 BPW)
2.10 GiB (2.25 BPW)
PPL over 564 chunks for n_ctx=512 = 17.3669 +/- 0.11928


IQ2_XXS
Master
2.39 GB (2.38 BPW)
2.23 GiB (2.38 BPW)
PPL over 564 chunks for n_ctx=512 = 15.2572 +/- 0.10267

PR
2.38 GB (2.37 BPW)
2.22 GiB (2.37 BPW)
PPL over 564 chunks for n_ctx=512 = 13.8073 +/- 0.09290

PR2
2.40 GB (2.39 BPW)
2.23 GiB (2.39 BPW)
PPL over 564 chunks for n_ctx=512 = 12.9671 +/- 0.08687


IQ2_XS
Master
2.60 GB (2.59 BPW)
2.42 GiB (2.59 BPW)
PPL over 564 chunks for n_ctx=512 = 11.7483 +/- 0.07776

PR
2.52 GB (2.51 BPW)
2.35 GiB (2.51 BPW)
PPL over 564 chunks for n_ctx=512 = 11.6639 +/- 0.07805

PR2
2.53 GB (2.52 BPW)
2.36 GiB (2.52 BPW)
PPL over 564 chunks for n_ctx=512 = 11.5685 +/- 0.07742

PR3
2.58 GB (2.57 BPW)
2.40 GiB (2.57 BPW)
PPL over 564 chunks for n_ctx=512 = 11.3031 +/- 0.07514

PR4
2.59 GB (2.58 BPW)
2.42 GiB (2.58 BPW)
PPL over 564 chunks for n_ctx=512 = 10.9291 +/- 0.07270


IQ2_S
Master
2.75 GB (2.74 BPW)
2.56 GiB (2.74 BPW)
PPL over 564 chunks for n_ctx=512 = 10.5180 +/- 0.06976

PR (fail)
2.71 GB (2.70 BPW)
2.52 GiB (2.70 BPW)
PPL over 564 chunks for n_ctx=512 = 10.7010 +/- 0.07027

PR2
2.75 GB (2.74 BPW)
2.56 GiB (2.74 BPW)
PPL over 564 chunks for n_ctx=512 = 10.3728 +/- 0.06806


IQ2_M
Master
2.94 GB (2.93 BPW)
2.74 GiB (2.93 BPW)
PPL over 564 chunks for n_ctx=512 = 9.5935 +/- 0.06228

PR
2.93 GB (2.92 BPW)
2.73 GiB (2.92 BPW)
PPL over 564 chunks for n_ctx=512 = 9.4125 +/- 0.06039


IQ2_XL
PR
2.99 GB (2.98 BPW)
2.78 GiB (2.98 BPW)
PPL over 564 chunks for n_ctx=512 = 9.3122 +/- 0.05973

PR2
3.11 GB (3.10 BPW)
2.90 GiB (3.10 BPW)
PPL over 564 chunks for n_ctx=512 = 9.0378 +/- 0.05764

PR3
3.14 GB (3.13 BPW)
2.93 GiB (3.13 BPW)
PPL over 564 chunks for n_ctx=512 = 8.8604 +/- 0.05620


IQ3_XXS

Master
Size : 3.04 GiB (3.25 BPW)
PPL 512 wikitext : 8.4985 +/- 0.05402

PR (good)
Size : 3.11 GiB (3.32 BPW)
PPL 512 wikitext : 8.3274 +/- 0.05334

PR2 (so so)
Size : 3.08 GiB (3.29 BPW)
PPL 512 : 8.3906 +/- 0.05329

Let's keep the first PR


IQ3_XS

Master
Size : 3.27 GiB (3.50 BPW)
PPL 512 wikitext : 8.2019 +/- 0.05167

PR (ok)
Size : 3.24 GiB (3.47 BPW)
PPL 512 wikitext : 8.1762 +/- 0.05176


IQ3_S

Master
Size : 3.42 GiB (3.66 BPW)
PPL 512 wikitext : 7.9894 +/- 0.05020

PR (good)
Size : 3.41 GiB (3.64 BPW)
PPL 512 wikitext : 7.9067 +/- 0.05022


IQ3_M

Master
Size : 3.52 GiB (3.76 BPW)  
PPL 512 wikitext : 7.9263 +/- 0.04943

PR (good)
Size : 3.49 GiB (3.73 BPW)
PPL 512 wikitext : 7.8704 +/- 0.04951


IQ3_XL

PR (good)
Size : 3.71 GiB (3.97 BPW)
PPL 512 wikitext : 7.7225 +/- 0.04946


IQ3_XXL

PR (good, the benefit seems meager but the token embeddings pushed from IQ3_S to IQ4_XS explain +0.05 BPW of it,
and this tensor doesn't run in VRAM but in RAM)
Size : 3.83 GiB (4.09 BPW)
PPL 512 wikitext : 7.6720 +/- 0.04892


IQ3_XXXL

PR (good)
Size : 3.97 GiB (4.24 BPW)
PPL 512 wikitext : 7.5920 +/- 0.04839


IQ4_XS

Master
Size : 4.13 GiB (4.42 BPW)
Arc-C 299     49.16387960    
Arc-E 570     72.10526316     
PPL 512 wikitext : 7.5226 +/- 0.04820


IQ4_XSR

PR (good)
Size : 4.16 GiB (4.45 BPW)
Arc-C 299    
Arc-E 570      
PPL 512 wikitext : 7.5072 +/- 0.04814


FP16

MASTER : Gemma 2 9b It F16.
Size : 14.96 GiB (16.00 BPW)
Arc-C 299     49.49832776
Arc-E 570     73.85964912
PPL 512 wikitext : 7.3224 +/- 0.04674

```
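
The sizes above are given in both GB (decimal) and GiB (binary), alongside the BPW. As a rough sketch of how these figures relate, the snippet below converts GiB to GB and estimates the number of weights implied by a size and a BPW. The helper names are hypothetical, and the weight count is only an estimate derived from the rounded figures in this README, not an official parameter count.

```python
GIB = 2**30  # bytes in one GiB (binary unit)
GB = 10**9   # bytes in one GB (decimal unit)


def gib_to_gb(size_gib: float) -> float:
    """Convert a size in GiB to GB."""
    return size_gib * GIB / GB


def implied_weights(size_gib: float, bpw: float) -> float:
    """Estimate the weight count from file size (GiB) and bits per weight."""
    return size_gib * GIB * 8 / bpw


# IQ1_XS entry from the data block: 1.81 GiB at 1.93 BPW.
print(f"{gib_to_gb(1.81):.2f} GB")
print(f"{implied_weights(1.81, 1.93) / 1e9:.2f} billion weights (estimate)")
```

Because the published sizes and BPW values are rounded to two decimals, the implied weight count varies slightly from one quant to another.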