---
license: llama3.1
---

Experimental .GGUF quants for https://huggingface.co/google/gemma-2-9b-it, made according to LCPP PR
(based on b3529, and now b3565 for the newer ones): https://github.com/ggerganov/llama.cpp/pull/8836

These experimental quant strategies, which revisit Ikawrakow's work, show a slight decrease in perplexity,
including per BPW (from 10%+ for the lowest quants down to 0.x% for the highest ones).
This is significant enough to encourage you folks to test them, and to provide feedback where pertinent.

The iMatrix I use is based on Group Merged V3, enriched with a bit of French,
Serbian, and Croatian.
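As a sanity check on the "Size (BPW)" figures listed below, BPW is simply the file size in bits divided by the parameter count. A minimal sketch, assuming the parameter count implied by the F16 entry at the bottom of the table (14.96 GiB at 16.00 BPW); exact tensor sizes may differ slightly:

```python
# Relate the "Size (BPW)" figures below to a parameter count.
# Assumption: the count is back-derived from the F16 entry
# (14.96 GiB at 16.00 BPW), not taken from the model config.

GIB = 2**30

def params_from(size_gib: float, bpw: float) -> float:
    """Parameter count = total bits / bits-per-weight."""
    return size_gib * GIB * 8 / bpw

def bpw_of(size_gib: float, n_params: float) -> float:
    """Bits per weight for a file of size_gib GiB."""
    return size_gib * GIB * 8 / n_params

n_params = params_from(14.96, 16.00)

print(round(n_params / 1e9, 2))          # ~8.03 (billion weights)
print(round(bpw_of(2.73, n_params), 2))  # IQ2_M PR entry: ~2.92 BPW
```

The same arithmetic also explains the paired GB/GiB lines in the table: they are the identical file size in decimal and binary units.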


ARC and PPL-512 data (see the main post of the PR thread for the latest figures):

```
IQ1_XS - Unusable on <30B models
PR
1.94 GB (1.93 BPW)
1.81 GiB (1.93 BPW)

PPL over 564 chunks for n_ctx=512 = 40.0024 +/- 0.27710


IQ1_S - Unusable on <30B models
Master
2.01 GB (2.00 BPW)
1.87 GiB (2.00 BPW)
PPL over 564 chunks for n_ctx=512 = 61.2817 +/- 0.41707

PR
2.05 GB (2.04 BPW)
1.91 GiB (2.04 BPW)
PPL over 564 chunks for n_ctx=512 = 25.2524 +/- 0.17651


IQ1_M
Master
2.15 GB (2.15 BPW)
2.01 GiB (2.15 BPW)
PPL over 564 chunks for n_ctx=512 = 26.3761 +/- 0.18200

PR
2.14 GB (2.13 BPW)
1.99 GiB (2.13 BPW)
PPL over 564 chunks for n_ctx=512 = 20.0588 +/- 0.14001


IQ1_XL - Unusable on <= 13b models
PR
2.21 GB (2.21 BPW)
2.06 GiB (2.21 BPW)
PPL over 564 chunks for n_ctx=512 = 18.5500 +/- 0.12753

PR2
2.23 GB (2.22 BPW)
2.08 GiB (2.22 BPW)
PPL over 564 chunks for n_ctx=512 = 17.4537 +/- 0.11995

PR3
2.25 GB (2.25 BPW)
2.10 GiB (2.25 BPW)
PPL over 564 chunks for n_ctx=512 = 17.3669 +/- 0.11928


IQ2_XXS
Master
2.39 GB (2.38 BPW)
2.23 GiB (2.38 BPW)
PPL over 564 chunks for n_ctx=512 = 15.2572 +/- 0.10267

PR
2.38 GB (2.37 BPW)
2.22 GiB (2.37 BPW)
PPL over 564 chunks for n_ctx=512 = 13.8073 +/- 0.09290

PR2
2.40 GB (2.39 BPW)
2.23 GiB (2.39 BPW)
PPL over 564 chunks for n_ctx=512 = 12.9671 +/- 0.08687


IQ2_XS
Master
2.60 GB (2.59 BPW)
2.42 GiB (2.59 BPW)
PPL over 564 chunks for n_ctx=512 = 11.7483 +/- 0.07776

PR
2.52 GB (2.51 BPW)
2.35 GiB (2.51 BPW)
PPL over 564 chunks for n_ctx=512 = 11.6639 +/- 0.07805

PR2
2.53 GB (2.52 BPW)
2.36 GiB (2.52 BPW)
PPL over 564 chunks for n_ctx=512 = 11.5685 +/- 0.07742

PR3
2.58 GB (2.57 BPW)
2.40 GiB (2.57 BPW)
PPL over 564 chunks for n_ctx=512 = 11.3031 +/- 0.07514

PR4
2.59 GB (2.58 BPW)
2.42 GiB (2.58 BPW)
PPL over 564 chunks for n_ctx=512 = 10.9291 +/- 0.07270


IQ2_S
Master
2.75 GB (2.74 BPW)
2.56 GiB (2.74 BPW)
PPL over 564 chunks for n_ctx=512 = 10.5180 +/- 0.06976

PR (fail)
2.71 GB (2.70 BPW)
2.52 GiB (2.70 BPW)
PPL over 564 chunks for n_ctx=512 = 10.7010 +/- 0.07027

PR2
2.75 GB (2.74 BPW)
2.56 GiB (2.74 BPW)
PPL over 564 chunks for n_ctx=512 = 10.3728 +/- 0.06806


IQ2_M
Master
2.94 GB (2.93 BPW)
2.74 GiB (2.93 BPW)
PPL over 564 chunks for n_ctx=512 = 9.5935 +/- 0.06228

PR
2.93 GB (2.92 BPW)
2.73 GiB (2.92 BPW)
PPL over 564 chunks for n_ctx=512 = 9.4125 +/- 0.06039


IQ2_XL
PR
2.99 GB (2.98 BPW)
2.78 GiB (2.98 BPW)
PPL over 564 chunks for n_ctx=512 = 9.3122 +/- 0.05973

PR2
3.11 GB (3.10 BPW)
2.90 GiB (3.10 BPW)
PPL over 564 chunks for n_ctx=512 = 9.0378 +/- 0.05764

PR3
3.14 GB (3.13 BPW)
2.93 GiB (3.13 BPW)
PPL over 564 chunks for n_ctx=512 = 8.8604 +/- 0.05620


IQ3_XXS

Master
Size : 3.04 GiB (3.25 BPW)
PPL 512 wikitext : 8.4985 +/- 0.05402

PR (good)
Size : 3.11 GiB (3.32 BPW)
PPL 512 wikitext : 8.3274 +/- 0.05334

PR2 (so so)
Size : 3.08 GiB (3.29 BPW)
PPL 512 wikitext : 8.3906 +/- 0.05329

Let's keep the first PR


IQ3_XS

Master
Size : 3.27 GiB (3.50 BPW)
PPL 512 wikitext : 8.2019 +/- 0.05167

PR (ok)
Size : 3.24 GiB (3.47 BPW)
PPL 512 wikitext : 8.1762 +/- 0.05176


IQ3_S

Master
Size : 3.42 GiB (3.66 BPW)
PPL 512 wikitext : 7.9894 +/- 0.05020

PR (good)
Size : 3.41 GiB (3.64 BPW)
PPL 512 wikitext : 7.9067 +/- 0.05022


IQ3_M

Master
Size : 3.52 GiB (3.76 BPW)  
PPL 512 wikitext : 7.9263 +/- 0.04943

PR (good)
Size : 3.49 GiB (3.73 BPW)
PPL 512 wikitext : 7.8704 +/- 0.04951


IQ3_XL

PR (good)
Size : 3.71 GiB (3.97 BPW)
PPL 512 wikitext : 7.7225 +/- 0.04946


IQ3_XXL

PR (good; the benefit seems meager, but the token embeddings pushed from IQ3_S to IQ4_XS explain +0.05 BPW of it,
and this tensor runs in RAM rather than VRAM)
Size : 3.83 GiB (4.09 BPW)
PPL 512 wikitext : 7.6720 +/- 0.04892


IQ3_XXXL

PR (good)
Size : 3.97 GiB (4.24 BPW)
PPL 512 wikitext : 7.5920 +/- 0.04839


IQ4_XS

Master
Size : 4.13 GiB (4.42 BPW)
Arc-C 299     49.16387960    
Arc-E 570     72.10526316     
PPL 512 wikitext : 7.5226 +/- 0.04820


IQ4_XSR

PR (good)
Size : 4.16 GiB (4.45 BPW)
Arc-C 299    
Arc-E 570      
PPL 512 wikitext : 7.5072 +/- 0.04814


FP16

MASTER : Gemma 2 9b It F16.
Size : 14.96 GiB (16.00 BPW)
Arc-C 299     49.49832776
Arc-E 570     73.85964912
PPL 512 wikitext : 7.3224 +/- 0.04674

```
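To put the "from 10%+ down to 0.x%" claim into numbers, the relative PPL reduction of a PR quant over its master baseline can be computed directly from the figures above. A quick sketch using three representative pairs (for the last one, the new IQ4_XSR is compared against the IQ4_XS master baseline, since IQ4_XSR has no master entry of its own):

```python
# Relative perplexity reduction of PR quants vs. master,
# using (master, PR) PPL-512 pairs copied from the table above.
pairs = {
    "IQ1_M":            (26.3761, 20.0588),
    "IQ2_M":            (9.5935, 9.4125),
    "IQ4_XS -> IQ4_XSR": (7.5226, 7.5072),
}

for name, (master, pr) in pairs.items():
    reduction = (master - pr) / master * 100
    print(f"{name}: -{reduction:.2f}% PPL")
# IQ1_M: -23.95% PPL
# IQ2_M: -1.89% PPL
# IQ4_XS -> IQ4_XSR: -0.20% PPL
```

Note that the PR quants are not always at identical BPW to their master counterparts, so the per-BPW gain is slightly smaller than the raw PPL gain where the PR file is larger.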