Quantization update?

#5
by imi2 - opened

Any chance you or someone could requantize to the latest exl2? I have no idea of the total VRAM requirements when quantizing, and I have FOMO about the bpw quality improvements made since then.

Hi there, sorry, life has been busy. I will try to do it as soon as I can. It takes about 10 hours to do fully, and then ~5 hours for each bpw size on an RTX 4090.

imi2 changed discussion status to closed

No worries, take your time!

P.S. I don't know if it's just me, but the newer quantized models seem to take slightly more space on the same system. With a 2.4bpw quant and fp8 KV cache, I used to fit 16k context on a 1x3090 system. The system is unchanged, but now it only fits 8k with fp8 cache.
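For anyone trying to reason about context fit, here is a rough back-of-envelope sketch of fp8 KV-cache memory. The model dimensions below are hypothetical placeholders for a 70B-class GQA model, not this model's actual config:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """Rough KV-cache size: keys + values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical config: 80 layers, 8 KV heads, head dim 128,
# fp8 cache (1 byte/element) at 16k context.
size = kv_cache_bytes(80, 8, 128, 16384, 1)
print(f"{size / 2**30:.2f} GiB")  # -> 2.50 GiB
```

Actual fit also depends on the quantized weights, activation buffers, and allocator fragmentation, so a context drop from 16k to 8k could come from changes outside the cache itself.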

I have updated the quants, so I suggest making a backup just in case.

About the newer quantized sizes, I haven't tested enough yet, but they seem fairly similar.
