3-Bit quantization?
#3 - opened by nicostouch
Since llama.cpp with the new GGML format can split inference between CPU and GPU, the q4_0 model gets just under 2 tokens/s when offloading 40 layers to an RTX 4090 and running the rest on a 7900X3D, which is just shy of feeling usable. I was wondering if 3-bit quantization is possible, and whether the trade-off in perplexity/speed might still give better output than the 30B models while running at a decent speed?
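For reference, here's a minimal sketch of the kind of split I mean, using the llama-cpp-python bindings rather than the llama.cpp CLI directly (the model path is a placeholder, and this assumes a CUDA-enabled build):

```python
# Sketch of CPU/GPU split inference with llama-cpp-python.
# model_path is a placeholder; n_gpu_layers=40 mirrors the offload setup above.
from llama_cpp import Llama

llm = Llama(
    model_path="models/ggml-model-q4_0.bin",  # placeholder path to a q4_0 GGML model
    n_gpu_layers=40,                          # offload 40 layers to the GPU
)

output = llm("The capital of France is", max_tokens=16)
print(output["choices"][0]["text"])
```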
Yeah, maybe. GPTQ does support 3-bit quantisation (CUDA only, not Triton). I haven't tested it, and I have a feeling very few people have, so there may well be issues and bugs.
But I will make a note to try it out with AutoGPTQ sometime soon and see how it goes.
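If anyone wants to experiment in the meantime, here's a rough sketch of what that would look like with AutoGPTQ, following its standard quantisation flow but with bits=3. The model path and calibration text are placeholders, and 3-bit is much less tested than 4-bit, so treat this as a starting point rather than a recipe:

```python
# Sketch: 3-bit GPTQ quantisation with AutoGPTQ.
# model_id and the calibration example are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "path/to/fp16-model"  # placeholder: the unquantised source model

quantize_config = BaseQuantizeConfig(
    bits=3,          # 3-bit weights instead of the usual 4
    group_size=128,  # common group size; affects size/perplexity trade-off
    desc_act=False,  # act-order off for wider kernel compatibility
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
examples = [
    tokenizer("Placeholder calibration text for the GPTQ pass.")
]

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)            # runs GPTQ against the calibration examples
model.save_quantized("model-3bit")  # writes quantised weights + config
```

In practice you'd want a few hundred real calibration samples rather than one placeholder string, but the flow is the same.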