Basic question about 4-bit quantization
Hi,
Pardon me for asking this, but I have a very basic question about 4-bit quantization. How are these 4-bit quantized weights loaded in PyTorch (through the HF AutoModelForCausalLM API) when PyTorch doesn't natively support int4?
For example, I understand how 4-bit quantized vectors (or matrices) and the corresponding fp32 scaling factors and zero points can be stored contiguously, as explained here; however, I am not clear on how the computations are done in PyTorch when it doesn't support a native int4 data type.
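To make my mental model concrete, here's a toy sketch of what I imagine the storage and compute could look like (the packing layout, scale and zero point values are made up for illustration and not any particular library's actual format):

```python
import torch

# Toy illustration only: pack eight unsigned 4-bit weights (values 0..15)
# into four uint8 bytes, two nibbles per byte, with one fp32 scale and
# zero point for the whole group.
q = torch.tensor([3, 12, 7, 0, 15, 9, 1, 5], dtype=torch.uint8)
packed = q[0::2] | (q[1::2] << 4)   # 4 bytes now hold 8 weights
scale, zero_point = 0.05, 8         # made-up per-group parameters

# At compute time the nibbles get unpacked and dequantized into a dtype
# PyTorch does support (fp16/fp32), and the matmul runs in that dtype.
low = packed & 0x0F
high = packed >> 4
unpacked = torch.stack([low, high], dim=1).flatten()
w = (unpacked.float() - zero_point) * scale  # fp32 weights for the matmul
```

Is dequantizing on the fly into a supported dtype roughly what happens in practice?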
Thanks!
You can't load 4bit models in native transformers at the moment. You may be able to do so soon, when bitsandbytes releases its new 4bit mode. However, then you would use the base float16 model with something like load_in_4bit=True (not sure exactly, as it's not released yet) - same principle as their current 8bit quantisations.
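If it works like the 8bit path, I'd expect it to look roughly like this - to be clear, this is a guess: the load_in_4bit argument name and behaviour aren't confirmed until it's actually released, and the model name is just a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id - the starting point would be a base float16 model.
model_id = "some-org/base-fp16-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Speculative: mirrors today's load_in_8bit=True flag; the real 4bit API
# may differ once bitsandbytes ships its 4bit mode.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_4bit=True,
)
```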
To load GPTQ 4bit models you need to use compatible code.
There's a relatively new repo called AutoGPTQ which aims to make it as easy as possible to load GPTQ models and then use them with standard transformers code. You still don't use AutoModelForCausalLM - instead you use AutoGPTQForCausalLM - but once the model is loaded, you can use any normal transformers code.
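For example, loading and generating with a GPTQ model looks roughly like this (the exact arguments may shift a bit since the repo is still new, and the model path here is just a placeholder):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Placeholder repo id for a GPTQ-quantised model.
model_id = "TheBloke/some-model-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# from_quantized loads the packed 4bit weights plus the quantize config.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
)

# From here it's plain transformers-style usage.
prompt = "Tell me about 4-bit quantization."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```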
Thanks @TheBloke for the reply and all the great work you do in providing quantized 4-bit models to the community.