- GGUF: perfect for inference on CPUs (and LM Studio)
- GPTQ/EXL2: fast inference on GPUs
- AWQ: super fast inference on GPUs with vLLM (https://github.com/vllm-project/vllm) — see the sketch after this list
- HQQ: extreme quantization with decent 2-bit and 3-bit models
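As a quick illustration of the AWQ + vLLM combination, here is a minimal sketch; the model ID is a placeholder (any AWQ-quantized checkpoint on the Hub would do) and the sampling settings are just example values, not a prescribed configuration.

```python
from vllm import LLM, SamplingParams

# Placeholder repo ID: substitute any AWQ-quantized model from the Hugging Face Hub
llm = LLM(model="your-username/Mistral-7B-awq", quantization="awq")

# Example sampling settings (adjust to taste)
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["Explain quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```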
Once the model is converted, it is automatically uploaded to the Hugging Face Hub. To quantize a 7B model, GGUF only needs a T4 GPU, while the other methods require an A100 GPU.
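For reference, a rough sketch of what that upload step can look like with the `huggingface_hub` library is shown below; the repo ID and local folder are placeholders, not the exact code used here.

```python
from huggingface_hub import HfApi

api = HfApi()

# Placeholder repo ID and local path for the quantized model
repo_id = "your-username/Mistral-7B-GGUF"
api.create_repo(repo_id=repo_id, exist_ok=True)

# Push the folder containing the quantized weights to the Hub
api.upload_folder(folder_path="./quantized_model", repo_id=repo_id)
```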