`--quantization gptq_marlin` does not work (vLLM reports that gptq_marlin is not found); removing the flag makes it work.
Environment: vLLM 0.5.3.post
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
    -v hf_cache:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 \
    --quantization gptq_marlin \
    --tensor-parallel-size 8 \
    --max-model-len 4096
Hi there @linpan, could you please add the error or elaborate more on why it fails? Thanks!
With `--quantization gptq_marlin`, vLLM reports that the quantization method is not found. After removing `--quantization gptq_marlin` it works, even though vLLM 0.5.3 supports gptq_marlin.
Well, that's odd, since it should support gptq_marlin as per https://docs.vllm.ai/en/v0.5.3/models/engine_args.html:
--quantization, -q
Possible choices: aqlm, awq, deepspeedfp, fp8, fbgemm_fp8, marlin, gptq_marlin_24, gptq_marlin, awq_marlin, gptq, squeezellm, compressed-tensors, bitsandbytes, None
Method used to quantize the weights. If None, we first check the quantization_config attribute in the model config file. If that is None, we assume the model weights are not quantized and use dtype to determine the data type of the weights.
I guess the marlin kernels will be used by default anyway, as that's more optimal, but it's still weird that gptq_marlin doesn't work. Could you please file an issue at https://github.com/vllm-project/vllm/issues? They will be able to address that better 🤗
Looking at this model's config, it is not quantized in marlin format, so that should be the reason.
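If anyone wants to verify that themselves, a quick way is to print the `quantization_config` block of the model's `config.json` (a minimal sketch, assuming `curl` and `jq` are available; the exact keys present depend on how the checkpoint was exported):

```bash
# Print the quantization_config block of the model's config.json;
# per the comment above, it records plain GPTQ metadata and no marlin checkpoint format.
curl -s "https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4/resolve/main/config.json" \
  | jq '.quantization_config'
```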
If that's the case, then do you mind opening a PR here to replace the `gptq_marlin` line within the vLLM command with `gptq` instead? Thanks a lot 🤗
If you want marlin, you're probably better off using
https://huggingface.co/neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
It performed about twice as fast on my setup.
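In case it's useful, here is a sketch of serving that checkpoint with the same docker setup as above (an assumption on my part: no explicit `--quantization` flag is passed, so vLLM picks the scheme up from the model config as described in the docs quoted earlier):

```bash
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
    -v hf_cache:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 \
    --tensor-parallel-size 8 \
    --max-model-len 4096
```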