Quantization results in model not supporting Tensor Parallel mode.

#3
by stev236 - opened

I had great hopes for this model to compete with Alibaba's Ling2 or Qwen3-4b, but, unfortunately, it seems it can't support any tensor-parallel mode (2 or 4) once quantized. Given the model size, that makes it unusable for many who run multi-GPU setups. Here's the error message I get when trying to serve it with vLLM using --tensor-parallel-size 4 or 2.
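For reference, the invocation is essentially this (the model path is just a placeholder for the quantized checkpoint):

```
vllm serve <path-to-quantized-model> --tensor-parallel-size 4
```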

[multiproc_executor.py:597] AssertionError: Tensor parallel currently supported for quantized models only if tensor parallel world size divides num groups.

I think the error refers to the quantization group size (128) not dividing evenly into some dimension of 192 in the model, which would make the number of quantization groups impossible to split across tensor-parallel ranks.
Is there a quantization expert who could tell me if there's a way to solve this problem?
Would a different quantization group size solve the problem?
Thanks for any help.
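In case it helps anyone reason about this, here's how I understand the check behind the assertion. This is just my own sketch in plain Python, not vLLM's actual code, and the 192 dimension plus the candidate group sizes are guesses on my part:

```python
# Sketch of the divisibility check the assertion message seems to describe.
# dim=192 is my guess at the offending model dimension; the group sizes
# are just candidates to try, not values read from any config.

def tp_compatible(dim: int, group_size: int, tp_size: int) -> bool:
    """True if `dim`, quantized in groups of `group_size`, can be
    sharded across `tp_size` tensor-parallel ranks."""
    if dim % group_size != 0:
        return False  # the group size must tile the dimension exactly
    num_groups = dim // group_size
    # the assertion: TP world size must divide the number of groups
    return num_groups % tp_size == 0

for group_size in (128, 96, 64, 48, 32):
    for tp in (2, 4):
        ok = tp_compatible(192, group_size, tp)
        print(f"group_size={group_size:3d}  tp={tp}  ->  {'ok' if ok else 'fails'}")
```

If that reading is right, group_size=128 fails before we even get to counting groups (128 doesn't divide 192), and of these candidates only group_size=48 satisfies both TP 2 and TP 4. Whether the quantization tooling actually accepts such a group size is a separate question.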

stev236 changed discussion title from Tensor parallel not supported for quantized models ?!? to Quantization results in model not supporting Tensor Parallel mode.
