The inference performance of the DeepSeek-R1-AWQ model is weak compared to the DeepSeek-R1 model

#3
by qingqingz916 - opened

We ran inference performance tests on the DeepSeek-R1 model and the DeepSeek-R1-AWQ model on 8x H200 GPUs, with input/output lengths of 1000/100, 2000/200, and 5000/500 and batch sizes from 1 to 50. Only in the case of batch size 1 with input/output 1000/100 was the throughput of the AWQ model slightly better; under all remaining conditions, the performance of the DeepSeek-R1-AWQ model lagged far behind the DeepSeek-R1 model.
Is this unexpected result because our experimental setups are not fully aligned, or is the DeepSeek-R1-AWQ model really not suitable for scenarios with long inputs and large batch sizes?
DeepSeek-R1-AWQ model:

[benchmark results image]

DeepSeek-R1 model:

[benchmark results image]

Cognitive Computations org

If you are using MLA on DeepSeek R1, it will indeed be much faster under high batch sizes; the vLLM implementation doesn't support MLA for AWQ yet.

Also, under high batch sizes, model inference becomes more and more compute-bound instead of memory-bound, so the advantage of low-bit quantization shrinks.
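
To put rough numbers on the memory-bound vs compute-bound point, here is a back-of-the-envelope roofline sketch (all figures are illustrative, not measurements from these tests):

```python
# Rough roofline sketch: for a GEMM over P weights and a batch of B tokens,
# FLOPs scale with B (~2*P*B) while weight traffic stays ~P * bytes_per_weight
# (activations and KV cache ignored), so arithmetic intensity grows with batch
# size and weight-only quantization stops paying off past the GPU's ridge point.

RIDGE_FLOPS_PER_BYTE = 200  # illustrative H200-class ratio of peak FLOP/s to bandwidth

def intensity(batch_size: int, bytes_per_weight: float) -> float:
    """Approximate FLOPs per byte of weight traffic for one decode-step GEMM."""
    return 2 * batch_size / bytes_per_weight

for batch in (1, 8, 50):
    fp8 = intensity(batch, bytes_per_weight=1.0)  # FP8 weights (base DeepSeek-R1)
    awq = intensity(batch, bytes_per_weight=0.5)  # 4-bit AWQ weights
    regime = "memory-bound" if awq < RIDGE_FLOPS_PER_BYTE else "compute-bound"
    print(f"batch={batch:>2}: FP8 {fp8:6.1f} vs AWQ {awq:6.1f} FLOPs/byte -> {regime}")
```

At batch size 1 both runs are deep in the memory-bound regime, so moving half the weight bytes translates almost directly into higher decode throughput; around batch size 50 the AWQ run is near the compute roof and the smaller weights no longer buy anything.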

Thank you very much!

May I ask whether MLA is supported on Ada GPUs like the L4? And what configurations does vLLM support MLA with, aside from AWQ not being supported, as you mentioned?

I have only seen this old closed issue and this recent blog post about MLA now being supported in vLLM:
https://github.com/vllm-project/vllm/issues/4625
https://neuralmagic.com/blog/enhancing-deepseek-models-with-mla-and-fp8-optimizations-in-vllm/

Also, what is the moe_wna16 that you mention in the description, and how do I enable it with vLLM? After a bit of searching I saw that the quantization_config from the HF config can be used to change quant methods, but how would I do that for moe_wna16?

Cognitive Computations org

@ran-1 MLA should be supported on any hardware that supports FP8 scaled mm (this includes the L4), but in the case of the L4 you would need a lot of GPUs to fit the model in VRAM with the FP8 dtype, and I don't know whether the current MLA implementation supports multi-node. If you are using, say, 8x H200, you don't need to pass any extra CLI args; MLA will be enabled by default if the model supports it. If the model doesn't support it, vLLM will print a warning stating that MLA has been disabled.
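
For reference, here is a minimal offline-inference sketch for the 8x H200 case (untested here; the model id and settings are just an example). No MLA-specific flags are needed:

```python
from vllm import LLM, SamplingParams

# Assumed setup: one node with 8x H200 and a recent vLLM release with MLA support.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # FP8 base model; MLA is picked up automatically
    tensor_parallel_size=8,           # shard the model across the 8 GPUs
)

outputs = llm.generate(
    ["Explain multi-head latent attention in one sentence."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

If you ever need to force MLA off for debugging, recent vLLM versions expose a VLLM_MLA_DISABLE=1 environment variable (check the docs for your version).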

The moe_wna16 kernel is a Triton kernel that boosts performance for MoE architectures with high expert counts (DeepSeek V3/R1 has 256 experts) by not launching N CUDA kernels for N experts. You don't need to modify the quantization config; just pass the --quantization moe_wna16 flag to vLLM on the CLI.

@v2ray Thanks for responding! I assume all of this only works with vLLM 0.7.2, and that version needs to be built from source? I'm not running this model specifically but casperhansen/deepseek-r1-distill-qwen-14b-awq; will MLA be supported?

Also, I'm not running this through the CLI, so how would I pass the moe_wna16 kernel config through code or through the HF config? Do you have an example of this?

Cognitive Computations org

@radna You can read vLLM's documentation on how to change the quantization method, and you can use the new release now; you don't need to build it from source.
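
For example, here is a rough sketch of the Python equivalent of the --quantization moe_wna16 CLI flag (the repo id and GPU count are assumptions; untested here):

```python
from vllm import LLM

# Assumed setup: the base AWQ checkpoint from this repo on an 8-GPU node.
llm = LLM(
    model="cognitivecomputations/DeepSeek-R1-AWQ",
    quantization="moe_wna16",  # same effect as --quantization moe_wna16 on the CLI
    tensor_parallel_size=8,
)
```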

MLA is only supported on the base DeepSeek models, NOT on the distilled versions.

Also, if you are using a distilled version, you don't need the moe_wna16 kernel; you only need it for the base models.

@radna vLLM is currently not optimized for AWQ DeepSeek because of the following:

  • MLA is not implemented (big performance hit)
  • Multi-Token Prediction is not implemented in Transformers or vLLM, causing a hit to accuracy that may translate into a bigger penalty for quantized models
  • Under higher batch sizes, you cannot exploit weight quantization for higher throughput, because weight quantization performs best when inference is memory-bound rather than compute-bound.
Cognitive Computations org

@casperhansen is there a GitHub issue for that?
