---
license: mit
language:
- en
- zh
base_model:
- deepseek-ai/DeepSeek-V3
pipeline_tag: text-generation
library_name: transformers
---
# DeepSeek V3 AWQ

AWQ quantization of DeepSeek V3.
This quant includes modifications to some of the model code that fix an overflow issue when running in float16.
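For illustration only, here is a minimal sketch of the kind of change involved, assuming the fix clamps intermediate activations to the finite float16 range; the helper name and placement are hypothetical, not the actual patched code:

```python
import torch

FP16_MAX = torch.finfo(torch.float16).max  # 65504.0

def clamp_to_fp16_range(hidden_states: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: keep intermediate activations representable in
    # float16 so they do not overflow to inf and become NaN downstream.
    if hidden_states.dtype == torch.float16:
        bound = FP16_MAX - 1000.0  # small headroom for subsequent ops
        hidden_states = torch.clamp(hidden_states, min=-bound, max=bound)
    return hidden_states
```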
To serve using vLLM with 8x 80GB GPUs, use the following command:

```bash
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 12345 --max-model-len 65536 --trust-remote-code --tensor-parallel-size 8 --quantization moe_wna16 --gpu-memory-utilization 0.97 --kv-cache-dtype fp8_e5m2 --calculate-kv-scales --served-model-name deepseek-chat --model cognitivecomputations/DeepSeek-V3-AWQ
```
The `--max-model-len` flag keeps KV cache usage within available memory. The `moe_wna16` kernel doubles inference speed, but as of 2025/2/3 you must build vLLM from source to use it.
You can download the wheel I built for PyTorch 2.6 and Python 3.12 here.
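Once the server is running, it exposes an OpenAI-compatible API. Below is a minimal sketch of querying it with the `openai` Python client, assuming the server is reachable on localhost at the port used in the command above:

```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:12345/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-chat",  # matches --served-model-name
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```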
Inference speed (tokens per second) with batch size 1 and a short prompt:
- 8x H100: 34 TPS
- 8x A100: 27 TPS