---
license: mit
language:
- en
- zh
base_model:
- deepseek-ai/DeepSeek-V3
pipeline_tag: text-generation
library_name: transformers
---
# DeepSeek V3 AWQ

AWQ quantization of DeepSeek V3.
This quant includes modifications to some of the model code that fix an overflow issue when running in float16.
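For illustration only, here is a minimal sketch of the kind of change involved, assuming the fix clamps intermediate activations to the finite float16 range; the helper name and placement are hypothetical, not the actual patched code:

```python
import torch

FP16_MAX = torch.finfo(torch.float16).max  # 65504.0

def clamp_to_fp16_range(hidden_states: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: keep intermediate activations representable in
    # float16 so they do not overflow to inf and become NaN downstream.
    if hidden_states.dtype == torch.float16:
        bound = FP16_MAX - 1000.0  # small headroom for subsequent ops
        hidden_states = torch.clamp(hidden_states, min=-bound, max=bound)
    return hidden_states
```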
To serve using vLLM with 8x 80GB GPUs, use the following command:

```bash
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 12345 --max-model-len 65536 --trust-remote-code --tensor-parallel-size 8 --quantization moe_wna16 --gpu-memory-utilization 0.97 --kv-cache-dtype fp8_e5m2 --calculate-kv-scales --served-model-name deepseek-chat --model cognitivecomputations/DeepSeek-V3-AWQ
```
The `--max-model-len` flag keeps KV cache usage within available memory. The `moe_wna16` kernel doubles inference speed, but as of 2025/2/3 you must build vLLM from source to use it.
You can download the wheel I built for PyTorch 2.6 and Python 3.12 here.
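Once the server is running, it exposes an OpenAI-compatible API. Below is a minimal sketch of querying it with the `openai` Python client, assuming the server is reachable on localhost at the port used in the command above:

```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:12345/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-chat",  # matches --served-model-name
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```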
Inference speed (tokens per second) with batch size 1 and a short prompt:
- 8x H100: 34 TPS
- 8x A100: 27 TPS