vLLM on A100s
I'm running into a CUDA OOM issue. Since I only have A100s, I'm trying to serve a bf16 version I found here on HF (opensourcerelease/DeepSeek-V3-bf16).
I have 9 nodes with 2 GPUs each (so 18 A100s with 80 GB each, 1440 GB in total). I thought 685B params would come to roughly 1370 GB in half precision, plus some overhead. Any thoughts? I'm also trying to offload to CPU but still getting CUDA OOM...
vllm serve opensourcerelease/DeepSeek-V3-bf16 \
  --dtype bfloat16 \
  --host 0.0.0.0 \
  --port 5000 \
  --gpu-memory-utilization 0.7 \
  --cpu-offload-gb 540 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 9 \
  --trust-remote-code
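For what it's worth, here's the rough math I'm going by (weights only, assuming 2 bytes/param for bf16 and ignoring KV cache / activation overhead; also, if I'm reading the vLLM docs right, --cpu-offload-gb is per GPU, so 540 may be far more than I intended):

# weights-only sanity check (assumptions: bf16 = 2 bytes/param, no KV cache or activation overhead)
params_billion = 685
weight_gb = params_billion * 2                  # ~1370 GB just for the bf16 weights

num_gpus, gb_per_gpu, util = 18, 80, 0.7
gpu_budget_gb = num_gpus * gb_per_gpu * util    # 18 * 80 * 0.7 = 1008 GB vLLM is allowed to allocate

print(f"weights ~{weight_gb} GB vs GPU budget ~{gpu_budget_gb:.0f} GB")
# -> the weights alone already exceed what vLLM can use at 0.7 utilization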
any thoughts? pls help :(
Mark, I also encountered a similar problem.
The official release provides deployment examples for H-series cards, but there doesn't seem to be an example for the A100 series.
Also, the fp8 model requires 16 H20 cards, and I thought the bf16 model would require ~32 A100s (https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/discussions/5).
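Very roughly, counting weights only at 2 bytes/param and assuming they shard about evenly across GPUs (a real deployment also needs room for KV cache, activations, and CUDA graphs):

weight_gb = 685 * 2                             # ~1370 GB of bf16 weights
for gpus in (18, 32):
    per_gpu = weight_gb / gpus                  # weight shard per GPU
    usable = 80 * 0.9                           # usable memory per 80 GB A100 at 0.9 utilization
    print(f"{gpus} GPUs: ~{per_gpu:.0f} GB of weights per GPU, ~{usable - per_gpu:.0f} GB left over")
# 18 GPUs: ~76 GB/GPU, which doesn't even fit in ~72 GB usable; 32 GPUs: ~43 GB/GPU with ~29 GB left for cache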
Has anyone tried quantization?
@HuggingLianWang really? 32 is like a lot of GPUs.. why 32??
There's no point in running it in bf16, since the model is trained in fp8
@xiaoqianWX A100s can't run fp8 natively..
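So I guess the realistic route on A100s is a weight-only quant (int4 GPTQ/AWQ or similar) rather than fp8. Very rough numbers, assuming ~0.5 bytes/param for 4-bit weights plus ~10% for quantization scales:

weight_gb_int4 = 685 * 0.5 * 1.1                # ~377 GB of 4-bit weights
gpu_budget_gb = 18 * 80 * 0.7                   # ~1008 GB at 0.7 utilization
print(f"~{weight_gb_int4:.0f} GB of int4 weights vs ~{gpu_budget_gb:.0f} GB GPU budget")
# -> would fit on the 18 A100s with plenty of room left for KV cache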