vLLM support on A100
Can this model be served directly using vLLM on 8x A100 (80GB)?
Yes, but it will run at roughly 3.7 tokens per second.
Thank you very much, we will try it.
Succeeded.
Inference speed is about 3.5 tokens/s with batch size 1 on 8x A100 (80GB).
There's a PR that claims to boost it to 30 tokens per second; I haven't tried it, though.
Very good, about 3 tokens/s on 8x A100.
vllm serve cognitivecomputations/DeepSeek-V3-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu_memory_utilization 0.99 \
    --swap-space 32 \
    --kv-cache-dtype fp8 \
    --enforce-eager \
    --dtype float16
Works, and gets about 5.2 tokens/s on 8x A100.
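For a quick sanity check once the server is up, you can hit the OpenAI-compatible chat endpoint the command above exposes. This is just a minimal sketch: the prompt and max_tokens are placeholders, and it assumes you're calling from the same host as the server.

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "cognitivecomputations/DeepSeek-V3-AWQ",
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
          "max_tokens": 64
        }'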
vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu_memory_utilization 0.99 \
    --swap-space 32 \
    --kv-cache-dtype fp8 \
    --enforce-eager \
    --dtype float16
Works as well, also about 5.2 tokens/s on 8x A100.
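Same recipe for R1. Before sending real requests, a quick way to confirm the server loaded the model is to list the served models (again assuming you're on the same host as the server):

curl http://localhost:8080/v1/models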
Can I use it with several 3090s in Docker with CPU offload? Is it possible to start the model in CPU-only mode?
Can't say for sure, but it's highly unlikely...
Currently it will error out when using cpu-offload, and even if it's eventually supported, it will still be extremely slow. @kuliev-vitaly
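For reference, the vLLM knob for this is --cpu-offload-gb (GiB of weights offloaded to CPU per GPU). Below is only a sketch of what such an attempt would look like; the tensor-parallel size and offload amount are illustrative values, and as noted above this currently errors out for this model.

# Illustrative only; currently fails for DeepSeek-V3-AWQ per the reply above.
vllm serve cognitivecomputations/DeepSeek-V3-AWQ \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --cpu-offload-gb 64 \
    --dtype float16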