vLLM support on A100
Can this model be served directly using vLLM on 8x A100 (80GB)?
Yes, but it will run at roughly 3.7 tokens per second.
Thank you very much, we will try it.
Succeeded.
Inference speed is about 3.5 tokens/s with batch size 1 on 8x A100 (80GB).
There's a PR that claims to boost it to 30 tokens per second; I haven't tried it, though.
Very good, about 3 tokens/s on 8x A100.
vllm serve cognitivecomputations/DeepSeek-V3-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu_memory_utilization 0.99 \
    --swap-space 32 \
    --kv-cache-dtype fp8 \
    --enforce-eager \
    --dtype float16
Works, and gets about 5.2 tokens/s on 8x A100.
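For a quick sanity check once the server is up, you can hit the OpenAI-compatible chat endpoint the command above exposes. This is just a minimal sketch: the prompt and max_tokens are placeholders, and it assumes you're calling from the same host as the server.

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "cognitivecomputations/DeepSeek-V3-AWQ",
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
          "max_tokens": 64
        }'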
vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu_memory_utilization 0.99 \
    --swap-space 32 \
    --kv-cache-dtype fp8 \
    --enforce-eager \
    --dtype float16
Works as well, also about 5.2 tokens/s on 8x A100.
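Same recipe for R1. Before sending real requests, a quick way to confirm the server loaded the model is to list the served models (again assuming you're on the same host as the server):

curl http://localhost:8080/v1/models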
Can I use it with several 3090s in Docker with CPU offload? Is it possible to start the model in CPU-only mode?
Can't say for sure, but it's highly unlikely...
Currently it will error out when using cpu-offload, and even if it's eventually supported, it will still be extremely slow. @kuliev-vitaly
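For reference, the vLLM knob for this is --cpu-offload-gb (GiB of weights offloaded to CPU per GPU). Below is only a sketch of what such an attempt would look like; the tensor-parallel size and offload amount are illustrative values, and as noted above this currently errors out for this model.

# Illustrative only; currently fails for DeepSeek-V3-AWQ per the reply above.
vllm serve cognitivecomputations/DeepSeek-V3-AWQ \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --cpu-offload-gb 64 \
    --dtype float16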