
Estimated resources per config

#17
by AdrienVeepee - opened

Hi,

I'm using vLLM to deploy an inference server on an L4 instance with 24 GB of GPU memory.

I keep running into torch.OutOfMemoryError: CUDA out of memory.

Can you help me figure out how much memory each quantization would take, or how to estimate it?

Thanks!
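
For a rough back-of-envelope estimate: the weights take roughly parameter count × bytes per parameter, and the KV cache scales with layers × KV heads × head dim × context length × bytes per element. Here is a minimal sketch; the parameter count and attention shape used for Pixtral-12B below are approximate assumptions, not values reported by vLLM or the checkpoint:

```python
# Back-of-envelope GPU memory estimate for serving a model with vLLM.
# Assumed (not read from the checkpoint): Pixtral-12B's text decoder has
# ~12.4B parameters, 40 layers, 8 KV heads and a head dim of 128.

GIB = 1024 ** 3

def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory taken by the model weights alone."""
    return n_params * bytes_per_param / GIB

def kv_cache_memory_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                        seq_len: int, bytes_per_elem: float) -> float:
    """KV cache for one sequence: 2 (K and V) * layers * kv_heads * head_dim * tokens * dtype size."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / GIB

N_PARAMS = 12.4e9
KV_PER_SEQ = kv_cache_memory_gib(40, 8, 128, 32768, 1.0)  # fp8 KV cache, 32k context

for name, bytes_per_param in [("bf16/fp16", 2.0), ("fp8", 1.0), ("int4 (AWQ/GPTQ)", 0.5)]:
    w = weight_memory_gib(N_PARAMS, bytes_per_param)
    print(f"{name}: weights ~{w:.1f} GiB + ~{KV_PER_SEQ:.1f} GiB KV cache per 32k-token sequence")
```

On top of that, leave a few GiB of headroom for the CUDA context, activations, and vLLM's own buffers.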

This runs on an L40S, using 86% of the VRAM.

Parameters used
CMD ["/opt/conda/envs/vllm_env/bin/vllm", "serve", "mistralai/Pixtral-12B-2409", "--tokenizer_mode", "mistral", "--max-model-len=32768", "--kv-cache-dtype=fp8", "--swap-space=8", "--gpu-memory-utilization=0.9"]
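
As a rough sanity check of that config (all numbers are approximate assumptions along the lines of the sketch above): bf16 weights alone come to roughly 23 GiB, which is why a 24 GB L4 runs out of memory while a 48 GB L40S fits. Note that vLLM also pre-allocates its KV-cache pool up to --gpu-memory-utilization, so observed usage tends to climb toward that budget rather than the bare minimum:

```python
# Rough fit check for the command above (approximate numbers, not measured by vLLM):
# why a 24 GB L4 OOMs while a 48 GB L40S is fine.

GIB = 1024 ** 3

weights_gib = 12.4e9 * 2 / GIB                         # ~23 GiB of bf16 weights
kv_per_seq_gib = 2 * 40 * 8 * 128 * 32768 * 1.0 / GIB  # ~2.5 GiB fp8 KV cache at 32k context
overhead_gib = 2.0                                     # CUDA context, activations, etc. (rough guess)

needed = weights_gib + kv_per_seq_gib + overhead_gib
for gpu, total_gib in [("L4", 24), ("L40S", 48)]:
    budget = total_gib * 0.9  # --gpu-memory-utilization=0.9
    verdict = "fits" if needed <= budget else "OOM"
    print(f"{gpu}: budget ~{budget:.1f} GiB, minimum needed ~{needed:.1f} GiB -> {verdict}")
```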

AdrienVeepee changed discussion status to closed
