Estimated resources per config
#17
opened by AdrienVeepee
Hi,
I'm using vLLM to deploy an inference server on an L4 instance with 24 GB of GPU memory.
I keep running into torch.OutOfMemoryError: CUDA out of memory.
Can you help me figure out how much memory each quantization would take, or how to estimate it?
Thanks!
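
A rough way to estimate it: weight memory is roughly parameter count × bytes per parameter, and the KV cache scales with max-model-len, the number of layers, the number of KV heads, and the KV dtype; on top of that vLLM needs a few GiB for activations and the CUDA context. Below is a back-of-the-envelope sketch in Python. The Pixtral-12B-like dimensions (40 layers, 8 KV heads, head dim 128) are assumptions for illustration, so check the model's config.json for the real values:

```python
# Back-of-the-envelope GPU memory estimate for serving an LLM with vLLM.
# All model dimensions below are illustrative assumptions; read the real
# values from the model's config.json before trusting the numbers.

BYTES_PER_PARAM = {"fp16/bf16": 2, "fp8": 1, "int8": 1, "int4": 0.5}

def weight_memory_gib(n_params_b: float, dtype: str) -> float:
    """Weights only: parameter count x bytes per parameter."""
    return n_params_b * 1e9 * BYTES_PER_PARAM[dtype] / 1024**3

def kv_cache_memory_gib(max_len: int, n_layers: int, n_kv_heads: int,
                        head_dim: int, kv_bytes: int) -> float:
    """KV cache for ONE sequence: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * max_len * kv_bytes / 1024**3

# Assumed Pixtral-12B-like dimensions (illustrative, not authoritative):
N_PARAMS_B, N_LAYERS, N_KV_HEADS, HEAD_DIM = 12.0, 40, 8, 128
MAX_LEN = 32768  # matches --max-model-len=32768

for dtype, kv_bytes in [("fp16/bf16", 2), ("fp8", 1)]:
    w = weight_memory_gib(N_PARAMS_B, dtype)
    kv = kv_cache_memory_gib(MAX_LEN, N_LAYERS, N_KV_HEADS, HEAD_DIM, kv_bytes)
    print(f"{dtype}: weights ~{w:.1f} GiB, KV cache per 32k-token sequence ~{kv:.1f} GiB")
```

By this estimate, bf16 weights alone come to roughly 22 GiB, which already exhausts a 24 GB L4 before any KV cache is allocated, so an OOM is expected there unless the weights themselves are quantized.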
Runs on an L40S at 86% of the VRAM.
Parameters used:
CMD ["/opt/conda/envs/vllm_env/bin/vllm", "serve", "mistralai/Pixtral-12B-2409", "--tokenizer_mode", "mistral", "--max-model-len=32768", "--kv-cache-dtype=fp8", "--swap-space=8", "--gpu-memory-utilization=0.9"]
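
For comparison, --gpu-memory-utilization=0.9 tells vLLM to budget 90% of the card's total VRAM for weights plus KV cache. A quick sketch of that budget on an L40S (48 GB) versus an L4 (24 GB), reusing the rough bf16 weight estimate from above (the VRAM sizes and the 22.4 GiB figure are ballpark assumptions, not measurements):

```python
# How much VRAM vLLM will try to use with --gpu-memory-utilization=0.9.
GPU_TOTAL_GIB = {"L40S": 48, "L4": 24}   # public spec-sheet values
GPU_MEM_UTIL = 0.9                       # matches the flag in the CMD above
WEIGHTS_BF16_GIB = 22.4                  # rough estimate from the sketch above

for gpu, total in GPU_TOTAL_GIB.items():
    budget = total * GPU_MEM_UTIL
    kv_room = budget - WEIGHTS_BF16_GIB
    print(f"{gpu}: budget ~{budget:.1f} GiB, left for KV cache ~{kv_room:.1f} GiB")
# L40S: ~43.2 GiB budget, ~20.8 GiB left for KV cache blocks.
# L4:   ~21.6 GiB budget, negative -> weights alone overflow, hence the OOM.
```

This is consistent with the config above fitting comfortably on an L40S while the same command OOMs on an L4; on 24 GB, shrinking --max-model-len alone won't help, since the unquantized weights by themselves exceed the budget.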
AdrienVeepee changed discussion status to closed