
Estimated resources per config

#17
by AdrienVeepee - opened

Hi,

I'm using vLLM to deploy an inference server on an L4 instance with 24 GB of GPU memory.

I keep running into torch.OutOfMemoryError: CUDA out of memory.

Can you help me figure out how much memory each quantization would take, or how to estimate it?

Thanks!
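
For a rough back-of-envelope estimate: the weights take roughly parameter count × bytes per parameter, and the KV cache scales with layers × KV heads × head dim × context length × bytes per element. Here is a minimal sketch; the parameter count and attention shape used for Pixtral-12B below are approximate assumptions, not values reported by vLLM or the checkpoint:

```python
# Back-of-envelope GPU memory estimate for serving a model with vLLM.
# Assumed (not read from the checkpoint): Pixtral-12B's text decoder has
# ~12.4B parameters, 40 layers, 8 KV heads and a head dim of 128.

GIB = 1024 ** 3

def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory taken by the model weights alone."""
    return n_params * bytes_per_param / GIB

def kv_cache_memory_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                        seq_len: int, bytes_per_elem: float) -> float:
    """KV cache for one sequence: 2 (K and V) * layers * kv_heads * head_dim * tokens * dtype size."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / GIB

N_PARAMS = 12.4e9
KV_PER_SEQ = kv_cache_memory_gib(40, 8, 128, 32768, 1.0)  # fp8 KV cache, 32k context

for name, bytes_per_param in [("bf16/fp16", 2.0), ("fp8", 1.0), ("int4 (AWQ/GPTQ)", 0.5)]:
    w = weight_memory_gib(N_PARAMS, bytes_per_param)
    print(f"{name}: weights ~{w:.1f} GiB + ~{KV_PER_SEQ:.1f} GiB KV cache per 32k-token sequence")
```

On top of that, leave a few GiB of headroom for the CUDA context, activations, and vLLM's own buffers.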

This runs on an L40S, using 86% of the VRAM.

Parameters used
CMD ["/opt/conda/envs/vllm_env/bin/vllm", "serve", "mistralai/Pixtral-12B-2409", "--tokenizer_mode", "mistral", "--max-model-len=32768", "--kv-cache-dtype=fp8", "--swap-space=8", "--gpu-memory-utilization=0.9"]
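
As a rough sanity check of that config (all numbers are approximate assumptions along the lines of the sketch above): bf16 weights alone come to roughly 23 GiB, which is why a 24 GB L4 runs out of memory while a 48 GB L40S fits. Note that vLLM also pre-allocates its KV-cache pool up to --gpu-memory-utilization, so observed usage tends to climb toward that budget rather than the bare minimum:

```python
# Rough fit check for the command above (approximate numbers, not measured by vLLM):
# why a 24 GB L4 OOMs while a 48 GB L40S is fine.

GIB = 1024 ** 3

weights_gib = 12.4e9 * 2 / GIB                         # ~23 GiB of bf16 weights
kv_per_seq_gib = 2 * 40 * 8 * 128 * 32768 * 1.0 / GIB  # ~2.5 GiB fp8 KV cache at 32k context
overhead_gib = 2.0                                     # CUDA context, activations, etc. (rough guess)

needed = weights_gib + kv_per_seq_gib + overhead_gib
for gpu, total_gib in [("L4", 24), ("L40S", 48)]:
    budget = total_gib * 0.9  # --gpu-memory-utilization=0.9
    verdict = "fits" if needed <= budget else "OOM"
    print(f"{gpu}: budget ~{budget:.1f} GiB, minimum needed ~{needed:.1f} GiB -> {verdict}")
```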

AdrienVeepee changed discussion status to closed
