OOM on RTX 3090
Hey,
just tried out the model with vLLM and I am getting OOM errors.
Shouldn't the 34B model run on a single RTX 3090? I also tried using two 3090s, but I am still getting OOM with vLLM 0.2.0.
My inference code:
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Phind-CodeLlama-34B-v2-AWQ",
    tensor_parallel_size=1,
    dtype="half",
    quantization="awq",
    gpu_memory_utilization=0.01,
    swap_space=30,
)
```
The same thing via the API server on two 3090s:

```bash
python -m vllm.entrypoints.api_server \
    --model TheBloke/Phind-CodeLlama-34B-v2-AWQ \
    --quantization awq \
    --dtype=float16 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.1
```
I also tried various other gpu-memory-utilization values like 0.4, 0.5, 0.8, ...
You need at least 18 GB of VRAM; a single card with gpu-memory-utilization 1 should be enough.
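A minimal sketch of that suggestion with the same checkpoint (the prompt and sampling values are just placeholders for illustration):

```python
from vllm import LLM, SamplingParams

# Single RTX 3090, letting vLLM claim the full 24 GB as suggested above.
llm = LLM(
    model="TheBloke/Phind-CodeLlama-34B-v2-AWQ",
    quantization="awq",
    dtype="half",
    tensor_parallel_size=1,
    gpu_memory_utilization=1.0,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
out = llm.generate(["def fibonacci(n):"], params)
print(out[0].outputs[0].text)
```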
vLLM's AWQ does not support multi-GPU for the time being. Try the GPTQ version for dual-GPU inference instead.
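If you go that route, a rough sketch (assuming TheBloke also published a GPTQ upload of this model and that your vLLM build already has GPTQ support, which 0.2.0 may not):

```python
from vllm import LLM

# Sketch only: the GPTQ model name and GPTQ support are assumptions, see note above.
llm = LLM(
    model="TheBloke/Phind-CodeLlama-34B-v2-GPTQ",
    quantization="gptq",
    dtype="half",
    tensor_parallel_size=2,  # shard the weights across both 3090s
)
```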
The 3090 has 24 GB of VRAM, so it should easily handle the model. Also, vLLM supports tensor-parallel execution across multiple GPUs.
I found the problem. The default --max-model-len (CLI flag) / max_model_len (Python API) seems to be too long. Setting it manually to 4000 works for a single 3090; with two RTX 3090s, values above 6000 are possible.
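For anyone hitting the same thing, a minimal sketch of the fix on a single 3090 (keyword name per the Python API; on the CLI the equivalent is the --max-model-len flag):

```python
from vllm import LLM

# Capping the context length shrinks the KV cache so it fits in 24 GB.
# 4000 worked for a single 3090 here; two 3090s allow >6000.
llm = LLM(
    model="TheBloke/Phind-CodeLlama-34B-v2-AWQ",
    quantization="awq",
    dtype="half",
    tensor_parallel_size=1,
    max_model_len=4000,
)
```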