OOM on RTX 3090
Hey,
just tried out the model with vLLM and I am getting OOM errors.
Shouldn't the 34B model run on a single RTX 3090? I also tried using two 3090s, but I am still getting OOM with vLLM 0.2.0.
My inference code:
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Phind-CodeLlama-34B-v2-AWQ",
    tensor_parallel_size=1,
    dtype="half",
    quantization="awq",
    gpu_memory_utilization=0.01,
    swap_space=30,
)
```
The same thing via the API server on two 3090s:

```bash
python -m vllm.entrypoints.api_server \
    --model TheBloke/Phind-CodeLlama-34B-v2-AWQ \
    --quantization awq \
    --dtype=float16 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.1
```
I also tried various other gpu-memory-utilization values like 0.4, 0.5, 0.8, ...
You need at least 18 GB of VRAM; a single card with gpu-memory-utilization 1 should be enough.
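A minimal sketch of that suggestion with the same checkpoint (the prompt and sampling values are just placeholders for illustration):

```python
from vllm import LLM, SamplingParams

# Single RTX 3090, letting vLLM claim the full 24 GB as suggested above.
llm = LLM(
    model="TheBloke/Phind-CodeLlama-34B-v2-AWQ",
    quantization="awq",
    dtype="half",
    tensor_parallel_size=1,
    gpu_memory_utilization=1.0,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
out = llm.generate(["def fibonacci(n):"], params)
print(out[0].outputs[0].text)
```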
vLLM's AWQ does not support multi-GPU for the time being. Try the GPTQ version for dual-GPU inference instead.
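If you go that route, a rough sketch (assuming TheBloke also published a GPTQ upload of this model and that your vLLM build already has GPTQ support, which 0.2.0 may not):

```python
from vllm import LLM

# Sketch only: the GPTQ model name and GPTQ support are assumptions, see note above.
llm = LLM(
    model="TheBloke/Phind-CodeLlama-34B-v2-GPTQ",
    quantization="gptq",
    dtype="half",
    tensor_parallel_size=2,  # shard the weights across both 3090s
)
```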
The 3090 has 24 GB of VRAM, so it should easily handle the model. Also, vLLM supports tensor-parallel execution across multiple GPUs.
I found the problem. The default --max-model-len (CLI flag) / max_model_len (Python API) seems to be too long. Setting it manually to 4000 works for a single 3090; with two RTX 3090s, values above 6000 are possible.
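For anyone hitting the same thing, a minimal sketch of the fix on a single 3090 (keyword name per the Python API; on the CLI the equivalent is the --max-model-len flag):

```python
from vllm import LLM

# Capping the context length shrinks the KV cache so it fits in 24 GB.
# 4000 worked for a single 3090 here; two 3090s allow >6000.
llm = LLM(
    model="TheBloke/Phind-CodeLlama-34B-v2-AWQ",
    quantization="awq",
    dtype="half",
    tensor_parallel_size=1,
    max_model_len=4000,
)
```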