Memory consumption much higher on multi-GPU setup

#41
by simonesartoni1 - opened

I have just deployed this model on an AWS g5.12xlarge instance (4 A10G GPUs with 24 GB each) using this command:

" GPTQ_BITS=4 GPTQ_GROUPSIZE=32 sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Llama-2-70B-chat-GPTQ --num-shard 4 --quantize gptq --revision gptq-4bit-32g-actorder_True"

From the documentation, it should take 40.66 GB, but the current GPU memory usage is about 17 GB on each GPU, 68 GB in total.
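
For reference, the raw per-GPU arithmetic on those figures (nothing here is measured beyond the numbers quoted above):

```bash
# 40.66 GB of quantized weights sharded across 4 GPUs:
echo "scale=2; 40.66 / 4" | bc   # ≈ 10.16 GB of weights per GPU
# Observed usage is ~17 GB per GPU (4 * 17 = 68 GB total), so the gap is:
echo "17 - 10.16" | bc           # ≈ 6.84 GB per GPU on top of the weights themselves
```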

Can someone explain the reason behind the higher GPU consumption?

If I guess correctly, from my experience textgen-webui uses AutoGPTQ by default, with several techniques that increase VRAM usage for the sake of inference speed. Just check out the "Model" page of textgen-webui and the AutoGPTQ loader for details.

And still, AutoGPTQ is a bit slower than the ExLlamav2_HF loader. With ExLlamav2_HF, I can confirm on my local 2x3090 rig that this model consumes about 21G/17G after several rounds, whereas my configured split is 21G/21G. Would you try that loader instead? There are startup arguments in the textgen-webui readme for switching these loaders.
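
Something like this, as a sketch of what I mean (the flag names are from my text-generation-webui install, and the model folder name is just the usual download directory for this repo, so double-check both against the readme):

```bash
# Launch text-generation-webui with the ExLlamav2_HF loader and a 21G/21G split
# across two GPUs. The model folder name and flags are assumptions; adjust to your setup.
python server.py \
    --model TheBloke_Llama-2-70B-chat-GPTQ \
    --loader exllamav2_hf \
    --gpu-split 21,21
```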
