Inference time with TGI
#15
opened by jacktenyx
Thanks for posting this model. I was able to run inference with TGI on a single 40 GB A100 with the following command:
docker run \
-p 8080:80 \
-e GPTQ_BITS=4 \
-e GPTQ_GROUPSIZE=1 \
--gpus all \
--shm-size 5g \
-v $volume:/data ghcr.io/huggingface/text-generation-inference:latest \
--model-id TheBloke/Llama-2-70B-chat-GPTQ \
--max-input-length 4096 \
--max-total-tokens 8192 \
--quantize gptq \
--sharded false
This generated responses at about 225 ms/token. However, when running the unquantized model sharded across 4 A100s, I got around 45 ms/token. Am I missing a config option or environment variable that would improve the inference time, or is this expected behavior with this quantization?
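(For anyone reproducing these numbers: a rough way to estimate per-token latency, assuming the container above is listening on localhost:8080, is to time a fixed-length generation against the /generate endpoint and divide the wall-clock time by the number of new tokens. The prompt and max_new_tokens value below are just illustrative.)
# Sketch of a latency check; elapsed seconds / 256 * 1000 ≈ ms per token
time curl -s http://localhost:8080/generate \
-X POST \
-H 'Content-Type: application/json' \
-d '{"inputs": "Explain quantization in one paragraph.", "parameters": {"max_new_tokens": 256}}'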
I get similar latency (>100 ms/token) even with half the input length and total_tokens. It is also slower than quantizing with bitsandbytes-nf4 (~51 ms/token).
This is odd, since GPTQ is generally said to be faster for inference than bitsandbytes.
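For comparison, a bitsandbytes-nf4 run can be launched along these lines (a sketch rather than an exact command; it assumes the unquantized fp16 weights, e.g. meta-llama/Llama-2-70b-chat-hf, which are gated and need an access token, and it keeps the other flags from the command above but with the halved context lengths):
# Sketch: nf4 run against the fp16 weights; the model id and $HF_TOKEN are placeholders for your setup
docker run \
-p 8080:80 \
-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
--gpus all \
--shm-size 5g \
-v $volume:/data ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-2-70b-chat-hf \
--max-input-length 2048 \
--max-total-tokens 4096 \
--quantize bitsandbytes-nf4 \
--sharded false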
...