
Slow inference speed using default vs 3rd party tokenizer files

#19
by mancub - opened

I downloaded the tokenizer files that @reeducator uploaded for the safetensors model and noticed a big drop in inference speed compared to what I had been getting previously (with the files from the TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g model).

When using the default files included with this model, I get 1.5-1.7 tokens/s, while with the files from TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g I get 5.7 to over 6 tokens/s. This is on an RTX 3090 with CUDA.

At first I thought there could've been some regression in oobabooga/text-generation-webui, so I went back a week, but that made no difference. I don't know enough about the inner workings of these models to understand what the problem could be, so perhaps someone else can confirm this, or let me know if I'm doing something wrong here?

I noticed the same thing on my RTX 3060. I changed back to the TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g model and the speed is back to 4-5 tokens/s.

You guys have to set use_cache to true in config.json - that is very important for the speed, and it fixes the slow generation.
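For reference, here is a minimal sketch (not from this thread) of how that setting can be applied with the transformers AutoConfig API instead of editing config.json by hand; the local model path is a placeholder and should be adjusted to wherever the model was downloaded.

```python
from transformers import AutoConfig

# Hypothetical local path to the downloaded model directory.
model_dir = "models/vicuna-13b"

# Load the existing config.json, enable the key/value cache, and save it back.
# use_cache=True lets generation reuse past attention states instead of
# recomputing them for every new token, which is where the speedup comes from.
config = AutoConfig.from_pretrained(model_dir)
config.use_cache = True
config.save_pretrained(model_dir)
```

Editing config.json directly so that it contains the line `"use_cache": true` has the same effect.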

Thanks @CyberTimon for the heads up, and thanks @reeducator for updating the config.json so quickly!

mancub changed discussion status to closed
