Performance / speed of text generation

#19
by mattkallo - opened

I am using an A100 (40GB) to run this model (falcon-40b-instruct-GPTQ). It's taking roughly 120 seconds to answer a question with a limit of 200 output tokens, so under 2 tokens/s. Is this expected? What's the best performance seen so far? Thanks

I am using the same code as what's given in the model card.
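For context, here is a minimal timing sketch along the lines of the model-card setup (the repo id and loading arguments are assumptions, not copied from the card):

```python
import time
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name = "TheBloke/falcon-40b-instruct-GPTQ"  # repo id assumed

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    use_safetensors=True,
    trust_remote_code=True,   # Falcon ships custom modelling code
    device="cuda:0",
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

# Time the generation of up to 200 new tokens and report tokens/s
start = time.time()
output = model.generate(**inputs, max_new_tokens=200)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} new tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.2f} tokens/s")
```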

Yeah I'm afraid that is expected with the Falcon GPTQ at the moment. It has a major speed problem that hasn't been resolved yet. I put a note in the README about it.

Recently we got preliminary support for GPU-accelerated Falcon GGMLs. I have four repos for those. They perform quite a bit better than the GPTQ. Unfortunately they're not supported in many clients/UIs yet, but they did just get support in ctransformers (a Python library that also supports LangChain), and also LoLLMS-UI. So you may well find those preferable to the GPTQs.
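If you want to try the GGMLs from Python, here is a minimal sketch with ctransformers, assuming one of the Falcon GGML repos and a GPU-offload layer count that fits your card (both are placeholders, adjust as needed):

```python
from ctransformers import AutoModelForCausalLM

# Load a Falcon GGML and offload layers to the GPU.
# Repo id and gpu_layers are assumptions - pick the repo/quant and layer count for your setup.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/falcon-40b-instruct-GGML",  # assumed repo id
    model_type="falcon",
    gpu_layers=60,   # number of layers to offload to the GPU
)

# Generate up to 200 new tokens from a prompt
print(llm("Write a short poem about the sea.", max_new_tokens=200))
```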

Another option is to download the original unquantised model and then use load_in_4bit=True to use bitsandbytes. That's still slow (maybe 4 tokens/s) and slower than the GGML, but it's faster than the GPTQ.
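A minimal sketch of that bitsandbytes route, assuming the original tiiuae repo and typical 4-bit loading arguments (not taken from any model card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantise to 4-bit on the fly with bitsandbytes while loading
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,       # bitsandbytes 4-bit quantisation
    device_map="auto",
    trust_remote_code=True,  # Falcon ships custom modelling code
)

inputs = tokenizer("Explain GPTQ in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```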

Thanks for the update.

mattkallo changed discussion status to closed
