Performance / speed of text generation
I am using an A100 (40GB) to run this model (falcon-40b-instruct-GPTQ). It's taking roughly 120 seconds to answer a question with a limit of 200 output tokens. Is this expected? What's the best performance seen so far? - Thanks
I am using the same code as what's given in the model card.
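For reference, this is roughly the loading code I'm running, adapted from the model card (a sketch from my side - treat the generation settings as placeholders):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/falcon-40b-instruct-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Load the 4-bit GPTQ weights with AutoGPTQ
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    use_safetensors=True,
    trust_remote_code=True,   # Falcon ships custom modelling code
    device="cuda:0",
    use_triton=False,
)

prompt = "Tell me about AI"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# ~200 new tokens is where I see the ~120 second generation time
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=200)
print(tokenizer.decode(output[0]))
```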
Yeah I'm afraid that is expected with the Falcon GPTQ at the moment. It has a major speed problem that hasn't been resolved yet. I put a note in the README about it.
Recently we got preliminary support for GPU-accelerated Falcon GGMLs. I have four repos for those. They perform quite a bit better than the GPTQ. Unfortunately they're not supported in many clients/UIs yet, but they did just get support in ctransformers (a Python library that also integrates with LangChain), and also LoLLMS-UI. So you may well find those preferable to the GPTQs.
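With ctransformers it's roughly this (a sketch - the repo name, GGML filename and `gpu_layers` value below are placeholders, use whichever quantisation file you download):

```python
from ctransformers import AutoModelForCausalLM

# Repo and file names are placeholders - check the GGML repo for the
# actual quantisation filenames.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/falcon-40b-instruct-GGML",
    model_file="falcon-40b-instruct.ggmlv3.q4_K.bin",
    model_type="falcon",
    gpu_layers=60,  # number of layers to offload to the GPU
)

print(llm("Tell me about AI", max_new_tokens=200, temperature=0.7))
```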
Another option is to download the original unquantised model and then load it with `load_in_4bit=True` to use bitsandbytes. That's still very slow (maybe 4 tokens/s) and slower than the GGML, but it's faster than the GPTQ.
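Roughly like this (a sketch, assuming you have enough VRAM for the 40B weights in 4-bit, around 20-25GB plus overhead):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantise on the fly to 4-bit with bitsandbytes at load time
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map="auto",
    trust_remote_code=True,   # Falcon ships custom modelling code
)

inputs = tokenizer("Tell me about AI", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```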
Thanks for the update.