I have a question about tokens per second
I have a PC with an RTX 3060 (12GB), an i7-10700 and 32GB of RAM, running a "gptj 6b GPTQ-4bit-128g" model in text-generation-webui.
This model uses a maximum of 8GB of memory, and even though the speed is already very good, I would like to know if the number of tokens per second can be accelerated further, so that the waiting time when using the API from Python or TavernAI is shorter (since both send more than one request to the text-generation-webui API before delivering a single chat response). Do you know of any settings that can speed up this process?
In text-gen-ui you can try the xformers option (the --xformers launch flag); apparently that can improve performance a bit.
I recently learned that single-prompt performance is bottlenecked on the CPU for most users. If you run a prompt and look at your GPU usage %, you will likely see it is below 100%, maybe a lot lower, like 50% or 25%. This is because Python is using 100% of one CPU core, and that single core is the limit on performance.
If you're using the model from Python code and want to send many prompts, then for maximum performance you shouldn't use text-gen-ui; instead, write your own code using the Hugging Face pipeline (https://huggingface.co/docs/transformers/main_classes/pipelines) with batched data. You can get massive speed improvements by batching data with pipelines, because the CPU is no longer the bottleneck and the full power of the GPU can be used.
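For what it's worth, here is a minimal sketch of what batched generation with a pipeline could look like. The model name, prompts and batch size are just placeholders, not anything specific to your setup:

```python
import torch
from transformers import pipeline

# Minimal sketch of batched generation with a Hugging Face pipeline.
# The model name, prompts and batch size are placeholders.
generator = pipeline(
    "text-generation",
    model="EleutherAI/gpt-j-6b",   # assumption: an unquantised GPT-J checkpoint
    device=0,                      # first CUDA GPU
    torch_dtype=torch.float16,
)

# Batched generation needs a pad token; GPT-J has none by default,
# so reuse the EOS token.
generator.tokenizer.pad_token_id = generator.model.config.eos_token_id

prompts = [
    "Write a short poem about the sea.",
    "Explain what a token is in one sentence.",
    "List three uses for a Raspberry Pi.",
]

# batch_size controls how many prompts the pipeline feeds to the GPU at once.
outputs = generator(prompts, batch_size=4, max_new_tokens=64, do_sample=True)

for prompt, result in zip(prompts, outputs):
    print(prompt, "->", result[0]["generated_text"])
```

The important part is passing a list of prompts plus batch_size, rather than calling the pipeline once per prompt in a loop.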
I am working on Python code that makes it easy to load GPTQ models and run inference using AutoGPTQ. It will be ready to release in the next 24 hours. I'll link the code to you when it's ready, and you can try using it as a base for your task. I think you will get better performance with that.
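This isn't that code, but as a rough idea of what loading a GPTQ model with AutoGPTQ typically looks like (the model directory and options below are assumptions, not your exact setup):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Rough sketch only; not the code mentioned above.
# The model directory is a placeholder for whatever GPTQ checkpoint you use.
model_dir = "models/gpt-j-6b-GPTQ-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,  # assumption: weights are stored as .safetensors
    use_triton=False,
)

prompt = "Tell me about AI"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```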
God, you are amazing. Please let me know when you have the code. I'm so excited!! Thank you so much.
Hello, sorry to bother you, but I am very interested in the Python code that you are working on. I think you know so much and I am not at that level. Don't forget to let me know if there is any update. Thank you so much.
Sorry for the delay. I still haven't released the code because I've not had time to clean it up.
But what I said about batched data for pipeline may not be true... or at least, it has problems. I tested it again yesterday and got corrupt data back. It seems that the pipeline with batches works fine if the prompts are all the same or quite similar. But when using a batch of very different prompts, something strange happens: some prompts return OK, others come back as gibberish. I've just not had time to investigate why that is yet. I don't know if it's a bug, something I'm doing wrong, or just an expected result.
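I haven't confirmed this is the cause, but one thing worth ruling out with batches of very different prompts is padding: decoder-only models generally need left padding and an explicit pad token for batched generation, otherwise the shorter prompts end up generating from pad tokens. A sketch of that setup (the model name and prompts are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the usual setup for batching prompts of different lengths
# with a decoder-only model. The model name is a placeholder.
model_name = "EleutherAI/gpt-j-6b"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no pad token by default

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda:0")

prompts = [
    "Short prompt.",
    "A much longer prompt, so the shorter one has to be padded on the left.",
]

# Left padding keeps the real text at the end of each row, so generation
# continues from the prompt instead of from pad tokens.
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda:0")
output_ids = model.generate(
    **batch, max_new_tokens=32, pad_token_id=tokenizer.pad_token_id
)

for ids in output_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```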
Ping me again tomorrow and I'll send you the current code.