response generation too slow

#9
by hussainwali1 - opened

Is there any way to speed up the generation? Also, it keeps on generating without stopping.

This is an unquantised model, so it requires a lot of VRAM and does a lot of computation.

If you have an NVIDIA GPU, you could use a quantised model like https://huggingface.co/TheBloke/stable-vicuna-13B-GPTQ. That should run faster and need less VRAM.
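For reference, here is a rough sketch of loading that GPTQ build with `transformers` and capping how long the response can get. It assumes `transformers`, `optimum`, `auto-gptq` and `accelerate` are installed; the prompt text, token limit and sampling settings are just example values to adjust for your setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/stable-vicuna-13B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the quantised weights on the GPU (requires accelerate)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# StableVicuna-style prompt format (example prompt)
prompt = "### Human: How do I speed up generation?\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# max_new_tokens caps the response length so it can't generate forever;
# eos_token_id lets the model stop earlier on its own.
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Setting `max_new_tokens` (and letting it stop at the EOS token) is also what keeps the output from running on indefinitely.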

How are you running the model?
