
Load model text-generation-webui issues

#4 by lazyDataScientist - opened

Running into an issue while using RunPod with an A100. After downloading the model, I get this error message for all versions of the model (both Qn_0 and Qn_k).
You mentioned that you got it working on a single A100; did you need to do any extra steps to get text-generation-webui working with Mixtral models?

Traceback (most recent call last):
  File "/workspace/text-generation-webui/modules/ui_model_menu.py", line 209, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
  File "/workspace/text-generation-webui/modules/models.py", line 89, in load_model
    output = load_func_map[loader](model_name)
  File "/workspace/text-generation-webui/modules/models.py", line 259, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
  File "/workspace/text-generation-webui/modules/llamacpp_model.py", line 91, in from_pretrained
    result.model = Llama(**params)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp_cuda/llama.py", line 923, in __init__
    self._n_vocab = self.n_vocab()
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp_cuda/llama.py", line 2184, in n_vocab
    return self._model.n_vocab()
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp_cuda/llama.py", line 250, in n_vocab
    assert self.model is not None
AssertionError
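
For what it's worth, that AssertionError is raised by llama-cpp-python when the underlying llama.cpp call returns a null model handle, i.e. the GGUF file could not be loaded at all; an outdated llama-cpp-python build without Mixtral/MoE support is a common cause. Below is a minimal sketch (not the webui's own code) to check whether the installed library can open the file outside of text-generation-webui; the model path is a hypothetical placeholder:

# Minimal sketch: load the GGUF directly with llama-cpp-python to isolate the failure.
# The path below is a hypothetical placeholder; point it at your downloaded file.
from llama_cpp import Llama

try:
    llm = Llama(
        model_path="/workspace/models/model.Q4_K_M.gguf",  # hypothetical path
        n_gpu_layers=-1,  # offload all layers to the A100
        n_ctx=4096,       # context length; adjust to the model's limit
    )
    print("Model loaded, vocab size:", llm.n_vocab())
except Exception as e:
    # An older llama-cpp-python that does not know the Mixtral architecture
    # fails here too; upgrading the package usually resolves it.
    print("Direct load failed:", e)

If the direct load fails the same way, the problem is the llama-cpp-python build rather than anything in the webui.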

You need to update Transformers on RunPod before launching it; I followed this tutorial: https://youtu.be/WjiX3lCnwUI?si=RnhYQR4eWWfeXCms&t=560
The 4x13B model works on a single A100, using 96% of the GPU with FP16, so use that.
For GGUF, I think the latest Ooba update works with the latest llama.cpp release, but I don't use GGUF in Ooba. Sorry!

tl;dr: If you use an A100 on RunPod, use the unquantized files; they work!
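
In case it helps anyone reading along, here is a minimal sketch of the unquantized FP16 route on a single A100 using transformers and accelerate. The repo id is a hypothetical placeholder for the unquantized weights, and note that Mixtral-style MoE checkpoints need a recent transformers release, which is why the update step above matters:

# Minimal sketch: load the unquantized weights in FP16 on one A100.
# The repo id is a hypothetical placeholder; use the unquantized model repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/model-4x13B"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,  # FP16, as suggested above
    device_map="auto",          # let accelerate place the weights on the GPU
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))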

Awesome! Thank you! Love the work you have been doing!
