Problems on runpod.io
What am I doing wrong? @TheBloke, I have used your runpod template countless times and always managed to run prompts, but this time I'm failing. I'm using an A6000 instance on runpod with the thebloke/cuda11.8.0-ubuntu22.04-oneclick:latest image. I follow the instructions on the model card closely and use the prompt template. At first glance everything looks fine: the model is loaded and 70% of the 48 GB of VRAM is used. But when I hit Generate there is no reaction besides my prompt being copied over to the output. There is also no activity visible on the CPU and GPU utilization gauges.
It's my fault: I've not updated that template yet for the latest ExLlama changes required for Llama 2 70B.
Well, I did update it, but I never tested it or pushed it to the :latest tag.
Could you test it for me? Edit your template or apply a template override, and change the docker container to thebloke/cuda11.8.0-ubuntu22.04-oneclick:21072023
Then test again and let me know. If it works, I will push that to :latest, and it will then be the default with my template for all users.
Unfortunately, the behavior is still the same. Is it correct that after model selection the loader is auto-configured to AutoGPTQ with wbits set to None? I tried setting it to 4 and reloading the model, but it didn't change anything. I also tried ExLlama, but it didn't work.
You want to use ExLlama. It'll be much faster. I didn't realise you were using AutoGPTQ, as most people use ExLlama these days.
AutoGPTQ can be used, but you have to tick "no inject fused attention" in the Loader settings. And yes, it's correct that wbits is set to None for AutoGPTQ; leave it at None (it's read automatically from quantize_config.json).
So:
- please try Loader = ExLlama with the updated container and let me know if that works
- If you have time, I'd be grateful if you also tested Loader = AutoGPTQ with "no_inject_fused_attention" ticked (again with the updated container)
(make sure to click Reload Model after changing the Loader and any loader settings; if you prefer working from a terminal, the equivalent launch flags are below)
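If launching from the command line instead, roughly these flags should do the same thing. A minimal sketch, assuming the flag names from the webui's --help at the time (double-check with python server.py --help on your build), run from /workspace/text-generation-webui:

python server.py --listen --api --loader exllama
# or, to test AutoGPTQ without fused attention:
python server.py --listen --api --loader autogptq --no_inject_fused_attention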
Well, I tried ExLlama first, but it didn't work. Then I read the instructions on the model card again, where you wrote that it's auto-configured from the config file, so I tried that too, and it set the loader to AutoGPTQ. Okay, let me try again to be sure.
Do you know how to SSH in? Or use the Web Terminal?
If you do, can you do:
tail -100 /workspace/logs/*
and copy that output and paste it here
If not I will try to check it myself a little later
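If you want to watch the error happen live while you hit Generate, following the webui log should also work (assuming the logs stay under /workspace/logs as above):

tail -f /workspace/logs/text-generation-webui.log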
I'm an avid Linux user, no problem. Unfortunately I'm busy for the next couple of hours. I looked for logs in /var/log; I didn't know the app logs go to /workspace.
Since Krassmann is busy I've tested as well, and I can confirm that thebloke/cuda11.8.0-ubuntu22.04-oneclick:21072023 does not work out of the box. Here are the contents of build-llama-cpp-python.log:
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
This system supports AVX2.
Collecting llama-cpp-python
Downloading llama_cpp_python-0.1.74.tar.gz (1.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 13.2 MB/s eta 0:00:00
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'done'
Preparing metadata (pyproject.toml): started
Preparing metadata (pyproject.toml): finished with status 'done'
Requirement already satisfied: typing-extensions>=4.5.0 in /usr/local/lib/python3.10/dist-packages (from llama-cpp-python) (4.7.1)
Requirement already satisfied: numpy>=1.20.0 in /usr/local/lib/python3.10/dist-packages (from llama-cpp-python) (1.24.4)
Requirement already satisfied: diskcache>=5.6.1 in /usr/local/lib/python3.10/dist-packages (from llama-cpp-python) (5.6.1)
Building wheels for collected packages: llama-cpp-python
Building wheel for llama-cpp-python (pyproject.toml): started
Building wheel for llama-cpp-python (pyproject.toml): finished with status 'done'
Created wheel for llama-cpp-python: filename=llama_cpp_python-0.1.74-cp310-cp310-linux_x86_64.whl size=1330178 sha256=5f451ec3e0600060c27bb8f82154947e461dd058485872b3cb4f332df5b54040
Stored in directory: /tmp/pip-ephem-wheel-cache-b6nmb0k4/wheels/e4/fe/48/cf667dccd2d15d9b61afdf51b4a7c3c843db1377e1ced97118
Successfully built llama-cpp-python
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.74
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: python3 -m pip install --upgrade pip
Here are the contents of text-generation-webui.log after trying both ExLlama and AutoGPTQ with no_inject_fused_attention:
Launching text-generation-webui with args: --listen --api
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
Starting streaming server at ws://0.0.0.0:5005/api/v1/stream
Starting API at http://0.0.0.0:5000/api
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Downloading the model to models/TheBloke_FreeWilly2-GPTQ
100%|██████████| 7.02k/7.02k 19.4MiB/s
100%|██████████| 15.3k/15.3k 49.2MiB/s
100%|██████████| 4.77k/4.77k 15.1MiB/s
100%|██████████| 679/679 2.67MiB/s
100%|██████████| 137/137 535kiB/s
100%|██████████| 35.3G/35.3G 352MiB/s
100%|██████████| 183/183 726kiB/s
100%|██████████| 411/411 1.63MiB/s
100%|██████████| 1.84M/1.84M 3.67MiB/s
100%|██████████| 500k/500k 17.7MiB/s
100%|██████████| 649/649 2.69MiB/s
100.64.0.24 - - [23/Jul/2023 13:40:21] code 404, message Not Found
100.64.0.24 - - [23/Jul/2023 13:40:21] "GET / HTTP/1.1" 404 -
100.64.0.24 - - [23/Jul/2023 13:40:27] code 404, message Not Found
100.64.0.24 - - [23/Jul/2023 13:40:27] "GET / HTTP/1.1" 404 -
100.64.0.25 - - [23/Jul/2023 13:40:33] code 404, message Not Found
100.64.0.25 - - [23/Jul/2023 13:40:33] "GET / HTTP/1.1" 404 -
100.64.0.25 - - [23/Jul/2023 13:40:39] code 404, message Not Found
100.64.0.25 - - [23/Jul/2023 13:40:39] "GET / HTTP/1.1" 404 -
2023-07-23 13:42:06 INFO:Loading TheBloke_FreeWilly2-GPTQ...
2023-07-23 13:42:11 INFO:Loaded the model in 5.12 seconds.
Traceback (most recent call last):
File "/workspace/text-generation-webui/modules/text_generation.py", line 331, in generate_reply_custom
for reply in shared.model.generate_with_streaming(question, state):
File "/workspace/text-generation-webui/modules/exllama.py", line 98, in generate_with_streaming
self.generator.gen_begin_reuse(ids)
File "/usr/local/lib/python3.10/dist-packages/exllama/generator.py", line 186, in gen_begin_reuse
self.gen_begin(in_tokens)
File "/usr/local/lib/python3.10/dist-packages/exllama/generator.py", line 171, in gen_begin
self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True, lora = self.lora)
File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 849, in forward
r = self._forward(input_ids[:, chunk_begin : chunk_end],
File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 930, in _forward
hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 470, in forward
hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 388, in forward
key_states = key_states.view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
RuntimeError: shape '[1, 14, 64, 128]' is invalid for input of size 14336
2023-07-23 13:51:34 INFO:Loading TheBloke_FreeWilly2-GPTQ...
2023-07-23 13:51:34 INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-4bit--1g', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': False, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': None, 'quantize_config': None, 'use_cuda_fp16': True}
2023-07-23 13:51:44 WARNING:The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
2023-07-23 13:52:11 WARNING:skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
2023-07-23 13:52:11 WARNING:models/TheBloke_FreeWilly2-GPTQ/tokenizer_config.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-07-23 13:52:11 INFO:Loaded the model in 37.04 seconds.
Traceback (most recent call last):
File "/workspace/text-generation-webui/modules/callbacks.py", line 55, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
File "/workspace/text-generation-webui/modules/text_generation.py", line 297, in generate_with_callback
shared.model.generate(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 423, in generate
return self.model.generate(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1572, in generate
return self.sample(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2619, in sample
outputs = self(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 195, in forward
key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear_old.py", line 250, in forward
out = out + self.bias if self.bias is not None else out
RuntimeError: The size of tensor a (8192) must match the size of tensor b (1024) at non-singleton dimension 2
Running pip show exllama makes it clear that exllama is still on the old 0.0.5+cu117 version, which does not support Llama 2 70B (presumably why both loaders fall over with the attention-shape errors above: the older code doesn't handle the 70B's grouped-query attention). If I update it manually and then restart Ooba, it works. So the main issue, at least as far as exllama is concerned, seems to be that it is not updated automatically.
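For anyone else hitting this before the image is rebuilt, a quick way to check which build you are on from the Web Terminal (package name as it appears in this container):

pip show exllama | grep -i version   # reports Version: 0.0.5+cu117 on the broken image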
OK thanks, I will sort it out
OK, it should now be fixed. thebloke/cuda11.8.0-ubuntu22.04-oneclick:latest is updated, so the default Runpod templates should work fine now. I just tested myself with Llama-2-70B-Chat-GPTQ and it worked fine.
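For anyone who wants a quick sanity check of generation without the UI: since the pod launches with --api, the blocking API on port 5000 should accept a request along these lines (endpoint and JSON fields as in the webui's api extension examples at the time; verify against your version):

curl -s http://localhost:5000/api/v1/generate -H 'Content-Type: application/json' -d '{"prompt": "Write a haiku about GPUs.", "max_new_tokens": 60}'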
I confirm it's working now. Thank you all.