Tips on running on a 3090 with 24GB of VRAM?
I have played around with --pre_layer values from 10 up to 48; all give the same error:
Traceback (most recent call last):
  File "/home/st/GIT/oobabooga_linux_2/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/home/st/GIT/oobabooga_linux_2/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/st/GIT/oobabooga_linux_2/installer_files/env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/st/GIT/oobabooga_linux_2/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/home/st/GIT/oobabooga_linux_2/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/home/st/GIT/oobabooga_linux_2/text-generation-webui/modules/exllama_hf.py", line 57, in __call__
    self.ex_model.forward(torch.tensor([seq[:-1]], dtype=torch.long), cache, preprocess_only=True, lora=self.lora)
  File "/home/st/GIT/oobabooga_linux_2/installer_files/env/lib/python3.10/site-packages/exllama/model.py", line 860, in forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/home/st/GIT/oobabooga_linux_2/installer_files/env/lib/python3.10/site-packages/exllama/model.py", line 466, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/home/st/GIT/oobabooga_linux_2/installer_files/env/lib/python3.10/site-packages/exllama/model.py", line 377, in forward
    query_states = self.q_proj.forward(hidden_states, lora)
  File "/home/st/GIT/oobabooga_linux_2/installer_files/env/lib/python3.10/site-packages/exllama/model.py", line 195, in forward
    out = cuda_ext.ext_q4_matmul(x, self.q4, self.width)
  File "/home/st/GIT/oobabooga_linux_2/installer_files/env/lib/python3.10/site-packages/exllama/cuda_ext.py", line 49, in ext_q4_matmul
    q4_matmul(x, q4, output)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Output generated in 0.73 seconds (0.00 tokens/s, 0 tokens, context 9, seed 439549603)
I'm not sure, but I thought pre_layer doesn't work with ExLlama?
What values did you set for max_seq_len and compress_pos_emb?
An 8192 context size currently doesn't fit into your 24GB of VRAM, and --pre_layer probably doesn't work with ExLlama (but I might be wrong).
For 8K models (on a 4090), I've been using max_seq_len 4096 with ExLlama and compress_pos_emb set to 2.
Try that. I will try this model soon as well with those settings and report back.
Confirmed: it works fine with the settings I gave above (tested on a 4090; it should be fine on a 3090 too).
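For reference, my launch command looks roughly like the sketch below. The flag names (--loader, --max_seq_len, --compress_pos_emb) are from the text-generation-webui version I'm running and may differ slightly in yours, and the model directory name is just a placeholder:

    # Load a 4-bit SuperHOT-8K GPTQ model entirely on the GPU with ExLlama,
    # capping the context at 4096 and compressing positional embeddings by 2x
    python server.py \
        --loader exllama_hf \
        --model TheBloke_Some-33B-SuperHOT-8K-GPTQ \
        --max_seq_len 4096 \
        --compress_pos_emb 2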
FYI I recently heard from someone that pre_layer does work on ExLlama - although the performance is pretty slow, understandably, so it might be better to use GGML with KoboldCpp instead.
I've noticed ExLlama is sometimes twice as fast as other loaders, at least with 30-33B models. Not sure how that could be considered slow...
Performance with pre_layer, which offloads part of the model to system RAM instead of putting it all on the GPU, will be slow: no more than 2-3 tokens/s. It's used when you want to load a model larger than your VRAM can hold.
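As a rough sketch of that offloading setup (the model name is a placeholder, the loader spelling and flag names may vary by webui version, and I'm assuming a 33B LLaMA with 60 layers, so --pre_layer 50 keeps the last 10 layers in system RAM):

    # Offload: put the first 50 of ~60 layers on the GPU, leave the rest in RAM
    python server.py \
        --loader gptq-for-llama \
        --model some-33B-GPTQ-model \
        --pre_layer 50

Expect low single-digit tokens/s whenever any layers run off the GPU.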
Yes, ExLlama is much faster than other methods when no CPU/RAM offloading is done.
Yes, I notice this with 65B models despite having 96GB of DDR5 memory, but it's more like 0.6-1 tokens/s in my case :-(
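And for anyone taking the GGML/KoboldCpp route suggested earlier, the equivalent knob is the number of layers offloaded to the GPU. A minimal sketch, assuming a quantized GGML file (the filename is a placeholder, and flag names may vary between KoboldCpp releases):

    # Run a GGML model with KoboldCpp, offloading 50 layers to the 3090 via cuBLAS
    python koboldcpp.py \
        --model some-33b.ggmlv3.q4_K_M.bin \
        --usecublas \
        --gpulayers 50 \
        --contextsize 4096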