Beware! Requires the new CUDA branch.
Would not load in 0cc4m/ooba/etc.
The new CUDA is slower :(
Which new CUDA? The oobabooga GPTQ CUDA branch?
That's the one that won't work.
This one: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda will load it, but it's slow.
huh
Yeah, I'm looking for another 65B since I don't want to download 127GB of the fp32 model to convert myself. Will see if the maderix one works. Unfortunately none of them have a merged Alpaca LoRA.
Did you guys try alpaca-lora-65B-GPTQ-4bit-128g.no-act-order.safetensors? That was the file I made for the old GPTQ, i.e. ooba.
Yeah... it did not work. That is what I downloaded. If you made it with the current CUDA branch...
What exactly is the issue? The no-act-order models I make normally work in ooba's GPTQ.
I will do some testing tomorrow.
It doesn't load and gives me a state_dict error unless I use the newest CUDA branch.
Ah, you must be using CPU offload. Yes, I've seen that problem with pre_layer specifically. I will look into it.
Was able to run the model on 2 GPUs, 24GB each, by using --gpu-memory 17 17.
Works well until the context is about 1.1K tokens, then it runs out of memory.
Nope, no offload. P40/3090 :) I'll try it with AutoGPTQ and see if I get better perf now that it's fixed.
Are you splitting it across GPUs though? Maybe that causes the same issue as CPU offload, i.e. not everything on the same device.
Yeah, as you're into AutoGPTQ now, just try that instead and let me know.
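Roughly what loading it with AutoGPTQ looks like — just a sketch, not tested: the folder name and basename are guesses based on the file above, and it assumes a quantize_config.json sits next to the weights:

```python
# Hedged sketch: load the no-act-order safetensors with AutoGPTQ's CUDA kernels
# (use_triton=False so the P40 can run it too). Paths/basenames are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "alpaca-lora-65B-GPTQ-4bit"  # hypothetical local folder
basename = "alpaca-lora-65B-GPTQ-4bit-128g.no-act-order"  # file from this thread, minus .safetensors

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    model_basename=basename,
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,
)

prompt = "### Instruction:\nSay hello.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```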
I am splitting it... I got it running now. I wonder if it was old installs of gptq_llama hanging around. I think I can only do half the context and get about 1 it/s, slightly over if I just do instruct.
I'd really like to try a 1024-groupsize version to see if it would run full context, but you only have that for Triton. The 3090 can use Triton but the P40 cannot. AutoGPTQ loads, but it can only do very small contexts because it loads lopsided across the GPUs.
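If the AutoGPTQ split is coming out lopsided, you can pass a max_memory map to from_quantized (it goes through to accelerate's device map) and cap each card yourself — again just a sketch, and the 17GiB figures are placeholders like the --gpu-memory 17 17 above:

```python
# Hedged sketch: cap both 24GB cards so the 65B layers spread more evenly
# and leave headroom for context. Limits and paths are assumptions, not tested.
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "alpaca-lora-65B-GPTQ-4bit",                                # hypothetical local folder
    model_basename="alpaca-lora-65B-GPTQ-4bit-128g.no-act-order",
    use_safetensors=True,
    use_triton=False,                                           # CUDA kernels so the P40 works
    max_memory={0: "17GiB", 1: "17GiB"},                        # per-GPU caps instead of auto-balancing
)
```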