4bit version

#1 by KnutJaegersberg - opened

I wonder if this thing can run on a beefy server CPU once quantized. @TheBloke What do you think, does this run on CPU in 4-bit at useful speeds?

Ah I know... @Camelidae you have quantized LLaMA 65B even to 2-bit. What are your experiences, performance-wise? When does quantization do too much harm? And on what hardware do you run the 3-bit and 4-bit versions?

I'm having a look at it now, will let you know how I get on.

OK, here are GGML 4-bit and 2-bit quantised versions: https://huggingface.co/TheBloke/alpaca-lora-65B-GGML
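To address the CPU question above: the 4-bit files should work with mainline llama.cpp, so a CPU-only run on a big server looks roughly like this. This is only a sketch, with the model filename and thread count as placeholders rather than the actual names in the repo:

# build mainline llama.cpp (the 4-bit q4 GGML format is already supported there)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# run entirely on CPU: -t sets the thread count, -n how many tokens to generate
./main -m /path/to/alpaca-lora-65B-q4.bin -t 16 -n 256 -p "Your prompt here"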

I had some problems making GPTQs due to my attempts killing the Runpod host I was on :) I got one GPTQ made, then the host died and I couldn't get back in to access the file I'd saved. I didn't want to start from scratch, and they've promised to fix it for tomorrow, so I will continue from where I left off then.

@TheBloke Any suggestions on which project to use to get this 2-bit version working? Oobabooga doesn't seem to like it, and I'm doing this on a remote server as my local machine doesn't have an A100 or a decent connection. I only know how to get llama.cpp working locally, and it'll take me a day to download the model, so any hints would be appreciated. I've been trying to get it working since shortly after your upload and keep slamming my face into brick walls.

I believe that right now the only way to get the 2-bit version working is to use the q2q3 branch of the llama.cpp fork I listed in the README.

I imagine that sometime soon this will be merged into mainline llama.cpp. And once it's been merged, it'll likely only be a few days until it's supported by the various tools that interface with llama.cpp, like text-generation-webui and the Python bindings. If you're a developer then I suppose it may be possible to merge the new q2q3 code into one of the llama.cpp interfaces to enable it to work as a server, but that's not something I've looked at, or plan to look at.

But right now the only simple option I know of would be to run it on the command line using:

# clone the q2q3 fork of llama.cpp
git clone https://github.com/sw/llama.cpp llama-q2q3
cd llama-q2q3
# switch to the branch that adds the 2-bit/3-bit quantisation formats
git checkout q2q3
make
# then run inference with the usual main arguments
./main ...
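Once it builds, a full invocation looks something like this (the model filename here is a placeholder for whatever the 2-bit GGML file from the repo above is actually called, and -t should roughly match your physical core count):

./main -m /path/to/alpaca-lora-65B-q2.bin -t 16 -n 256 -p "Your prompt here"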

If that doesn't work for you then I'd suggest just waiting. Once the q2 code is merged things will be easier. And it's quite possible that by that time, a better 65B model will be available anyway :)

KnutJaegersberg changed discussion status to closed
