Where GGUF?


Yeah, it would be great if this could be applied to GGUF or EXL2 quantisation; GPTQ isn't very widely used anymore.

I hadn't used GPTQ for like a year. When I tried this in the latest ooba, it produced garbage characters lol

Owner

I am looking into the instructions and will give it a try.

If anyone knows how to convert GPTQ models to GGUF or EXL2, please help me out. Thank you!
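For context, my rough understanding of the pipeline is: dequantize each GPTQ linear layer back to fp16, save a plain HF checkpoint, and then re-quantize with llama.cpp's own tools. Below is a minimal sketch of the per-group dequantization math only -- real GPTQ checkpoints pack the integer weights and zero points into int32, so the names and shapes here are assumptions for illustration, not any library's actual storage format:

```python
# Illustrative sketch: the per-group dequantization a GPTQ -> fp16 -> GGUF
# conversion would need to perform. Not the packed layout of real GPTQ files.
import torch

def dequantize_group_quantized(q_int: torch.Tensor,   # integer weights, shape (in_features, out_features)
                               scales: torch.Tensor,  # per-group scales, shape (in_features // group_size, out_features)
                               zeros: torch.Tensor,   # per-group zero points, same shape as scales
                               group_size: int = 128) -> torch.Tensor:
    """Map integer weights back to fp16: w = (q - zero) * scale, applied per group of input rows."""
    group_idx = torch.arange(q_int.shape[0]) // group_size       # group index for each input row
    w = (q_int.float() - zeros[group_idx].float()) * scales[group_idx].float()
    return w.half()

# Toy usage: a 2-bit layer with 256 input features, 8 output features, group size 128.
q = torch.randint(0, 4, (256, 8))             # 2-bit values are integers in [0, 3]
s = torch.rand(256 // 128, 8) * 0.1           # one scale per (group, output) pair
z = torch.full((256 // 128, 8), 2)            # mid-range zero point
w_fp16 = dequantize_group_quantized(q, s, z)  # fp16 weights, ready to save and re-quantize
```

Once the fp16 checkpoint is saved, llama.cpp's usual HF-to-GGUF conversion script and its quantize tool should be able to take it from there (though re-quantizing an already 2-bit model will compound the quantization error).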

Hmm, I wonder if there's a more universal format to convert to? I've been trying to figure out how to run this on Apple Silicon (Metal/MPS). The closest thing I've found so far is Mistral.rs -- the dev is lightning quick with updates, it seems, and recently added some GPTQ support, FYI. IIRC, one should be able to convert the GPTQ model to GGUF/GGML, but even if not, one can definitely run a GPTQ-quantized model on Mistral.rs -- 2-bit, odd bit-widths, no problem. Plus, it's Rust, which in itself means some boosts to performance/reliability!

Sadly, that support doesn't translate over to Metal just yet.

Metal/MPS can run with a Triton kernel now, so that's another possibility if you really wanna open up your wonderful, awesome-sauce quants to everyone 😉🤣 (just spitballing -- I was only skimming all of this a week ago, but I believe that's a valid avenue as far as GPTQ compatibility goes).

Owner

@BuildBackBuehler
Thanks for your interest.

Recently, T-MAC from Microsoft added support for running EfficientQAT-quantized models. Additionally, the reported speed is even faster than llama.cpp.

Hello @ChenMnZ,

Can we run this quantized model with T-MAC or Mistral.rs? Have you tried them with this 2-bit model? Thanks!

Owner

@MLDataScientist The owner of T-MAC has tried this. You can refer to https://github.com/OpenGVLab/EfficientQAT/issues/3#issuecomment-2298608707 for details.
