k-quants possible?
They offer a better quality-to-size tradeoff than the legacy Q4_0 and Q8_0 formats.
They should be, though I'd like to iron out some of the issues first before adding more features. The legacy quants were the easiest to implement.
You should be able to use llama.cpp's llama-quantize binary (built during a regular make) to do so. I haven't tried it yet since I don't have the bandwidth to download the 24GB FP16 model, but my read of the code suggests it can be done without much hassle. The GGUFs might be missing some of the metadata needed to guide it to the right tensor types, though.
https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L17605
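For reference, the invocation would look something like the following. The file names are placeholders, and this is untested on these non-LLM ggufs:

```bash
# Quantize an FP16 GGUF down to a k-quant; positional args are
# input file, output file, and target quant type.
./llama-quantize model-F16.gguf model-Q4_K_S.gguf Q4_K_S
```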
> They should be, though I'd like to iron out some of the issues first before adding more features. The legacy quants were the easiest to implement.
I wonder if imatrix would be possible here as well, possibly with a dataset consisting of images in a variety of styles.
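For a text model the flow is roughly the one below; whether the imatrix tool could even run this model's forward pass to collect activation statistics is the open question (file names are placeholders):

```bash
# Collect an importance matrix from calibration data, then pass it
# to the quantizer so it can weight quantization error accordingly.
./llama-imatrix -m model-F16.gguf -f calibration.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat model-F16.gguf model-Q4_K_S.gguf Q4_K_S
```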
> You should be able to use llama.cpp's llama-quantize binary (built during a regular make) to do so. I haven't tried it yet since I don't have the bandwidth to download the 24GB FP16 model, but my read of the code suggests it can be done without much hassle. The GGUFs might be missing some of the metadata needed to guide it to the right tensor types, though.
>
> https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L17605
But llama-quantize doesn't support quantizing non-LLM models.
First iteration of k-quants added. I still need to work out the logic for the _M variants, so only _S for now (all tensors use the same k-quant type, with a few exceptions kept in FP16 and small tensors kept in FP32). Make sure to update the custom node to use them.
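A minimal sketch of what that per-tensor selection could look like; all names and thresholds below are illustrative assumptions, not the actual patch:

```python
# Hypothetical per-tensor type selection matching the description above:
# one k-quant type for most tensors, FP16 for a few sensitive exceptions,
# FP32 for small tensors like biases and norm weights.
SMALL_TENSOR_THRESHOLD = 1024              # assumed cutoff, not the real one
FP16_EXCEPTIONS = ("time_embed.",)         # hypothetical sensitive-key prefixes

def pick_tensor_type(name: str, n_elements: int, target: str = "Q4_K_S") -> str:
    if n_elements < SMALL_TENSOR_THRESHOLD:
        return "F32"    # small tensors stay in full precision
    if any(name.startswith(prefix) for prefix in FP16_EXCEPTIONS):
        return "F16"    # exceptions stay in half precision
    return target       # everything else gets the same k-quant type
```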