You may want to add an "act-order" GPTQ quantization.
I'll start by saying I'm not against having a Triton GPTQ quant.
I'd really like to see --act-order --true-sequential --groupsize 128 versus --true-sequential --groupsize 128, and I'd like to see that tested against 13B and 30B models. It's my understanding that the perplexity gains from --act-order are fairly parameter dependent and are around ~0.1 at 13B (please correct me if I'm wrong). Any additional info would be great. I'm also curious what meaningful output shifts result from a 0.1 perplexity gain (I know people are saying act-order helps with rare contraction issues around words like couldn't), so if anyone has good examples of how that 0.1 manifests, I think that would be good to share around.
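For concreteness, the comparison I have in mind is something along these lines. This is only a sketch assuming the usual GPTQ-for-LLaMa llama.py interface; the script name, the c4 calibration set, and --wbits 4 are assumptions on my part, and only the flag combinations come from this discussion:

# Sketch only -- assumes the GPTQ-for-LLaMa llama.py CLI; script name, calibration
# set, and --wbits 4 are assumptions, not confirmed in this thread.

# Run A: act-order + true-sequential + groupsize 128
python llama.py /path/to/llama-13b c4 --wbits 4 \
    --act-order --true-sequential --groupsize 128 \
    --save llama13b-4bit-128g-actorder.pt

# Run B: true-sequential + groupsize 128 only
python llama.py /path/to/llama-13b c4 --wbits 4 \
    --true-sequential --groupsize 128 \
    --save llama13b-4bit-128g.pt

# Compare the perplexities each run reports (if I recall correctly the script
# evaluates wikitext2/ptb/c4 after quantizing), then repeat for 30B.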
Maybe @Monero /YellowRose#1776 could chime in and give some more info.
Those results are for the instruct version, Pygmalion Metharme 7B. In my experience the best combination depends on the model.
I'm not familiar with perplexity results beyond that atm, sorry!
No worries and thanks for compiling the data you did! Hopefully someone will compile meaningful differences in outputs at some point.
I have been watching various development efforts really focus on dropping perplexity as low as it will go, trading away inference speed and RAM/VRAM for the sake of 0.1 perplexity gains, and it's got my dev brain wondering what the specific wins are, so people can make informed decisions about the value of various quants at various parameter counts. Right now it's all pretty loosey goosey, which is fine, but those sorts of decisions might really start to matter when people are budgeting for non-hobby projects around these models and their quantized versions. If I'm a project lead, I want inference speed as high as possible on the least hardware possible if there's a negligible difference between 6.0 and 5.9 perplexity. That's less applicable for GPTQ since Triton runs on Linux, where most non-hobby projects will live, but as ggml advances and gets GPU inference figured out, it could come into play with its giant menagerie of quantization formats.
I'll also dream of someone getting --act-order working on CUDA so this entire discussion becomes pointless for GPTQ quants. I messed with it a bit, but I don't have enough free time to dig into it, so I didn't make much progress.
Act-order works without group size on the ooba version of GPTQ. Just don't encode with the newest CUDA branch, as I think it changes the format yet again, for the third time.
So the choices are either act-order + true-sequential, or group size. Hence the nonsense perplexity they got when the two were used together.
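If I'm following that correctly, the two safe recipes on that branch would look roughly like this. Again a sketch assuming the llama.py interface; treat the script name and the extra arguments as assumptions:

# Sketch of the two combinations described above (ooba's GPTQ fork, older CUDA branch);
# exact script name and extra arguments are assumptions on my part.

# Option 1: act-order + true-sequential, no group size
python llama.py /path/to/model c4 --wbits 4 --act-order --true-sequential --save model-4bit.pt

# Option 2: group size only, no act-order
python llama.py /path/to/model c4 --wbits 4 --groupsize 128 --save model-4bit-128g.pt

# Combining --act-order and --groupsize on that branch reportedly produces nonsense perplexity.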