Is there documentation for the quantization calibration on long text?
Quantized on ~300K tokens of calibration data: two Vicuna-format chats, a sci-fi story, and a fiction story, all at long context. This should yield better storywriting performance than the default exl2 quantization.
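For reference, here is a minimal sketch of how a custom calibration set like this could be packaged for exllamav2's `convert.py`. The file names are placeholders, and it assumes the `-c` flag accepts a parquet file with a `text` column (true of recent exllamav2 versions as far as I know, but check `convert.py --help`):

```python
# Sketch: packaging custom calibration text for exllamav2's convert.py.
# File paths are placeholders; requires pandas plus pyarrow (or
# fastparquet) for to_parquet().
import pandas as pd

sources = [
    "vicuna_chat_1.txt",   # placeholder paths standing in for the
    "vicuna_chat_2.txt",   # ~300K tokens of Vicuna-format chat logs
    "scifi_story.txt",     # and long-form fiction described above
    "fiction_story.txt",
]

rows = []
for path in sources:
    with open(path, encoding="utf-8") as f:
        rows.append({"text": f.read()})  # one long document per row

pd.DataFrame(rows).to_parquet("calibration.parquet")

# Then quantize against the custom dataset, roughly:
#   python convert.py -i /path/to/model -o work_dir -cf out_dir \
#       -b 4.0 -c calibration.parquet -l 8192 -r 40
# (-l is the calibration row length, -r the number of rows; exact
# flags may differ between exllamav2 versions.)
```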
You posted this at an interesting time. There is an ongoing discussion of just what constitutes useful calibration data (https://github.com/ggerganov/llama.cpp/discussions/5006), as well as parallel discussions on Reddit and Discord.
In a nutshell, it appears that my strategy of "quantize on a lot of fiction" is possibly useless. It's not really worth documenting what I did because, as it turns out, it's particularly bad for exllama below 4 bpw. I would not recommend using this quantization; use lonestriker's generic quantizations instead for now.
Just a random update on this: I found my exl2 quants had very high perplexity at short context, but relatively low perplexity at long context.
Perhaps they were "overtuned" for long context.
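For anyone who wants to reproduce that comparison, here is a rough sketch of measuring perplexity over short vs. long windows. It uses transformers for simplicity, so it applies to an HF-format checkpoint; exl2 quants would need exllamav2's own eval tooling instead. The model path, eval file, and window sizes are placeholders, not my exact setup:

```python
# Sketch: mean perplexity over non-overlapping windows of varying
# length, to compare short- vs. long-context behavior of one model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/model"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Tokenize one long plain-text eval file (placeholder path).
with open("eval.txt", encoding="utf-8") as f:
    ids = tok(f.read(), return_tensors="pt").input_ids.to(model.device)

def ppl(window: int, n_windows: int = 8) -> float:
    """Mean perplexity over up to n_windows chunks of `window` tokens."""
    losses = []
    with torch.no_grad():
        for i in range(n_windows):
            chunk = ids[:, i * window : (i + 1) * window]
            if chunk.shape[1] < window:
                break  # ran out of eval text
            out = model(chunk, labels=chunk)  # shifted CE loss
            losses.append(out.loss.item())
    return torch.tensor(losses).mean().exp().item()

for w in (512, 2048, 8192):  # short vs. long context windows
    print(f"ctx {w}: ppl {ppl(w):.2f}")
```

If the short-window numbers come out much higher than the long-window ones, that would match the "overtuned to long context" pattern described above.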