Feedback
I had an immense amount of fun testing SorcererLM.
Your quants are neat (I used IQ3_S), hence my comment here.
It takes a bit of tweaking to find the right sampling settings, but at very low temp it is absolutely delirious. It turned a heroic fantasy adventure into an R-rated circus... I laughed so hard, and I still do just thinking about it.
LLMs are meant to be fun, so thank you guys for this!
Hey @Nexesenex, thanks for the kind words! Passed this along to @rAIfle as well. Elated to hear you had as much fun testing as we did. Also appreciate your feedback on the samplers - it can be somewhat tricky to dial them in these days with so many new ones like DRY popping up. We're definitely due for a revamp of the recommended settings soon.
PS: Huge fan of your work pioneering and popularizing the usage of importance matrices in quantization. The use of 'iMat' in the nomenclature always made perfect sense to me too. If you ever need a second set of eyes testing new experimental quants, feel free to reach out anytime.
Thanks!
Ikawrakow's work on quantization and importance matrices indeed deserved to be popularized!
I was just thinking about making a custom quant of your model to reach 50k context with the best quality my 64GB of VRAM can afford, like I did for Mixtral 8x22B. I'd be glad if you shared your iMatrix on the repo.
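On the runtime side, the trick to fitting ~50k context alongside the weights is quantizing the KV cache. A minimal sketch with standard llama.cpp flags (the model filename is a placeholder, not an actual quant):

```bash
# Sketch only: a q8_0 KV cache roughly halves the cache footprint vs fp16,
# which is what frees room for long context on a fixed VRAM budget.
# -fa (flash attention) is required to quantize the V cache.
./llama-server -m sorcererlm-custom.gguf \
    -c 51200 -ngl 99 -fa -ctk q8_0 -ctv q8_0
```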
Know also that, if there's no other way around it, you can compute the iMatrix from a Q4_0 with minimal quality loss (less than +0.01 ppl on the final quant), and with only ~0.001 ppl loss if you use Q8_0.
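For reference, that workflow looks roughly like this (model and calibration file names are placeholders):

```bash
# Sketch: compute the importance matrix on a Q8_0 quant when the fp16
# model doesn't fit in memory, then feed it to the low-bit quantization.
./llama-quantize model-f16.gguf model-q8_0.gguf Q8_0
./llama-imatrix -m model-q8_0.gguf -f calibration.txt -o imatrix.dat -ngl 99
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq3_s.gguf IQ3_S
```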
As for the 8x22B models, you can easily try scaling ffn_up down to the quant tier immediately below the one used for ffn_gate and ffn_down; that's what I did to reach 50k context with Mixtral 8x22B with minimal quality loss, and what I will do with yours. You can do that either by modifying the quant strategies in the llama.cpp source, or by using my PR to do it from the CLI:
https://github.com/ggerganov/llama.cpp/pull/8917/files
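To illustrate the idea only - the exact flag syntax depends on the PR and the llama-quantize build, so treat this as a hypothetical sketch assuming a `--tensor-type`-style per-tensor override:

```bash
# Hypothetical syntax: keep ffn_gate/ffn_down at the IQ3_S target while
# dropping ffn_up one tier lower to shave VRAM for context.
./llama-quantize --imatrix imatrix.dat \
    --tensor-type ffn_up=iq3_xxs \
    model-f16.gguf model-iq3_s-custom.gguf IQ3_S
```

ffn_up is a reasonable tensor to sacrifice because its precision tends to matter less to output quality than ffn_down's, so the size savings come relatively cheap.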
I downloaded your settings for SorcererLM, then made my own. They are a good base, and sharing them is very good practice, because samplers can be very complex for casual users (is anyone casual while running Mixtral 8x22B? :D)
Absolutely, major props to Ikawrakow as well for all the continued work on the i-quants. And that sounds amazing - just uploaded the imatrix file to the repo. Eager to see the quality improvements on the context front!
Great pointers around the importance matrix calculation as well. In the past I aimed to use Q8_0 to generate the importance matrix, but more recently I've switched to fp16, or occasionally fp32. While perplexity and even KLD are nearly identical in my testing (personally, I have not seen a difference), it's not much trouble to run it at the larger size these days, so I figured if it saves others the hassle, why not.
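For anyone who wants to reproduce that comparison, llama.cpp's perplexity tool can save reference logits from the fp16 model and then score a quant against them. A rough sketch (the test file name is a placeholder):

```bash
# Sketch: save fp16 logits once, then measure a quant's KL divergence
# (and perplexity) against that reference.
./llama-perplexity -m model-f16.gguf -f wiki.test.raw \
    --kl-divergence-base logits-f16.bin
./llama-perplexity -m model-iq3_s.gguf \
    --kl-divergence-base logits-f16.bin --kl-divergence
```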
Thank you for the link to the PR and for explaining your thought process too. Very interested in testing this across some other models where the RULER score could have been a little better.
Glad the settings were a decent base at least; there's always some tinkering to do with templates and samplers, but that's the fun of local. Sure, there's a slight learning curve, but also a big payoff in having more control over the output :D