Smaller quant for 16GB?
Would it be possible to do a smaller quant? I'd really like to try :-)
Yeah. Is there an ideal size for 16GB? 2.7bpw?
I think I may have discussed this with you before, can't remember lol
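My rough back-of-the-envelope for a 16GB card - the parameter count and overhead are just ballpark assumptions, not measurements:

```python
# Rough VRAM budget for an exl2 quant on a 16GB card (all numbers approximate).
params_b = 34.4   # approx. parameter count of Yi-34B, in billions
bpw = 2.7         # candidate average bits per weight
vram_gb = 16.0

weights_gb = params_b * bpw / 8     # billions of params * bits / 8 ~= GB of weights
leftover_gb = vram_gb - weights_gb  # what's left for context cache, activations, CUDA overhead

print(f"~{weights_gb:.1f} GB of weights at {bpw} bpw, ~{leftover_gb:.1f} GB left over")
```

At 2.7bpw that's roughly 11.6GB of weights, leaving a few GB for the cache, so 2.6-2.7 is about the ceiling for 16GB.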
Haha, yes, I guess we did - here: https://huggingface.co/brucethemoose/Yi-34B-200K-DARE-megamerge-v8-31bpw-exl2-fiction/discussions/1
There you provided 2.6 and 2.4bpw, which was perfect, but back then both went off the rails right away, and you wanted to look into it and recommended the IQ2_XXS from TheBloke :-)
Not sure if you ever found out why that happened, but maybe that was a different method (without the imatrix?)
I remember now! Should be available here:
https://huggingface.co/brucethemoose/Yi-34B-200K-RPMerge-exl2-2.67bpw
> There you provided 2.6 and 2.4bpw, which was perfect, but back then both went off the rails right away.
I actually did, sort of! I quantized those models at 32K context. It turns out their low-context perplexity was horrible, but once the context got bigger (10K+) the model's performance picked back up. That's why the initial testing and perplexity numbers looked so bad. Some discussion/numbers here, though there was more testing as well: https://huggingface.co/DrNicefellow/ChatAllInOne-Yi-34B-200K-V1/discussions/1#65be7f2db7db0ab0959cb859
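If you want to sanity-check that yourself, this is the kind of comparison I mean - a minimal sketch assuming a transformers-loadable copy of the model and any long text file (the model ID and file name are placeholders, and the numbers won't match the exl2 measurement exactly):

```python
# Minimal sketch: perplexity of the same text at short vs long context lengths.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "brucethemoose/Yi-34B-200K-RPMerge"   # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

text = open("long_sample.txt").read()            # any sufficiently long document
ids = tok(text, return_tensors="pt").input_ids.to(model.device)

for ctx_len in (2048, 8192, 16384):              # short vs long context windows
    chunk = ids[:, :ctx_len]
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss   # mean NLL over the window
    print(f"ctx {ctx_len}: ppl = {math.exp(loss.item()):.2f}")
```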
Anyway, I quantized the above file with vanilla exllama settings, so it shouldn't be disastrous at low context. I'll make a 2.4 as well.
Maybe a longer-context version as well? I'll see.
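For reference, the difference between "vanilla" settings and the earlier 32K-context quants is basically just the calibration length passed to exllamav2's convert.py. Paths here are placeholders and the flags are from memory, so double-check against `python convert.py -h`:

```python
# Sketch of the exllamav2 convert.py call, wrapped in subprocess for illustration.
# Paths are placeholders; flags are from memory - verify with `python convert.py -h`.
import subprocess

def quantize(bits, cal_length=2048):
    subprocess.run([
        "python", "convert.py",
        "-i", "/models/Yi-34B-200K-RPMerge",       # source fp16 model (placeholder path)
        "-o", "/tmp/exl2-work",                    # working dir for measurement/compile
        "-cf", f"/models/RPMerge-exl2-{bits}bpw",  # compiled output folder
        "-b", str(bits),                           # target average bits per weight
        "-l", str(cal_length),                     # calibration row length in tokens
    ], check=True)

quantize(2.67)                       # "vanilla" default-length calibration
# quantize(2.67, cal_length=32768)   # the earlier long-context calibration that hurt low-ctx ppl
```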