Memory usage vs LoneStriker's Mistral Small 8.0bpw-h8-exl2

#1 by finding1 - opened

With a 24 GB graphics card I can load LoneStriker/Mistral-Small-Instruct-2409-8.0bpw-h8-exl2 with max-seq-len 16384 and cache-mode FP16, but for this quant I have to dial max-seq-len back to 13312. I would have thought the memory requirements would be identical. Do you have any insight into why the usage is different?
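For what it's worth, the FP16 cache itself should be the same size for either quant at the same max-seq-len, since it depends only on the architecture and context length, not on the quantization. A quick back-of-the-envelope check, assuming the usual Mistral Small 2409 config values (56 layers, 8 KV heads, head dim 128; treat these as assumptions and verify against config.json):

```python
# Rough FP16 KV-cache size for Mistral Small 2409. The config values below
# are assumptions taken from the published model card; adjust if your
# config.json differs.
num_layers = 56
num_kv_heads = 8
head_dim = 128
seq_len = 16384
bytes_per_elem = 2  # FP16

# K and V each store num_kv_heads * head_dim values per token per layer.
cache_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{cache_bytes / 2**30:.2f} GiB")  # ~3.50 GiB, identical for both quants
```

So whatever difference there is must come from the weights themselves.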

I can confirm what you're seeing, using my own quant of Mistral Small that I never uploaded (there were already several available by the time I got to it). My Mistral-Small-Instruct-2409-8.0bpw-h8-exl2 loads with 23490 MiB reported in use by nvtop at 16384 context, while this model (TheDrummer_Cydonia-22B-v1-8.0bpw-h8-exl2) fails to load at 16384 context on the same GPU. If I let it use two GPUs, it uses 23650 MiB on device 0 and 2060 MiB on device 1. I used the same settings and the same version of ExLlamaV2 for both quants. Perhaps the different values in the weights simply compress differently?
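One way to test that guess is to compare the total on-disk size of the two quants' weight shards: EXL2 quantization targets an *average* bits-per-weight, so two nominally 8.0bpw quants of different finetunes can land at somewhat different actual sizes. A minimal sketch (the directory paths are placeholders for wherever your local copies live):

```python
from pathlib import Path

def quant_size_gib(model_dir: str) -> float:
    """Sum the sizes of the safetensors shards in an EXL2 quant directory."""
    total = sum(f.stat().st_size for f in Path(model_dir).glob("*.safetensors"))
    return total / 2**30

# Placeholder paths -- point these at your local quant directories.
for d in ("models/Mistral-Small-Instruct-2409-8.0bpw-h8-exl2",
          "models/TheDrummer_Cydonia-22B-v1-8.0bpw-h8-exl2"):
    print(f"{d}: {quant_size_gib(d):.2f} GiB")
```

If the Cydonia quant comes out larger on disk, that alone would account for the lower maximum context on a 24 GB card.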

Just for fun, I decided to see what would happen if I loaded the unquantized models. Both used exactly the same amount of memory, as one would expect.
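If anyone wants to reproduce that unquantized comparison, here is a minimal sketch using transformers and torch. The model IDs are assumptions; substitute whichever checkpoints you are comparing, and note that a 22B model in FP16 needs roughly 44 GB, hence `device_map="auto"` to spread it across available GPUs and CPU RAM:

```python
import torch
from transformers import AutoModelForCausalLM

# Model IDs are assumptions -- substitute the checkpoints you want to compare.
for model_id in ("mistralai/Mistral-Small-Instruct-2409",
                 "TheDrummer/Cydonia-22B-v1"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    # Sum allocated memory across all visible GPUs.
    used = sum(torch.cuda.memory_allocated(i)
               for i in range(torch.cuda.device_count()))
    print(f"{model_id}: {used / 2**30:.2f} GiB on GPU")
    del model
    torch.cuda.empty_cache()
```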
