Memory usage vs LoneStriker's Mistral Small 8.0bpw-h8-exl2

#1 by finding1 - opened

With a 24 GB graphics card I can load LoneStriker/Mistral-Small-Instruct-2409-8.0bpw-h8-exl2 with max-seq-len 16384 and cache-mode FP16, but for this quant I have to dial max-seq-len back to 13312. I would have thought the memory requirements would be identical. Do you have any insight into why the usage is different?
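For what it's worth, the FP16 cache itself should be the same size for either quant at the same max-seq-len, since it depends only on the architecture and context length, not on the quantization. A quick back-of-the-envelope check, assuming the usual Mistral Small 2409 config values (56 layers, 8 KV heads, head dim 128; treat these as assumptions and verify against config.json):

```python
# Rough FP16 KV-cache size for Mistral Small 2409. The config values below
# are assumptions taken from the published model card; adjust if your
# config.json differs.
num_layers = 56
num_kv_heads = 8
head_dim = 128
seq_len = 16384
bytes_per_elem = 2  # FP16

# K and V each store num_kv_heads * head_dim values per token per layer.
cache_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{cache_bytes / 2**30:.2f} GiB")  # ~3.50 GiB, identical for both quants
```

So whatever difference there is must come from the weights themselves.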

I can confirm what you're seeing, using my own quant of Mistral Small that I never uploaded (there were already several available by the time I got to it). My Mistral-Small-Instruct-2409-8.0bpw-h8-exl2 loads with 23490 MiB reported in use by nvtop at 16384 context, while this model (TheDrummer_Cydonia-22B-v1-8.0bpw-h8-exl2) fails to load at 16384 context on the same GPU. If I let it use two GPUs, it uses 23650 MiB on device 0 and 2060 MiB on device 1. I used the same settings and the same version of ExLlamaV2 for both quants. Perhaps the different values in the weights simply compress differently?
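One way to test that guess is to compare the total on-disk size of the two quants' weight shards: EXL2 quantization targets an *average* bits-per-weight, so two nominally 8.0bpw quants of different finetunes can land at somewhat different actual sizes. A minimal sketch (the directory paths are placeholders for wherever your local copies live):

```python
from pathlib import Path

def quant_size_gib(model_dir: str) -> float:
    """Sum the sizes of the safetensors shards in an EXL2 quant directory."""
    total = sum(f.stat().st_size for f in Path(model_dir).glob("*.safetensors"))
    return total / 2**30

# Placeholder paths -- point these at your local quant directories.
for d in ("models/Mistral-Small-Instruct-2409-8.0bpw-h8-exl2",
          "models/TheDrummer_Cydonia-22B-v1-8.0bpw-h8-exl2"):
    print(f"{d}: {quant_size_gib(d):.2f} GiB")
```

If the Cydonia quant comes out larger on disk, that alone would account for the lower maximum context on a 24 GB card.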

Just for fun, I decided to see what would happen if I loaded the unquantized models. Both used exactly the same amount of memory, as one would expect.
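If anyone wants to reproduce that unquantized comparison, here is a minimal sketch using transformers and torch. The model IDs are assumptions; substitute whichever checkpoints you are comparing, and note that a 22B model in FP16 needs roughly 44 GB, hence `device_map="auto"` to spread it across available GPUs and CPU RAM:

```python
import torch
from transformers import AutoModelForCausalLM

# Model IDs are assumptions -- substitute the checkpoints you want to compare.
for model_id in ("mistralai/Mistral-Small-Instruct-2409",
                 "TheDrummer/Cydonia-22B-v1"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    # Sum allocated memory across all visible GPUs.
    used = sum(torch.cuda.memory_allocated(i)
               for i in range(torch.cuda.device_count()))
    print(f"{model_id}: {used / 2**30:.2f} GiB on GPU")
    del model
    torch.cuda.empty_cache()
```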
