Which of these 34B model BPW quants will fit in a single 24GB card's (4090) VRAM?
Can I fit a 5bpw or 6bpw on a single 4090?
What is the highest one that fits, and at what context lengths? I can't find this info anywhere.
Thank you.
You can estimate the VRAM requirements by just looking at the size of the model files themselves. The 4.0bpw models will comfortably fit on a single 4090. 5.0bpw will likely not fit; I just tried loading it on an otherwise empty 4090 and it runs OOM. 5.0 - 8.0bpw will need two cards. I should probably generate a 4.65bpw version to squeeze slightly more bits into a model that still fits on a single 4090.
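For a rough back-of-the-envelope check, weight size scales linearly with bits per weight. Here's a small Python sketch; the 34B parameter count is approximate, and the note about cache/activation overhead is an assumption on my part rather than a measurement:

```python
# Rough VRAM estimate for an EXL2 quant: quantized weight size ~= params * bpw / 8.
# 34e9 is an approximate parameter count; real usage also needs room for the
# KV cache, activations, and CUDA/driver overhead on top of this.

def estimate_weights_gib(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bpw / 8 / 1024**3

for bpw in (4.0, 4.65, 5.0, 6.0):
    print(f"{bpw:.2f} bpw -> ~{estimate_weights_gib(34e9, bpw):.1f} GiB of weights")
```

That puts 4.0bpw around 16 GiB of weights, 4.65bpw around 18 GiB, and 5.0bpw near 20 GiB, which is why 5.0bpw is borderline on a 24GB card once the cache and everything else is loaded.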
That would be awesome if you could do that...
Thank you for the detail. Is this model, when used on such a system, best loaded at 4096 context? With a 1.75 alpha value? (I have 96GB of RAM; not sure if that matters beyond loading the model.)
I have not used this model extensively. Your RAM won't matter other than allowing you to load the model into your GPU's VRAM. If you stick with the default 4096 context length, no alpha value needed. If you extend higher, you'll need to start adjusting the alpha; there's a formula for alpha, but the ooba slider may not actually let you set it high enough with default limits. This model supposedly supports up to 200K context, but I've not had great success with long-context models in the past.
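For what it's worth, here is the theoretical NTK-aware relation between alpha and the context ratio as a small Python sketch. The head dimension of 128 and the native 4096 context are assumptions about this model family, and in practice people often set alpha a bit higher than the theory suggests:

```python
# NTK-aware RoPE scaling: raising the rope base by a factor alpha stretches the
# lowest frequency by roughly alpha^((d-2)/d), so covering a context ratio r
# needs alpha ~= r^(d/(d-2)). head_dim=128 and native_ctx=4096 are assumptions.

def ntk_alpha(target_ctx: int, native_ctx: int = 4096, head_dim: int = 128) -> float:
    ratio = target_ctx / native_ctx
    return ratio ** (head_dim / (head_dim - 2))

for ctx in (4096, 8192, 16384):
    print(f"{ctx} tokens -> alpha ~{ntk_alpha(ctx):.2f}")
```

With head_dim = 128 this works out to roughly alpha ≈ ratio (about 2.0 for 8K, 4.1 for 16K), which is why the slider's default range can run out quickly if you push the context far.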
Thanks, I'll play around with it. I've never gotten anything working well beyond 4096 at all... :-\
I can load a 5.0bpw model on a 4090 at 4096 context length and it generates at good speed. This is on Windows 11, too, so the OS and browser are consuming more VRAM than they would on Linux (I assume). Note that the 8-bit cache does wonders for VRAM consumption:
As you can see, it's very close to the VRAM limit. I have experienced slowdowns from the Nvidia driver spilling the model into system RAM when I had too many browser tabs open... :) But in my example image this isn't happening, as I have no other tabs/programs consuming VRAM.
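To put a number on the 8-bit cache saving, here's a rough KV-cache size calculation in Python. The Yi-34B shape values (60 layers, 8 KV heads, head dim 128) are what I recall from the model config and should be treated as assumptions:

```python
# Approximate KV-cache size: 2 (keys + values) * layers * context * kv_heads
# * head_dim * bytes per element. Shape values are assumed from Yi-34B's config.

def kv_cache_gib(ctx: int, n_layers: int = 60, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem / 1024**3

print(f"FP16 cache  @ 4096 ctx: ~{kv_cache_gib(4096):.2f} GiB")
print(f"8-bit cache @ 4096 ctx: ~{kv_cache_gib(4096, bytes_per_elem=1):.2f} GiB")
```

At 4096 context that's only about half a GiB saved, but it grows linearly with context length, and when you're this close to the 24GB limit even half a GiB can be the difference between fitting and OOM.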
Thanks for the info. For future 34B quants, I've added a 4.65bpw version; the first one is here, for my fine-tune of Yi-34B with the Spicyboros dataset.
Yep, 4.65bpw is good as it leaves a bit of wiggle room. Thanks for all the exl2 quants btw, have downloaded many!
Yes, thanks a lot for all the hard work!
EEPOS: Does enabling the 8-bit cache lower the quality of the output? I mean, are there any drawbacks to having it enabled?
According to Turboderp, there's no quality loss. You save VRAM and trade off a bit of inference speed.
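If anyone wants to try it outside of ooba, here's a minimal sketch of turning it on with exllamav2's Python API; the model path is hypothetical and the exact loading flow is from memory, so treat it as an assumption rather than a recipe. In text-generation-webui it should just be the 8-bit cache option on the ExLlamav2 loader.

```python
# Minimal sketch (from memory) of loading an EXL2 model with the 8-bit KV cache.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/yi-34b-exl2-5.0bpw"  # hypothetical local path
config.prepare()

model = ExLlamaV2(config)
model.load()  # load the quantized weights onto the GPU

cache = ExLlamaV2Cache_8bit(model)  # 8-bit cache instead of the default FP16 ExLlamaV2Cache
tokenizer = ExLlamaV2Tokenizer(config)
```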