Which of these 34B model BPWs will fit in a single 24GB card's (4090) VRAM?

Opened by clevnumb

Can I fit a 5bpw or 6bpw on a single 4090?

What is the highest one that fits, and at what context ranges? I can't find this info anywhere.

Thank you.

You can estimate the VRAM requirements by just looking at the size of the model files themselves. The 4.0bpw models will comfortably fit on a single 4090. 5.0bpw will likely not fit; I just tried loading it on an empty 4090 and it runs OOM. 5.0 - 8.0bpw will need two cards. I should probably generate a 4.65bpw model to get slightly more bits into a model that still fits on a single 4090.
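As a rough illustration of that file-size estimate (my own back-of-the-envelope sketch, not exact numbers; actual usage also includes the KV cache, activations, and driver overhead, which add several GB):

```python
# Rough weight-memory estimate for an exl2 quant: parameters * bits-per-weight / 8.
# Ignores KV cache, activations, and CUDA/driver overhead.
PARAMS = 34e9  # ~34B parameters

for bpw in (4.0, 4.65, 5.0, 6.0, 8.0):
    weights_gib = PARAMS * bpw / 8 / 1024**3
    print(f"{bpw:>4} bpw -> ~{weights_gib:.1f} GiB of weights")

# 4.0 bpw -> ~15.8 GiB  (fits a 24GB card with room for context)
# 5.0 bpw -> ~19.8 GiB  (tight once cache and overhead are added)
```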

That would be awesome if you could do that...

Thank you for the detail. On such a system, is this model best loaded at a 4096 context? With a 1.75 alpha value? (I have 96GB of RAM; not sure if that matters beyond loading the model.)

I have not used this model extensively. Your RAM won't matter other than allowing you to load the model into your GPU's VRAM. If you stick with the default 4096 context length, no alpha value is needed. If you extend higher, you'll need to start adjusting the alpha; there's a formula for it, but the ooba slider may not actually let you set it high enough with the default limits. This model supposedly supports up to 200K context, but I've not had great success with long-context models in the past.
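For reference, one commonly cited form of that formula (my own note, not something stated in this thread) is the NTK-aware RoPE scaling relation, where the alpha value rescales the rotary embedding base. A tiny sketch, assuming a head dimension of 128 as in most Llama/Yi-style models:

```python
# NTK-aware RoPE scaling (commonly cited approximation, assumed here):
#     base' = base * alpha ** (d / (d - 2))
# where d is the attention head dimension. Picking alpha for a given target
# context length is still largely empirical.
def scaled_rope_base(base: float = 10000.0, alpha: float = 1.0, head_dim: int = 128) -> float:
    return base * alpha ** (head_dim / (head_dim - 2))

print(scaled_rope_base(alpha=1.0))   # 10000.0 -> no scaling at the native 4096 context
print(scaled_rope_base(alpha=1.75))  # ~17656  -> base used when alpha_value = 1.75
```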

Thanks, I'll play around with it. I've never gotten anything working well beyond 4096 at all... :-\

I can load a 5.0bpw model on a 4090 at 4096 context length and it generates at a good speed. This is on Windows 11 too, so the OS and browser are consuming more VRAM than they would on Linux (I assume). Note that the 8-bit cache does wonders for VRAM consumption:

[screenshot of VRAM usage, close to the 24GB limit]

As you can see, it's very close to the VRAM limit. I have experienced slowdowns when the Nvidia driver spills the model over into system RAM because I had too many browser tabs open... :) But in my example image that isn't happening, as I have no other tabs or programs consuming VRAM.
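For anyone loading outside the ooba UI, here's a minimal sketch of what turning on the 8-bit cache looks like with the exllamav2 Python API (the model path is a placeholder, and the exact class and method names should be checked against your installed exllamav2 version):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/yi-34b-5.0bpw-exl2"  # placeholder path to the exl2 quant
config.max_seq_len = 4096                        # default context length discussed above
config.prepare()

model = ExLlamaV2(config)
model.load()  # single-GPU load

# FP8 KV cache: roughly halves cache VRAM versus the default FP16 cache.
cache = ExLlamaV2Cache_8bit(model)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
print(generator.generate_simple("Hello,", settings, num_tokens=32))
```

Swapping ExLlamaV2Cache_8bit back to the regular ExLlamaV2Cache should be the only change needed to return to the FP16 cache.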

Thanks for the info. For future 34B quants, I've added a 4.65bpw version; the first one is added here, for my fine-tune of Yi-34B with the Spicyboros dataset.

Yep, 4.65bpw is good as it leaves a bit of wiggle room. Thanks for all the exl2 quants btw, have downloaded many!

Yes thanks a lot for all the hard work!

EEPOS: Does setting the cache to 8-bit lower the quality of the output? I mean, are there any drawbacks to having that enabled?

According to Turboderp, there's no quality loss. You save VRAM and trade off a bit of inference speed.
