Will any 120b model currently fit on a single 24GB VRAM card through any app I can run on PC? (aka 4090)
Just asking before I waste my time. Thank you!
No, unfortunately not. Your best bet is to go with a 2.4bpw 70B miqu exl2 quant instead. You might squeeze in the 2.65bpw version, but you're already severely limited on context even at 2.4bpw. Make sure you reduce max_seq_len so it fits on your card:
https://huggingface.co/models?sort=trending&search=LoneStriker%2Fmiqu-1-70b
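If you'd rather script the load than use a UI, here's a minimal sketch using the exllamav2 Python package with a reduced max_seq_len; the model path and context figure are just placeholders for whatever quant you download, so treat the details as assumptions:

```python
# Minimal sketch: load an exl2 quant with a reduced context window so the
# weights + KV cache fit in 24 GB of VRAM. Assumes the exllamav2 package;
# the model directory below is a hypothetical local path.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/miqu-1-70b-2.4bpw-h6-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 4096  # lower this until the model fits on the card

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # fills available VRAM as it loads

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("Tell me a short story.", settings, 200))
```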
Thanks. Sadly I find 70b models to be way too slow...I can only imagine how slow a 120b model would be IF it even worked on my system.
I've gotten used to just using 20-23b models for the most part. I wish the tech existed to release exl2 models in 34b sizes, like we used to have for older non exl2 models in the past.
A 2.4bpw 70B model running on a 4090 should be crazy fast; the context length will be small, though, and the quality of inference lower than if you could run a 4.0bpw model. There are good 34B models that run on exl2. Some of the Yi-based fine-tunes at 34B are decent; you can fit a 4.0bpw or lower quant on a single 4090. The Mixtral 8x7B models are also very good; you can fit a 3.0bpw quant in 24 GB of VRAM. There are also good 2x10b merges of the SOLAR-10b model that fit on a single 4090. So, lots of choices and good models for your setup.
What does the bpw even mean? The higher the smarter?
Bits per weight: how many bits are used on average to quantize the weights. The higher the number, the lower the error/divergence from the base fp16 model (the base would be considered "16.0 bpw"). Extreme quantization will effectively lobotomize the model and make it "dumber", but you can go surprisingly low and still have a coherent model for things like creative writing and storytelling (2.4bpw, for example).
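To make the size trade-off concrete, a rough back-of-the-envelope for the weights alone is params x bpw / 8 bytes (this ignores the KV cache and loader overhead, so treat the numbers as lower bounds):

```python
# Rough VRAM needed just for the weights: n_params * bpw / 8 bytes.
# Ignores the KV cache, activations, and loader overhead.
def weight_gib(n_params_billions, bpw):
    return n_params_billions * 1e9 * bpw / 8 / 1024**3

print(f"{weight_gib(70, 2.4):.1f} GiB")    # ~19.6 GiB -> tight but possible on a 24 GB card
print(f"{weight_gib(70, 4.0):.1f} GiB")    # ~32.6 GiB -> too big for a single 4090
print(f"{weight_gib(120, 2.18):.1f} GiB")  # ~30.5 GiB -> why 120B spills into system RAM
```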
Tons of models work with exl2. I use GPTQ models with the ExLlamav2 loader. I find very few 70B models will fit on my 3090. I've been playing with 8x7B models with max_seq_len lowered to 16k and getting around 25 tokens a sec.
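The reason lowering max_seq_len frees up so much room is that the KV cache grows linearly with context. A rough sketch (the layer/head numbers are the commonly cited Mixtral-8x7B config, so treat them as assumptions):

```python
# Rough KV-cache size: 2 (keys + values) * layers * kv_heads * head_dim
# * seq_len * bytes per value. Config values assumed for Mixtral 8x7B
# (32 layers, 8 KV heads, head_dim 128); adjust for other models.
def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val / 1024**3

print(f"{kv_cache_gib(32768):.1f} GiB")  # ~4.0 GiB at the full 32k context (fp16 cache)
print(f"{kv_cache_gib(16384):.1f} GiB")  # ~2.0 GiB at 16k; that VRAM goes back to the weights
```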
YAY! After reinstalling the Nvidia drivers, CUDA, and even cuDNN, performance is way better.
The 34B and 70B models @2.4bpw I was testing at context length 4096 now produce results at 19.6-27.13 tokens/sec.
With LoneStriker_Noromaid-v0.1-mixtral-8x7b-Instruct-v3-2.4bpw-h6-exl2 (WOW, @context 32k!!) I get 47-52 tokens/sec!
With a 7B Silicon-Maid model I get: 32-35 tokens/sec (only that for a 7b model?...interesting...)
With a 23B model I get: 23-26 tokens/sec
With 20B models (most of my models), for example my usual favorite Kooten_PsyMedRP-v1-20B-4bpw-h8-exl2, I get: 34-40 tokens/sec
A 20B model at 6bpw gives about 19-25 tokens/sec
Only one model I have KILLS performance, and that is TeeZee_Kyllene-57B-v1.0-bpw3.0-h6-exl2 (context 4096)
I get frustratingly slow performance from this one: 1.11 tokens/sec
But I'm happy it mostly works great now!!
Just a post for anyone who sees this... I WAS able to run LoneStriker_goliath-120b-2.18bpw-h6-exl2 on my 4090 card, but that one is only 2.18bpw, which is likely why it fits.
Not fast by any means though, lol
Curious how much system RAM you have? And what are you using to run the models? I tried on my 3090 with 48 GB of RAM, and it eats all of it up in Oobabooga. Thanks in advance!
I have 96GB (DDR5) RAM, I guess that is why it works?
More than likely. You could check how much RAM you have free before and after loading the model.
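One low-effort way to do that check is to watch system RAM from a separate terminal while the model loads (this sketch assumes the psutil package; any task manager works just as well):

```python
# Print system RAM usage every couple of seconds; run this in a separate
# terminal while the model loads in the webui, then compare before/after.
# Assumes the psutil package (pip install psutil).
import time
import psutil

while True:
    vm = psutil.virtual_memory()
    used_gb = (vm.total - vm.available) / 1024**3
    print(f"used: {used_gb:.1f} GB / total: {vm.total / 1024**3:.1f} GB")
    time.sleep(2)
```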
I tested this. My system (Win11) was using 30GB of RAM when I started loading it and it got up to 70GB, so the model took about 40GB... I would say you are very close to being able to load it, but your OS is no doubt using up more than 8GB of RAM, hence the issue...
Also, with it loaded into memory but not in use at all, my used system memory stays at 41.8GB.
Also, it's loaded with the 8-bit cache enabled, via the ExLlamav2_HF loader.
But I'll warn you it's basically unusably slow... I'm usually waiting several seconds PER word, and it sometimes stops responding. I think it's cool that it loads at all, but it's not really something you'll enjoy until the technology gets better somehow.
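On the 8-bit cache point above: if you're scripting against the exllamav2 package directly rather than using the webui's ExLlamav2_HF loader, my understanding is the equivalent is swapping in the 8-bit cache class, which stores keys/values in 8 bits instead of fp16 and roughly halves cache VRAM. A short sketch (the model path is a placeholder, and the class/argument names are assumptions based on the package as I know it):

```python
# Sketch: same load pattern as earlier, but with an 8-bit KV cache.
# The model directory is a hypothetical local path.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit

config = ExLlamaV2Config()
config.model_dir = "models/goliath-120b-2.18bpw-h6-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 4096

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # 8-bit KV cache instead of fp16
model.load_autosplit(cache)
```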