4bpw
Hi, is it possible for you to make an exact 4bpw quant?
Sure, what are you trying to fit it on where 4bpw would fit better?
rtx 4060 ti 16gb
I'm currently using a 4bpw version of the base Buttercup model and it fits perfectly on my card with the max context (32k)
Hmm, that seems surprising. By my math, 32k context at 4bpw should take ~16.7 GB, but I'll make it and check if I'm calculating wrong
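(For reference, a rough sketch of where an estimate like ~16.7 GB can come from. The ~24.2B total-parameter count and the attention shape below (32 layers, 8 KV heads, head dim 128) are assumptions for illustration, not figures from the thread.)

```python
# Back-of-envelope VRAM estimate for a 4x7B MoE at 4 bpw with a 32k fp16 cache.
# Assumed values: ~24.2B total params, 32 layers, 8 KV heads, head dim 128.

params = 24.2e9                       # approx. total params of a 4x7B merge (assumption)
weights_gb = params * 4 / 8 / 1e9     # 4 bits per weight -> bytes -> GB

layers, kv_heads, head_dim = 32, 8, 128
ctx = 32 * 1024                       # 32k context
# K and V (x2), fp16 (2 bytes per value), per token, summed over layers
kv_bytes_per_token = 2 * kv_heads * head_dim * 2 * layers
cache_gb = ctx * kv_bytes_per_token / 1e9

print(f"weights ~{weights_gb:.1f} GB + fp16 cache ~{cache_gb:.1f} GB "
      f"= ~{weights_gb + cache_gb:.1f} GB, plus some activation overhead")
# -> weights ~12.1 GB + fp16 cache ~4.3 GB = ~16.4 GB, plus some activation overhead
```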
@Nephilim 4.0 is up: https://huggingface.co/bartowski/Buttercup-4x7B-V2-laser-exl2/tree/4_0
let me know if it works and what your final VRAM usage looks like; if it makes more sense for a 16GB card, I'll add it for future quants of this size
Oh, thanks, I will test it
Worked very well here, thanks again.
100% sure, I've disabled system memory fallback; it runs at ~10 tokens/s here
fascinating... what's your setup? I wonder if TGWUI adds some overhead that I'm not accounting for
I'm using the latest version of oobabooga, with the 8-bit cache option enabled
ahhhhh, the 8-bit cache explains it!
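(To make that concrete, a quick follow-up to the earlier sketch, using the same assumed shapes, showing how an 8-bit cache halves the cache term; the totals are estimates only.)

```python
# Same assumed shapes as the earlier sketch (32 layers, 8 KV heads, head dim 128).
# An 8-bit cache stores 1 byte per value instead of 2, halving the cache term.

layers, kv_heads, head_dim, ctx = 32, 8, 128, 32 * 1024
weights_gb = 12.1                     # ~4 bpw weights, from the estimate above

for name, bytes_per_value in [("fp16 cache", 2), ("8-bit cache", 1)]:
    cache_gb = ctx * 2 * kv_heads * head_dim * bytes_per_value * layers / 1e9
    print(f"{name}: ~{cache_gb:.1f} GB -> total ~{weights_gb + cache_gb:.1f} GB")
# fp16 cache:  ~4.3 GB -> total ~16.4 GB  (spills past 16 GB once overhead is added)
# 8-bit cache: ~2.1 GB -> total ~14.2 GB  (fits on a 16 GB card)
```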