Question about requests
@RichardErkhov Hello, I'd like to ask if you take exl quant requests. I would probably request models between 7b and 20b. Apologies for asking here, but I wanted to make sure you can take such requests. The reason I'm asking for exl-type quants specifically is that I exclusively use free Colab, and exl is the best format for me because of its model compression and fast tokens-per-second generation on limited Colab hardware.
@Clevyby Hello! I understand you. I would provide exl quants, but I can't really get the process working. It would help if you know how to create the quants (yes, I know I can google it, but the guides mostly either cover a single version like 4.0bpw or just dump a list of commands, which is not really helpful). So if you have a working script to create the quants, I will try to run it,
and it would be even better if there were a way to quantize on the CPU for exl, because I don't have a lot of GPU power for mass quantization.
@RichardErkhov From my understanding, there is a file called 'convert.py' in the official exllamav2 GitHub repo that you can use as the conversion script; you can check the docs on converting here. I'm kind of an amateur in these matters, but I'll see if I can help. What do you mean by it not working? Is there an error when you use it? Also, here's a Colab page on quantizing with exllamav2 for reference. And I'm a bit confused by what you said about them only providing a 4 bpw version; it's just converting a model into an exl2 quant at whatever bpw you choose.
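For example, here is a minimal sketch of how I would drive convert.py from a small Python script for a single quant. The paths are placeholders, and the -i/-o/-cf/-b flags are the ones I've seen in the exllamav2 docs, so double-check them against your checkout:

```python
# Minimal sketch: one exl2 conversion via exllamav2's convert.py.
# Paths are placeholders; flag names (-i, -o, -cf, -b) should be verified
# against the convert.py in your copy of the exllamav2 repo.
import subprocess

MODEL_DIR = "/path/to/hf-model"       # unquantized HF model (fp16 safetensors)
WORK_DIR = "/path/to/work-dir"        # scratch dir for measurement/progress files
OUT_DIR = "/path/to/model-4.25bpw"    # where the finished exl2 quant ends up
BPW = "4.25"                          # target bits per weight

subprocess.run(
    [
        "python", "convert.py",
        "-i", MODEL_DIR,   # input model directory
        "-o", WORK_DIR,    # working directory
        "-cf", OUT_DIR,    # compiled output directory
        "-b", BPW,         # bits per weight
    ],
    check=True,  # raise if convert.py exits with an error
)
```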
> and it would be even better if there were a way to quantize on the CPU for exl, because I don't have a lot of GPU power for mass quantization.
Oh I see, what are your specs? Also, I'm only going to ask for one quant of a model, like a single 4.25 bpw quant.
> And I'm a bit confused by what you said about them only providing a 4 bpw version.
I meant that they only show how to quantize to one bpw, without covering the most popular ones.
> What do you mean by it not working? Is there an error when you use it?
I had some CUDA issues.
> Also, here's a Colab page on quantizing with exllamav2 for reference.
That's a very nice thing, haha.
Do you have a list of the most popular bpw values?
>> and it would be even better if there were a way to quantize on the CPU for exl, because I don't have a lot of GPU power for mass quantization.
> Oh I see, what are your specs? Also, I'm only going to ask for one quant of a model, like a single 4.25 bpw quant.
4x Xeon, 1.2 TB RAM, a random amount of storage (the controller keeps playing jokes on me, so sometimes half of the hard drives don't show up), 2x Tesla M40 24 GB, and a Tesla T4.
You definitely have enough compute power to quantize at least a 34b. As for the docs only showing one quant type: every bpw is the same exl2 quant type, and the process is identical whether you convert to 3.45, 2.22, or 7.1 bpw; you're just specifying a bpw value to up to two decimal places.
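For mass quantization, something along these lines should work: run convert.py once per target and reuse the measurement file after the first pass. The -m flag and the measurement.json reuse are based on my reading of the exllamav2 docs (the measurement pass is the slow part and can be shared between targets), so treat that part as an assumption and verify it against your version of convert.py:

```python
# Sketch of mass quantization: run convert.py once per bpw target and reuse
# the measurement file after the first run. Paths, and the -m/measurement.json
# behaviour, are assumptions to verify against your exllamav2 checkout.
import os
import shutil
import subprocess

MODEL_DIR = "/path/to/hf-model"
WORK_ROOT = "/path/to/work"
OUT_ROOT = "/path/to/quants"
MEASUREMENT = os.path.join(WORK_ROOT, "measurement.json")
BPW_TARGETS = ["3.50", "4.25", "5.00", "6.00"]  # example targets, not a canonical list

for bpw in BPW_TARGETS:
    work_dir = os.path.join(WORK_ROOT, f"work-{bpw}bpw")
    out_dir = os.path.join(OUT_ROOT, f"{bpw}bpw")
    os.makedirs(work_dir, exist_ok=True)

    cmd = ["python", "convert.py",
           "-i", MODEL_DIR, "-o", work_dir, "-cf", out_dir, "-b", bpw]
    if os.path.exists(MEASUREMENT):
        cmd += ["-m", MEASUREMENT]  # skip the measurement pass on later targets
    subprocess.run(cmd, check=True)

    # Keep the first run's measurement so the remaining targets can reuse it.
    if not os.path.exists(MEASUREMENT):
        shutil.copy(os.path.join(work_dir, "measurement.json"), MEASUREMENT)
```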
And there isn't exactly a 'popular' bpw value, since choosing an exl2 quant's bpw depends on what your hardware can handle. For instance, I exclusively use free Colab, so I only have access to 15 GB of VRAM; since I want to use 8k context with a 20b model, the optimal bpw value for me is 4.25. A rule of thumb is that the minimum bpw for a decent exl2 model is around 3.5; 2-3 bpw breaks the model unless you quantize with a custom calibration dataset.
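As a quick sanity check on why 4.25 bpw is about right for 15 GB: the weights alone take roughly params * bpw / 8 bytes, ignoring KV cache and activations, so treat the numbers below as a lower bound on VRAM use rather than an exact figure.

```python
# Back-of-the-envelope weight sizes for a 20b model at a few bpw targets.
# weights_bytes ~= n_params * bpw / 8; KV cache and activations come on top,
# so these figures are a lower bound on actual VRAM use.
def weight_size_gb(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 1e9  # bytes -> GB

for bpw in (3.5, 4.25, 5.0, 6.0):
    print(f"20b at {bpw} bpw ~= {weight_size_gb(20e9, bpw):.1f} GB of weights")

# Approximate output:
# 20b at 3.5 bpw ~= 8.8 GB of weights
# 20b at 4.25 bpw ~= 10.6 GB of weights
# 20b at 5.0 bpw ~= 12.5 GB of weights
# 20b at 6.0 bpw ~= 15.0 GB of weights
```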
As for the CUDA issues, well... I'm kind of an amateur, so the best advice I can give is to pin down the exact error and dig through exllamav2's GitHub issues to see if someone has run into the same problem.