Load 13B model with 8-bit/4-bit quantization to support more hardware
Hi, LLaVA author here. Thank you for contributing the Hugging Face Space.
It would be better to keep the model version consistent with the official demo (13B). Quantization can be used to support more hardware; see the discussion here.
I have updated the support for quantization and added instructions on controlling the quantization bits via the environment variable `bits`.
By default, it is set to 8-bit to support running on an A10G (this Space). It can also be set to 4-bit to run on the smaller T4-medium (15G). The quantization bits for the current model are indicated by the model name in the model selector dropdown.
Thanks.
You can load the model with 8-bit or 4-bit quantization to make it fit on smaller hardware. Set the environment variable `bits` to control the quantization.
Recommended configurations:
Hardware | Bits |
---|---|
A10G-Large (24G) | 8 (default) |
T4-Medium (15G) | 4 |
A100-Large (40G) | 16 |
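For illustration, here is a minimal sketch of how `bits` could drive a quantized load. The Space uses LLaVA's own loading code; the model id below and the `transformers`-style loader are placeholders, not the actual implementation:

```python
import os

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Read the quantization setting from the `bits` environment variable,
# defaulting to 8-bit as on the A10G Space.
bits = int(os.environ.get("bits", 8))

# Map the requested bits to a bitsandbytes quantization config;
# bits == 16 falls through to plain fp16 with no quantization.
quant_kwargs = {}
if bits == 8:
    quant_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
elif bits == 4:
    quant_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/llava-13b",  # placeholder model id
    torch_dtype=torch.float16,
    device_map="auto",
    **quant_kwargs,
)
```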
Thank you @liuhaotian!
Previously, I also tried to use 4-bit for it, but there was an issue stating that `bitsandbytes` was not configured correctly in the Docker environment of the Space, so it was not possible to use it. Did you have a chance to test with the changes in this PR?
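For anyone debugging the same thing, a quick sanity check along these lines shows whether `bitsandbytes` can see CUDA inside the container (the exact failure message varies by version):

```python
import torch

print("CUDA available:", torch.cuda.is_available())

try:
    import bitsandbytes as bnb  # raises at import time when misconfigured
    print("bitsandbytes version:", bnb.__version__)
except Exception as exc:
    print("bitsandbytes is not usable:", exc)
```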
Yes, I have tested that on a T4-medium here.
Note: the Space is currently compiling/downloading the model as I am trying to see if we can skip the preload part, but it worked before this debugging (which is the version I committed).
Ah, also: do you think the instructions above take up too much vertical space? We can change that if it can be made to look better.
tbh, I also dislike the preload part due to:
- very long build times
- not being able to cache it

but I mainly did it so that when the Gradio app launches, there is always "a model". If we remove the preload part it will still work, since the worker will download the model in the background (roughly the shape sketched below).
However, the user will see an empty dropdown with no information about the download status, and that felt like bad UX (open to discussing potential solutions :)
Also: I tried the Docker option and was able to cache the downloads, but it couldn't find CUDA, so I gave up on that.
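For concreteness, the no-preload path amounts to something like this sketch (the model id and names are placeholders, not the Space's actual worker code): start the download in a background thread at app startup instead of baking the weights into the image.

```python
import threading

from huggingface_hub import snapshot_download

MODEL_ID = "your-org/llava-13b"  # placeholder model id

# Flag the UI can poll to know whether the weights have arrived.
download_done = threading.Event()

def download_model() -> None:
    # Fetch the weights into the local cache; the Gradio app can start
    # serving immediately while this runs in the background.
    snapshot_download(MODEL_ID)
    download_done.set()

# Daemon thread so the download never blocks app startup or shutdown.
threading.Thread(target=download_model, daemon=True).start()
```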
The transposed version takes less vertical space but is less intuitive, wdyt?
Recommended configurations:

Hardware | Bits |
---|---|
A10G-Large (24G) | 8 (default) |
T4-Medium (15G) | 4 |
A100-Large (40G) | 16 |

versus the transposed layout:

Hardware | A10G-Large (24G) | T4-Medium (15G) | A100-Large (40G) |
---|---|---|---|
Bits | 8 (default) | 4 | 16 |
it's looking great! updated the PR to adopt the transposed layout
thanks!
One more downside of the preload: after removing it, the Space works even on the smallest T4-small.
wow, thanks for trying that!
I will look into adding a "downloading" status to the model dropdown component so that the user knows about the model download in progress. After that, we can remove the preload.
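One rough shape for that, as a sketch (model names are placeholders, and the dropdown-update call differs across Gradio versions):

```python
import threading

import gradio as gr

# Stand-in for the flag set by the background download worker.
download_done = threading.Event()

def dropdown_choices() -> list[str]:
    # Show a status placeholder until the background download finishes.
    if download_done.is_set():
        return ["llava-13b (8-bit)"]  # placeholder model name
    return ["downloading llava-13b ..."]

with gr.Blocks() as demo:
    model_selector = gr.Dropdown(choices=dropdown_choices(), label="Model")
    refresh = gr.Button("Refresh model list")
    # Re-query the choices on demand; in older Gradio versions this would
    # be gr.Dropdown.update(choices=...) instead.
    refresh.click(lambda: gr.Dropdown(choices=dropdown_choices()),
                  outputs=model_selector)

demo.launch()
```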
that sounds great, thank you!