Quants / Compression / INT4
Good evening (or whenever you're reading this)
First of all, I wanted to thank you for this model. From what I've seen, models of this quality either require a lot of compute (like FLUX.1-schnell finetunes) or are simply licensed under a non-permissive license. For me personally, this model is quite unique.
With the provided example code, I was able to get a small WebUI running that let me experiment with different parameters such as width, height, and sampling steps. Since I have an 8GB card (a 4060), running the model at the recommended width * height values took very long, as torch automatically spilled ~2GB into unified CUDA memory. Generating images at something like 512 * 512 was fast, since the model mostly stayed in local VRAM and barely touched unified memory.
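For reference, my setup looked roughly like the sketch below (a minimal example; the repo id and the memory-saving calls are my assumptions, not taken from the official example code):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Assumed repo id for Animagine XL 4.0 -- adjust if you load a local .safetensors instead.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "cagliostrolab/animagine-xl-4.0",
    torch_dtype=torch.float16,
)

# Offload idle submodules to system RAM instead of letting CUDA spill into unified memory.
pipe.enable_model_cpu_offload()
# Decode latents in slices so the VAE doesn't spike VRAM at higher resolutions.
pipe.enable_vae_slicing()

image = pipe(
    "a large tree in a beautiful cyberpunk city",
    width=512,
    height=512,
    num_inference_steps=28,
    guidance_scale=5.0,
).images[0]
image.save("out.png")
```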
I've been trying to get it running with quants (something like INT8 or INT4) for around five hours now, unfortunately without success. Since Python is not my main language (I'm more experienced with C/C++; I switched some years back because I was annoyed with Python's dependencies), I didn't want to attempt anything "new" that isn't yet documented for torch, the HuggingFace libraries, and others. I've tried converting the model to GGUF formats, running the raw .safetensors with koboldcpp, and bitsandbytes (load_in_8bit=True), but all with little success.
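For context, the bitsandbytes attempt looked roughly like the sketch below (reconstructed from memory, assuming a recent diffusers build with bitsandbytes support; since 8-bit bitsandbytes only replaces Linear layers, the savings on SDXL's conv-heavy UNet may be limited, which might explain my poor results):

```python
import torch
from diffusers import BitsAndBytesConfig, StableDiffusionXLPipeline, UNet2DConditionModel

repo = "cagliostrolab/animagine-xl-4.0"  # assumed repo id

# Quantize only the UNet to 8-bit; text encoders and VAE stay in fp16.
unet = UNet2DConditionModel.from_pretrained(
    repo,
    subfolder="unet",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
)

# The quantized UNet is placed on the GPU by bitsandbytes; move the rest there too.
pipe = StableDiffusionXLPipeline.from_pretrained(
    repo, unet=unet, torch_dtype=torch.float16
).to("cuda")
```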
I understand that you're under no obligation to quantize this model, and it's already very generous that you're providing a model of this quality under this license, but I would like to ask for advice or for quants I may have overlooked (maybe on CivitAI or other platforms?). Specifically, I'm running Windows 10 with CUDA V11.9, and, as mentioned, an NVIDIA 4060 8GB with 22GB of unified DDR4 memory and 48GB of DDR4 RAM in total.
Thanks again, and have a great day.
Good morning from Indonesia.
Unfortunately, we're not experts in quantization, and I think most SDXL models still use either bf16 or fp16, like ours. However, looking at your specs, I think you can run the model faster with reForge. My friend was able to generate an 832*1216 image with 28 steps in only 50 seconds on an RTX 3050 Laptop with 4GB of VRAM, which is a generation behind yours and has half your VRAM.
Please give it a try and tell me if it helps.
Thank you.
Thanks for the quick reply. I will definitely give it a try.
From the GitHub README, I can already see that it has options for FP8 instead of BF16 and that it automatically uses the xformers library.
I'm currently installing the dependencies, and I will leave another comment with what worked for me in case others are experiencing performance problems as well.
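Side note for anyone sticking with plain diffusers instead of a WebUI: the xformers part can also be enabled by hand. A minimal sketch, assuming xformers is installed and the repo id below is correct:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "cagliostrolab/animagine-xl-4.0", torch_dtype=torch.float16
).to("cuda")

# Memory-efficient attention through xformers (requires `pip install xformers`).
pipe.enable_xformers_memory_efficient_attention()
# Trade a bit of speed for lower peak VRAM during attention and VAE decoding.
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
```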
Hello again.
Unfortunately, I was not able to get the reForge WebUI you recommended running, but I'm still grateful for the suggestion.
However, after some experimenting, I was able to build a C++ project and convert animagine-xl-4.0-opt.safetensors into both Q4_0 and Q8_0 GGUF files, reducing the memory footprint from ~10GB to ~4GB and ~6GB, respectively. I will upload those GGUF files to a repo here on HuggingFace shortly, for anyone facing similar performance and VRAM issues. These quants allow a 1024^2 pixel image to generate in around one minute, as you stated for reForge (suggesting it also uses some form of quantization). You can see further logs below.
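Side note before the logs: if you want to sanity-check what a converted file actually contains, the `gguf` Python package can list the tensor quantization types. A rough sketch, assuming `pip install gguf` and a hypothetical file name:

```python
from collections import Counter

from gguf import GGUFReader

# Hypothetical output name from my conversion -- point this at your own GGUF file.
reader = GGUFReader("animagine-xl-4.0-opt-Q4_0.gguf")

# Count how many tensors ended up in each quantization format (Q4_0, F16, F32, ...).
type_counts = Counter(t.tensor_type.name for t in reader.tensors)
total_bytes = sum(int(t.n_bytes) for t in reader.tensors)

print(type_counts)
print(f"total tensor data: {total_bytes / 1024**3:.2f} GiB")
```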
For all generations, the following settings were used (a rough diffusers equivalent is sketched right after this list):
- the 4060 8GB card
- 28 sampling steps
- a seed of 0
- the prompt "a large tree in a beautiful cyberpunk city", followed by the prompt appendix from your README.md
- the negative prompt appendix
- a guidance scale of 5
- the Euler a sampler
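For completeness, here is a rough diffusers equivalent of those settings (not the code I actually used; my runs went through the GGUF files and a C++ pipeline, and the repo id plus the prompt placeholders below are assumptions):

```python
import torch
from diffusers import EulerAncestralDiscreteScheduler, StableDiffusionXLPipeline

# Assumed repo id; adjust to wherever you load the model from.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "cagliostrolab/animagine-xl-4.0", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

# Euler a sampler, matching the settings listed above.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe(
    # Placeholders: append the actual prompt/negative-prompt appendices from the README.
    prompt="a large tree in a beautiful cyberpunk city, <prompt appendix>",
    negative_prompt="<negative prompt appendix>",
    num_inference_steps=28,
    guidance_scale=5.0,
    width=1024,
    height=1024,
    generator=torch.Generator(device="cuda").manual_seed(0),
).images[0]
image.save("benchmark_reference.png")
```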
| Quant | Memory usage | Width * Height | Processing time | Result |
|---|---|---|---|---|
| INT4 / Q4_0 | ~3.8GB | 1024 * 1024 | ~1min 2s | image.png |
| INT4 / Q4_0 | ~3.6GB | 1152 * 896 | ~49s | image.png |
| INT4 / Q4_0 | ~3.6GB | 896 * 1152 | ~34s | image.png |
| INT8 / Q8_0 | ~5.2GB | 1024 * 1024 | ~46s | image.png |
| INT8 / Q8_0 | ~5.2GB | 1152 * 896 | ~49s | image.png |
| INT8 / Q8_0 | ~5.2GB | 896 * 1152 | ~34s | image.png (may be considered NSFW, please open only if you're comfortable with it) |
Please note that the quality of these images does not represent ALL use cases; the model is obviously trained for character-specific generation, not trees.
Values are only for performance benchmarking on a mid-range GPU.
As you can see, at least on an RTX 4060 8GB, both the INT4 and INT8 versions stay under the 8GB VRAM limit, and both quants run at around the same speed even though INT8 uses roughly 1.5GB more memory.
I'm no expert, but I'd recommend the INT4 version for 4GB cards and INT8 for 8GB cards, since the difference in generation speed is almost unnoticeable but the latter provides noticeably better quality.
Thanks for the help!
Oops, I forgot to tell you that reForge needs Python 3.11.x. You have to create the venv with this specific Python version manually, because 3.12 and 3.13 are not supported.
Anyways, can't wait to try the quants you made!