Is it possible to create a smaller version, like 8-bit or FP16?

Opened by BriggoBoy

Would be great.

Assuming this has an architecture like SD3, it would also be nice to have a version separate from the text encoders, so that files for finetunes and so on can be much smaller.

I third these statements. It's a super good model, it really follows the prompt exactly, BUT man is it large lol.

@BriggoBoy @sanguivore @colinw2292
diffusers supports quantization now: https://huggingface.co/blog/quanto-diffusers

Memory savings are very impressive. For example:
PixArt-Sigma 1024x1024 (a DiT model similar to AuraFlow) usually takes 12 GB of VRAM, but with 8-bit quantization of the transformer and the text encoder it takes just 5 GB! The above also supports AuraFlow, SD3, and a few others.
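For reference, a minimal sketch along the lines of that quanto-diffusers blog post, quantizing the two largest components to 8-bit (the model ID and the qfloat8 choice follow the blog; the prompt is just a placeholder):

```python
import torch
from diffusers import PixArtSigmaPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

# Quantize the transformer and text encoder to 8-bit, then freeze the weights
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)
quantize(pipe.text_encoder, weights=qfloat8)
freeze(pipe.text_encoder)

image = pipe("a photo of an astronaut riding a horse").images[0]
```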

Even just splitting the model into its parts might help, as SD3 is similar in size but runs far smoother and faster. That could come down to specifics I am not familiar with, of course, but still. Every image I generate, I feel like my GPU is producing X-rays.

@RustyRuins The actual image-generation part of SD3 is just 2B; it's actually smaller than SDXL. The reason SD3 seems so big is its T5 text encoder. However, since you only need to run the T5 text encoder once per image and most of the work is done by the image-generation part, it is pretty fast, faster than SDXL.

On the other hand, the actual image-generation part of AuraFlow is roughly 6B. AuraFlow's T5 text encoder is a lot smaller than SD3's, but since most of the work is done by the image-generation part, it's a lot slower than SD3 and SDXL.
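If you want to check that split yourself, here is a quick sketch (assuming the fal/AuraFlow repo and the AuraFlowPipeline from a recent diffusers release) that prints the parameter count of each major component:

```python
import torch
from diffusers import AuraFlowPipeline

pipe = AuraFlowPipeline.from_pretrained("fal/AuraFlow", torch_dtype=torch.float16)

# Count parameters per component to see where the size actually lives
for name in ("transformer", "text_encoder", "vae"):
    module = getattr(pipe, name, None)
    if module is not None:
        params = sum(p.numel() for p in module.parameters())
        print(f"{name}: {params / 1e9:.2f}B parameters")
```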

I would highly recommend using torch.compile with AuraFlow and SD3, since that can boost speed by a whopping 4x! You could also use quantization as I showed above. To save even more memory, I believe you can use the taesdxl VAE (https://huggingface.co/madebyollin/taesdxl), which should save something like 2-3 GB of VRAM. With everything combined, you could maybe run AuraFlow with 8 GB of VRAM.
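A rough sketch of both ideas in diffusers (the compile settings and the AutoencoderTiny swap follow the usual diffusers recipes; I have not tested them on AuraFlow specifically, so treat this as an assumption):

```python
import torch
from diffusers import AuraFlowPipeline, AutoencoderTiny

pipe = AuraFlowPipeline.from_pretrained(
    "fal/AuraFlow", torch_dtype=torch.float16
).to("cuda")

# Swap in the tiny VAE to cut decoder memory (quality may drop slightly)
pipe.vae = AutoencoderTiny.from_pretrained(
    "madebyollin/taesdxl", torch_dtype=torch.float16
).to("cuda")

# Compile the transformer; the first call is slow (warm-up), later calls are much faster
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

image = pipe("a watercolor fox in a forest").images[0]
```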

@YaTharThShaRma999 Thank you for the explanation, it makes sense now.

I am running everything with ComfyUI, so I am not sure about the torch.compile part, or what it actually entails.
But I did try taesdxl and it did not work right; it resulted in heavy artifacting. This VAE (https://huggingface.co/madebyollin/sdxl-vae-fp16-fix), however, works perfectly and also claims to be a lot more resource-efficient.
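For anyone on diffusers rather than ComfyUI, loading that fixed VAE into a pipeline is short (a sketch; pairing it with AuraFlowPipeline is my assumption based on this thread, not something the model card states):

```python
import torch
from diffusers import AuraFlowPipeline, AutoencoderKL

# Load the fp16-safe SDXL VAE and attach it when building the pipeline
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
pipe = AuraFlowPipeline.from_pretrained(
    "fal/AuraFlow", vae=vae, torch_dtype=torch.float16
).to("cuda")
```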

@RustyRuins Could you share your workflow?

@BriggoBoy Sure thing. Do you mind custom node packs? The workflows I usually build and publish use a fair few, mostly popular ones most people have already anyway, but also some I find interesting and want to support by including them.
Or would you rather have a bare-bones version with default nodes as far as possible?
The workflow is not quite finished yet, as I have barely gotten started.

@RustyRuins I do not mind custom node packs, and take your time. Thanks in advance!

@RustyRuins thank you very much!
