Flux.1-dev in a few experimental custom formats, mixing tensors in Q8_0, fp16, and fp32. Converted from black-forest-labs' original bf16 weights.

Motivation

Flux's weights were published in bf16. Converting them to fp16 is slightly lossy, whereas converting to fp32 is lossless. I experimented with mixed tensor formats to see whether that would improve quality.
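To illustrate why fp16 can lose information where fp32 cannot, here is a minimal sketch using only Python's standard library. It emulates bf16 by keeping the top 16 bits of an fp32 encoding, then round-trips the value through fp16 and fp32. The helper names (`to_bf16`, `via_fp16`, `via_fp32`) and the sample values are made up for this example and are not taken from the conversion scripts used for these files. bf16 has the same exponent range as fp32 but fp16 has a narrower one, so very small (or very large) bf16 values may not survive the trip to fp16.

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping only the top 16 bits of the fp32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def via_fp16(x: float) -> float:
    """Round-trip a value through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def via_fp32(x: float) -> float:
    """Round-trip a value through IEEE single precision (struct format 'f')."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Hypothetical sample weights: one ordinary magnitude, one very small magnitude.
for raw in (1.2345, 1e-6):
    w = to_bf16(raw)
    print(f"bf16 value {w!r}: "
          f"fp32 exact={via_fp32(w) == w}, fp16 exact={via_fp16(w) == w}")
```

In this sketch the ordinary-sized value survives both conversions, while the tiny one lands in fp16's subnormal range and gets rounded. Whether that matters for image quality is a separate question.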

Evaluation

I tried comparing the outputs, but I can't say with any certainty whether these models are significantly better than pure Q8_0. You're probably better off using Q8_0, but I thought I'd share these – maybe someone will find them useful.

Higher bits per weight (bpw) numbers result in slower computation:

 20 s  Q8_0
 23 s  11.024bpw-txt16.gguf
 30 s  fp16
 37 s  16.422bpw-txt32.gguf
310 s  fp32

Update 2024-08-26

New files. This time the only tensors in Q8_0 are some or all of:

double_blocks.*.img_mlp.0.weight
double_blocks.*.img_mlp.2.weight
double_blocks.*.txt_mlp.0.weight
double_blocks.*.txt_mlp.2.weight

double_blocks.*.img_mod.lin.weight
double_blocks.*.txt_mod.lin.weight
single_blocks.*.linear1.weight
single_blocks.*.linear2.weight
single_blocks.*.modulation.lin.weight

flux1-dev-Q8_0-fp32-11.763bpw.gguf
This version has all the above layers in Q8_0.
flux1-dev-Q8_0-fp32-13.962bpw.gguf
This version keeps the first 2 layers of every kind, and the first 4 MLP layers, in fp32.
flux1-dev-Q8_0-fp32-16.161bpw.gguf
This version keeps the first 4 layers of every kind, and the first 8 MLP layers, in fp32.

In the txt16/32 files, I quantized only these layers to Q8_0, unless they were one-dimensional:

img_mlp.0
img_mlp.2
img_mod.lin
linear1
linear2
modulation.lin

But left all these at fp16 or fp32, respectively:

txt_mlp.0
txt_mlp.2
txt_mod.lin
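The selection rule for the txt16/32 files can be sketched as a small predicate. This is only an illustration of the rule as described above, not the actual conversion code: the function name `should_quantize_q8` and the substring matching are assumptions, and the sketch says nothing about which format non-quantized tensors end up in.

```python
# Layer-name fragments quantized to Q8_0 in the txt16/32 files (from the list above).
Q8_KEYS = ("img_mlp.0", "img_mlp.2", "img_mod.lin",
           "linear1", "linear2", "modulation.lin")

def should_quantize_q8(tensor_name: str, n_dims: int) -> bool:
    """Return True if a tensor would go to Q8_0 under the rule described above.

    Hypothetical helper: matches by substring and skips one-dimensional tensors.
    """
    if n_dims <= 1:
        return False  # one-dimensional tensors are not quantized
    return any(key in tensor_name for key in Q8_KEYS)

# Example tensor names following the patterns listed earlier in this card.
print(should_quantize_q8("double_blocks.3.img_mlp.0.weight", 2))
print(should_quantize_q8("double_blocks.3.txt_mlp.0.weight", 2))
print(should_quantize_q8("single_blocks.7.linear1.weight", 2))
```

The txt_* layers never match any of the Q8_0 fragments, so they fall through and stay at fp16 or fp32.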

The resulting bpw number is just an approximation from file size.
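For reference, a rough sketch of that kind of estimate: bits per weight is the file size in bits divided by the parameter count. The function names and the example numbers are hypothetical. The estimate runs slightly high because a GGUF file also contains metadata, and the true parameter count is only approximately 12B. The second helper shows how a mix of formats averages out, using the fact that Q8_0 stores 32 int8 weights plus one fp16 scale per block, i.e. 8.5 bits per weight.

```python
def approx_bpw(file_size_bytes: int, n_params: int) -> float:
    """Estimate bits per weight from total file size (includes metadata overhead)."""
    return file_size_bytes * 8 / n_params

# Bits per weight of each storage format; Q8_0 is 34 bytes per block of 32 weights.
FORMAT_BPW = {"Q8_0": 8.5, "fp16": 16.0, "fp32": 32.0}

def mixed_bpw(fractions: dict) -> float:
    """Weighted-average bpw for a mix, given the fraction of weights in each format."""
    return sum(FORMAT_BPW[fmt] * frac for fmt, frac in fractions.items())

# Hypothetical numbers, for illustration only.
print(approx_bpw(24_000_000_000, 12_000_000_000))  # a 24 GB file with 12B weights
print(mixed_bpw({"Q8_0": 0.5, "fp16": 0.5}))       # half Q8_0, half fp16
```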

This is a direct GGUF conversion of black-forest-labs/FLUX.1-dev.

As this is a quantized model, not a finetune, all the same restrictions and original license terms still apply.

The model files can be used with the ComfyUI-GGUF custom node.

Place model files in ComfyUI/models/unet - see the GitHub readme for further install instructions.

Please refer to this chart for a basic overview of quantization types.

(Model card mostly copied from city96/FLUX.1-dev-gguf - which contains conventional and useful GGUF files.)

GGUF

Model size: 12B params

Architecture: flux

Model tree for mo137/FLUX.1-dev_Q8-fp16-fp32-mix_8-to-32-bpw_gguf

Base model: black-forest-labs/FLUX.1-dev