license: apache-2.0
library_name: diffusers
pipeline_tag: text-to-image

LibreFLUX: A free, de-distilled FLUX model

LibreFLUX is an Apache 2.0 version of FLUX.1-schnell that provides a full T5 context length, uses attention masking, has classifier free guidance restored, and has had most of the FLUX aesthetic fine-tuning/DPO fully removed. That means it's a lot uglier than base flux, but it has the potential to be more easily finetuned to any new distribution. It keeps in mind the core tenets of open source software, that it should be difficult to use, slower and clunkier than a proprietary solution, and have an aesthetic trapped somewhere inside the early 2000s.

The image features a man standing confidently, wearing a simple t-shirt with a humorous and quirky message printed across the front. The t-shirt reads: "I de-distilled FLUX schnell into a slow, ugly model and all I got was this stupid t-shirt." The man’s expression suggests a mix of pride and irony, as if he's aware of the complexity behind the statement, yet amused by the underwhelming reward. The background is neutral, keeping the focus on the man and his t-shirt, which pokes fun at the frustrating and often anticlimactic nature of technical processes or complex problem-solving, distilled into a comically understated punchline.


Usage

Inference

To use the model, just call the custom pipeline using diffusers. It currently works with diffusers==0.30.3 and will be updated to the latest diffusers soon. The model works best with a CFG scale of 2.0 to 5.0, so if you are getting images with a blur or strange shadows try turning down your CFG scale (guidance_scale in diffusers). Alternatively, you can also use higher CFG scales if you turn it off during the first couple of timesteps (no_cfg_until_timestep=2 in the custom pipeline).

# ! pip install diffusers==0.30.3
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "jimmycarter/LibreFLUX",
    custom_pipeline="jimmycarter/LibreFLUX",
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# High VRAM
prompt = "Photograph of a chalk board on which is written: 'I thought what I'd do was, I'd pretend I was one of those deaf-mutes.'"
negative_prompt = "blurry"
images = pipe(
  prompt=prompt,
  negative_prompt=negative_prompt,
  return_dict=False,
  # guidance_scale=3.5,
  # num_inference_steps=28,
  # generator=torch.Generator().manual_seed(42),
  # no_cfg_until_timestep=0,
)
images[0][0].save('chalkboard.png')

# If you have <=24 GB VRAM, try:
# ! pip install optimum-quanto
# Then
from optimum.quanto import freeze, quantize, qint8
# quantize and freeze will take a short amount of time, so be patient.
quantize(
    pipe.transformer,
    weights=qint8,
    exclude=[
        "*.norm", "*.norm1", "*.norm2", "*.norm2_context",
        "proj_out", "x_embedder", "norm_out", "context_embedder",
    ],
)
freeze(pipe.transformer)
pipe.enable_model_cpu_offload()

images = pipe(
  prompt=prompt,
  negative_prompt=negative_prompt,
  device=None,
  return_dict=False,
  do_batch_cfg=False, # https://github.com/huggingface/optimum-quanto/issues/327
  # guidance_scale=3.5,
  # num_inference_steps=28,
  # generator=torch.Generator().manual_seed(42),
  # no_cfg_until_timestep=0,
)
images[0][0].save('chalkboard.png')

For usage in ComfyUI, a single transformer file is provided, but note that ComfyUI does not presently support attention masks, so your images may be degraded.

Fine-tuning

The model can be easily finetuned using SimpleTuner with the --flux_attention_masked_training training option and the model found in jimmycarter/LibreFlux-SimpleTuner. This is the same model with the custom pipeline removed, since the custom pipeline currently interferes with SimpleTuner's ability to finetune it. SimpleTuner has extensive support for parameter-efficient fine-tuning via LyCORIS, in addition to full-rank fine-tuning. For inference, use the custom pipeline from this repo and follow the example in SimpleTuner to patch in your LyCORIS weights.

import torch
from diffusers import DiffusionPipeline
from lycoris import create_lycoris_from_weights

pipe = DiffusionPipeline.from_pretrained(
    "jimmycarter/LibreFLUX",
    custom_pipeline="jimmycarter/LibreFLUX",
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

lycoris_safetensors_path = 'pytorch_lora_weights.safetensors'
wrapper, _ = create_lycoris_from_weights(1.0, lycoris_safetensors_path, pipe.transformer)
wrapper.merge_to()
del wrapper

prompt = "Photograph of a chalk board on which is written: 'I thought what I'd do was, I'd pretend I was one of those deaf-mutes.'"
negative_prompt = "blurry"
images = pipe(
  prompt=prompt,
  negative_prompt=negative_prompt,
  return_dict=False,
)
images[0][0].save('chalkboard.png')

# optionally, save a merged pipeline containing the LyCORIS baked-in:
# pipe.save_pretrained('/path/to/output/pipeline')

Non-technical Report on Schnell De-distillation

Welcome to my non-technical report on de-distilling FLUX.1-schnell in the most un-scientific way possible with extremely limited resources. I'm not going to claim I made a good model, but I did make a model. It was trained on about 1,500 H100 hour equivalents.

Everyone is ~~an artist~~ a machine learning researcher.

Why

FLUX is a good text-to-image model, but the only versions of it that are out are distilled. FLUX.1-dev is distilled so that you don't need to use CFG (classifier free guidance): instead of making one sample each for conditional (your prompt) and unconditional (negative prompt), you only have to make the sample for conditional. This means that FLUX.1-dev is twice as fast as the model without distillation.

FLUX.1-schnell (German for "fast") is further distilled so that you only need 4 steps of conditional generation to get an image. Importantly, FLUX.1-schnell has an Apache-2.0 license, so you can use it freely without having to obtain a commercial license from Black Forest Labs. Out of the box, schnell is pretty bad when you use CFG unless you skip the first couple of steps.

The distilled FLUX models are created from their base, non-distilled models by training a student model (distilled) on output from the teacher model (non-distilled), along with some tricks like an adversarial network.

For de-distilled models, image generation takes a little less than twice as long because you need to compute a sample for both conditional and unconditional images at each step. The benefit is you can use them commercially for free, training is a little easier, and they may be more creative.
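As a rough illustration of why CFG needs two predictions per step, the sketch below combines a conditional and an unconditional prediction in the way classifier-free guidance is commonly written. The function name `cfg_combine` and the toy numbers are made up for illustration and are not taken from the LibreFLUX pipeline.

```python
def cfg_combine(cond, uncond, guidance_scale):
    """Classifier-free guidance: move the prediction away from the
    unconditional output and toward the conditional one."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


# Toy "model outputs" for one denoising step (hypothetical numbers).
cond_pred = [1.0, 2.0, 3.0]    # prediction given the prompt
uncond_pred = [0.5, 2.0, 2.0]  # prediction given the negative prompt

# With guidance_scale=1.0 the result equals the conditional prediction,
# so the unconditional forward pass contributes nothing.
print(cfg_combine(cond_pred, uncond_pred, 1.0))
print(cfg_combine(cond_pred, uncond_pred, 3.5))
```

Both predictions come from a full forward pass of the transformer, which is where the extra inference cost of a de-distilled model comes from.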

Restoring the original training objective

This part is actually really easy. You just train it on the normal flow-matching objective with MSE loss and the model starts learning how to do it again. That being said, I don't think either LibreFLUX or OpenFLUX.1 managed to fully de-distill the model. The evidence I see for that is that both models will either get strange shadows that overwhelm the image or blurriness when using CFG scale values greater than 4.0. Neither of us trained very long in comparison to the training for the original model (assumed to be around 0.5-2.0m H100 hours), so it's not particularly surprising.

FLUX and attention masking

FLUX models use a text model called T5-XXL to get most of their conditioning for the text-to-image task. Importantly, they pad the text out to either 256 (schnell) or 512 (dev) tokens. 512 tokens is the maximum trained length for the model. By padding, I mean they repeat the last token until the sequence reaches this length.

This results in the model using these padding tokens to store information. When you visualize the attention maps of the tokens in the padding segment of the text encoder, you can see that about 10-40 tokens shortly after the last token of the text and about 10-40 tokens at the end of the padding contain information which the model uses to make images. Because these are normally used to store information, it means that any prompt long enough to not have some of these padding tokens will end up with degraded performance.

It's easy to prevent this by masking out these padding tokens during attention. BFL and their engineers know this, but they probably decided against it because it works as is, and because most fast implementations of attention only support causal (LLM-style) masking, so skipping the padding mask would let them train faster.
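To make the idea concrete, here is a minimal sketch of masking padding positions before the softmax in attention. It uses plain Python lists and invented scores; the helper name `masked_attention_weights` is hypothetical and this is not the code used in the custom pipeline.

```python
import math


def masked_attention_weights(scores, keep_mask):
    """Softmax over attention scores, ignoring positions where keep_mask is False."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, keep_mask)]
    peak = max(masked)  # assumes at least one position is kept
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores of one query against 3 real tokens and 3 padding tokens.
scores = [2.0, 1.0, 0.5, 1.5, 1.5, 1.5]
keep_mask = [True, True, True, False, False, False]

weights = masked_attention_weights(scores, keep_mask)
print([round(w, 3) for w in weights])
```

The padding positions end up with exactly zero weight, so the model can no longer read from or store information in them.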

I had already implemented attention masking and I wanted to be able to use all 512 tokens without degradation, so I did my finetune with it on. Small-scale finetunes with it on tend to damage the model, but since I needed to train schnell a lot to get it out of distillation anyway, I figured adding it probably didn't matter.

Note that FLUX.1-schnell was only trained on 256 tokens, so my finetune allows users to use the whole 512 token sequence length.

Make de-distillation go fast and fit in small GPUs

I avoided doing any full-rank (normal, all parameters) fine-tuning at all, since FLUX is big. I trained initially with the model in int8 precision using quanto. I started with a 600 million parameter LoKr, since LoKr tends to approximate full-rank fine-tuning better than LoRA. The loss was really slow to go down when I began, so after poking around the code that initializes the LoKr matrices I settled on this function, which injects noise at a fraction of the magnitude of the layers it is applied to.

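import torch


def approximate_normal_tensor(inp, target, scale=1.0):
    # Fill `target` in place with Gaussian noise matched to the mean and std
    # of `inp`, with everything multiplied by `scale` at the end.
    tensor = torch.randn_like(target)
    desired_norm = inp.norm()
    desired_mean = inp.mean()
    desired_std = inp.std()

    # Rescale to the norm of `inp`, then to its standard deviation.
    current_norm = tensor.norm()
    tensor = tensor * (desired_norm / current_norm)
    current_std = tensor.std()
    tensor = tensor * (desired_std / current_std)
    # Re-center on the mean of `inp` and apply the final scale.
    tensor = tensor - tensor.mean() + desired_mean
    tensor.mul_(scale)

    target.copy_(tensor)


def init_lokr_network_with_perturbed_normal(lycoris, scale=1e-3):
    # Set w1 to all ones and w2 to scaled noise based on the original weights.
    with torch.no_grad():
        for lora in lycoris.loras:
            lora.lokr_w1.fill_(1.0)
            approximate_normal_tensor(lora.org_weight, lora.lokr_w2, scale=scale)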

This isn't normal PEFT (parameter efficient fine-tuning) anymore, because this will perturb all the weights of the model slightly in the beginning. It doesn't seem to cause any performance degradation in the model after testing and it made the loss fall for my LoKr twice as fast, so I used it with scale=1e-3. The LoKr weights I trained in bfloat16, with the adamw_bf16 optimizer that I ~~plagiarized~~ wrote with the magic of open source software.
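To see why this initialization nudges every weight, recall that LoKr builds its weight update from a Kronecker product of two smaller matrices. The sketch below is a simplified, pure-Python picture of that product with invented 2x2 values; it leaves out the scaling factor and any further decomposition LyCORIS may apply, so treat it as an assumption-laden toy rather than the library's implementation.

```python
def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(cols_b)]
        for i in range(len(a))
        for k in range(rows_b)
    ]


w1 = [[1.0, 1.0], [1.0, 1.0]]         # filled with ones, as in the init above
w2 = [[0.001, -0.002], [0.003, 0.0]]  # small noise (made-up values)

delta = kron(w1, w2)  # 4x4 update built from only 8 stored numbers
for row in delta:
    print(row)
```

With w1 set to all ones, the update is just w2 tiled across the full matrix, so non-zero noise in w2 shows up everywhere in the weight delta instead of starting at zero as in the usual LoRA-style initialization.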

Selecting better layers to train with LoKr

FLUX is a pretty standard transformer model aside from some peculiarities. One of these peculiarities is in their "norm" layers, which contain non-linearities so they don't act like norms except for a single normalization that is applied in the layer without any weights (LayerNorm with elementwise_affine=False). When you fine-tune and look at what changes, these layers seem to be among the ones that change the most.

The other thing about transformers is that all the heavy lifting is most often done at the start and end layers of the network, so you may as well fine-tune those more than other layers. When I looked at the cosine similarity of the hidden states between each block in diffusion transformers, it more or less reflected what was observed with LLMs. So I made a pull-request to the LyCORIS repository (that maintains a LoKr implementation) that lets you more easily pick individual layers and set different factors on them, then focused my LoKr on these layers.
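The kind of comparison described above can be sketched as follows: take the hidden state after each block and compute the cosine similarity between consecutive ones. The vectors here are invented toy values, and `cosine_similarity` is a hypothetical helper, not the analysis code used for LibreFLUX.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical flattened hidden states after each of four blocks.
hidden_states = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.6, 0.8, 0.1],
    [0.0, 0.1, 1.0],
]

# Similarity between the output of each block and the next one.
sims = [cosine_similarity(a, b) for a, b in zip(hidden_states, hidden_states[1:])]
print([round(s, 3) for s in sims])
```

A low similarity between consecutive states suggests that block changed the representation a lot, which is one heuristic for deciding where to spend trainable parameters.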

Beta timestep scheduling and timestep stratification

One problem with diffusion models is that they are multi-task (different timesteps are considered different tasks) and the tasks all tend to be associated with differently shaped and sized gradients and different magnitudes of loss. This is very much not a big deal when you have a huge batch size, so the timesteps of the model all get more or less sampled evenly and the gradients are smoothed out and have less variance. I also knew that the schnell model had more problems with image distortions caused by sampling at the high-noise timesteps, so I did two things:

  1. Implemented a Beta schedule that approximates the original sigmoid sampling, which let me shift the sampled timesteps toward the high-noise steps, similar to but less extreme than some of the alternative sampling methods in the SD3 research paper.
  2. Implemented multi-rank stratified sampling, so that at each training step the timesteps were drawn per batch from separate regions of the schedule. This normalizes the gradients significantly, much like a higher batch size would.
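
import torch
from scipy.stats import beta as sp_beta

# This snippet runs inside the trainer: `self.accelerator` and `bsz`
# (the per-process batch size) come from the surrounding code.
alpha = 2.0
beta = 1.6
num_processes = self.accelerator.num_processes
process_index = self.accelerator.process_index
total_bsz = num_processes * bsz
# Each process owns `bsz` consecutive strata of [0, 1) and draws one
# uniform sample from each of them.
start_idx = process_index * bsz
end_idx = (process_index + 1) * bsz
indices = torch.arange(start_idx, end_idx, dtype=torch.float64)
u = torch.rand(bsz)
p = (indices + u) / total_bsz
# Map the stratified uniforms through the inverse CDF of Beta(alpha, beta).
sigmas = torch.from_numpy(
    sp_beta.ppf(p.numpy(), a=alpha, b=beta)
).to(device=self.accelerator.device)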
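For readers without a multi-GPU setup, the stratification step on its own can be simulated in a single process. This sketch only reproduces the uniform, per-rank strata; the mapping through the Beta inverse CDF is left out to keep it dependency-free, and the function name `stratified_uniform` is hypothetical.

```python
import random


def stratified_uniform(process_index, num_processes, bsz, rng):
    """Draw bsz values in [0, 1), one from each stratum owned by this rank."""
    total_bsz = num_processes * bsz
    start_idx = process_index * bsz
    return [(start_idx + i + rng.random()) / total_bsz for i in range(bsz)]


rng = random.Random(0)
num_processes, bsz = 4, 2

# Simulate what each rank would draw during one training step.
for rank in range(num_processes):
    p = stratified_uniform(rank, num_processes, bsz, rng)
    print(rank, [round(x, 3) for x in p])
```

Across all ranks, exactly one sample falls in each of the `num_processes * bsz` equal-width slices of [0, 1), so every global batch covers the whole timestep range instead of clustering by chance.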

Datasets

No one talks about what datasets they train on anymore, but I used open ones from the web, captioned with VLMs, with 2-3 captions per image. There was at least one short and one long caption for every image. The datasets were diverse and most of them did not have aesthetic selection, which helped direct the model away from the traditional hyper-optimized image generation of text-to-image models. Many people think that looks worse, but I like that it can make a diverse pile of images. The model was trained on about 0.5 million high resolution images in both random square crops and random aspect ratio crops.

Training

I started by training for over a month on 5x 3090s with about 500,000 images. I used a 600m LoKr for this at batch size 1 (effective batch size 5 via DDP). The model looked okay afterwards. Then I unexpectedly gained access to 7x H100s, so I merged my PEFT model in and began training a new LoKr with 3.2b parameters. For the 7x H100 run I used a batch size of 6 (effective batch size 42 via DDP).

Post-hoc "EMA"

I've been too lazy to implement real post-hoc EMA like from EDM2, but to approximate it I saved all the checkpoints from the H100 runs and then LERPed them iteratively with different alpha values. I evaluated those checkpoints at different CFG scales to see if any of them were superior to the last checkpoint.

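import os

import torch
from safetensors.torch import load_file, save_file

# `checkpoint_files` (assumed ordered oldest to newest), `alpha`, and
# `output_folder` are expected to be defined by the caller.
first_checkpoint_file = checkpoint_files[0]
ema_state_dict = load_file(first_checkpoint_file)
for checkpoint_file in checkpoint_files[1:]:
    new_state_dict = load_file(checkpoint_file)
    for k in ema_state_dict.keys():
        # ema <- ema + alpha * (new - ema)
        ema_state_dict[k] = torch.lerp(
            ema_state_dict[k],
            new_state_dict[k],
            alpha,
        )

output_file = os.path.join(output_folder, f"alpha_linear_{alpha}.safetensors")
save_file(ema_state_dict, output_file)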

After looking at all models in alphas [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.975, 0.99, 0.995, 0.999], I ended up settling on alpha 0.9 using the power of my eyeballs. If I am being frank, many of the EMA models looked remarkably similar and had the same kind of "rolling around various minima" qualities that training does in general.
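To get a feel for what a given alpha means here, the following sketch works out how much each checkpoint contributes to the final average when checkpoints are folded together with repeated `lerp(ema, new, alpha)`, that is, `ema + alpha * (new - ema)`. The helper name `lerp_chain_weights` and the choice of five checkpoints are illustrative assumptions.

```python
def lerp_chain_weights(num_checkpoints, alpha):
    """Effective weight of each checkpoint after repeated
    ema = ema + alpha * (new - ema), oldest checkpoint first."""
    weights = [1.0]  # the first checkpoint starts with all the weight
    for _ in range(num_checkpoints - 1):
        # Every earlier checkpoint is shrunk by (1 - alpha); the new one gets alpha.
        weights = [w * (1.0 - alpha) for w in weights] + [alpha]
    return weights


for alpha in (0.2, 0.9):
    w = lerp_chain_weights(5, alpha)
    print(alpha, [round(x, 4) for x in w])
```

Under this convention a larger alpha puts more weight on the most recent checkpoint: with alpha 0.9 the last checkpoint alone contributes 90% of the result, which may be part of why many of the averaged models looked so similar to each other.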

Results

I will go over the results briefly, but I'll start with the images.

Figure 1. Some side-by-side images of LibreFLUX and OpenFLUX.1. They were made using diffusers, with 512-token maximum length text embeddings for LibreFLUX and 256-token maximum length for OpenFLUX.1. LibreFLUX had attention masking on while OpenFLUX did not. The models were sampled with 35 steps at various resolutions. The negative prompt for both was simply "blurry". All inference was done with the transformer quantized to int8 by quanto.

A cinematic style shot of a polar bear standing confidently in the center of a vibrant nightclub. The bear is holding a large sign that reads 'Open Source! Apache 2.0' in one arm and giving a thumbs up with the other arm. Around him, the club is alive with energy as colorful lasers and disco lights illuminate the scene. People are dancing all around him, wearing glowsticks and candy bracelets, adding to the fun and electric atmosphere. The polar bear's white fur contrasts against the dark, neon-lit background, and the entire scene has a surreal, festive vibe, blending technology activism with a lively party environment.

widescreen, vintage style from 1970s, Extreme realism in a complex, highly detailed composition featuring a woman with extremely long flowing rainbow-colored hair. The glowing background, with its vibrant colors, exaggerated details, intricate textures, and dynamic lighting, creates a whimsical, dreamy atmosphere in photorealistic quality. Threads of light that float and weave through the air, adding movement and intrigue. Patterns on the ground or in the background that glow subtly, adding a layer of complexity.Rainbows that appear faintly in the background, adding a touch of color and wonder.Butterfly wings that shimmer in the light, adding life and movement to the scene.Beams of light that radiate softly through the scene, adding focus and direction. The woman looks away from the camera, with a soft, wistful expression, her hair framing her face.

a highly detailed and atmospheric, painted western movie poster with the title text "Once Upon a Lime in the West" in a dark red western-style font and the tagline text "There were three men ... and one very sour twist", with movie credits at the bottom, featuring small white text detailing actor and director names and production company logos, inspired by classic western movie posters from the 1960s, an oversized lime is the central element in the middle ground of a rugged, sun-scorched desert landscape typical of a western, the vast expanse of dry, cracked earth stretches toward the horizon, framed by towering red rock formations, the absurdity of the lime is juxtaposed with the intense gravitas of the stoic, iconic gunfighters, as if the lime were as formidable an adversary as any seasoned gunslinger, in the foreground, the silhouettes of two iconic gunfighters stand poised, facing the lime and away from the viewer, the lime looms in the distance like a final showdown in the classic western tradition, in the foreground, the gunfighters stand with long duster coats flowing in the wind, and wide-brimmed hats tilted to cast shadows over their faces, their stances are tense, as if ready for the inevitable draw, and the weapons they carry glint, the background consists of the distant town, where the sun is casting a golden glow, old wooden buildings line the sides, with horses tied to posts and a weathered saloon sign swinging gently in the wind, in this poster, the lime plays the role of the silent villain, an almost mythical object that the gunfighters are preparing to confront, the tension of the scene is palpable, the gunfighters in the foreground have faces marked by dust and sweat, their eyes narrowed against the bright sunlight, their expressions are serious and resolute, as if they have come a long way for this final duel, the absurdity of the lime is in stark contrast with their stoic demeanor, a wide, panoramic shot captures the entire scene, with the 
gunfighters in the foreground, the lime in the mid-ground, and the town on the horizon, the framing emphasizes the scale of the desert and the dramatic standoff taking place, while subtly highlighting the oversized lime, the camera is positioned low, angled upward from the dusty ground toward the gunfighters, with the distant lime looming ahead, this angle lends the figures an imposing presence, while still giving the lime an absurd grandeur in the distance, the perspective draws the viewer’s eye across the desert, from the silhouettes of the gunfighters to the bizarre focal point of the lime, amplifying the tension, the lighting is harsh and unforgiving, typical of a desert setting, with the evening sun casting deep shadows across the ground, dust clouds drift subtly across the ground, creating a hazy effect, while the sky above is a vast expanse of pale blue, fading into golden hues near the horizon where the sun begins to set, the poster is shot as if using classic anamorphic lenses to capture the wide, epic scale of the desert, the color palette is warm and saturated, evoking the look of a classic spaghetti western, the lime looms unnaturally in the distance, as if conjured from the land itself, casting an absurdly grand shadow across the rugged landscape, the texture and detail evoke hand-painted, weathered posters from the golden age of westerns, with slightly frayed edges and faint creases mimicking the wear of vintage classics

A boxed action figure of a beautiful elf girl witch wearing a skimpy black leotard, black thigh highs, black armlets, and a short black cloak. Her hair is pink and shoulder-length. Her eyes are green. She is a slim and attractive elf with small breasts. The accessories include an apple, magic wand, potion bottle, black cat, jack o lantern, and a book. The box is orange and black with a logo near the bottom of it that says "BAD WITCH". The box is on a shelf on the toy aisle.

A cute blonde woman in bikini and her doge are sitting on a couch cuddling and the expressive, stylish living room scene with a playful twist. The room is painted in a soothing turquoise color scheme, stylish living room scene bathed in a cool, textured turquoise blanket and adorned with several matching turquoise throw pillows. The room's color scheme is predominantly turquoise, relaxed demeanor. The couch is covered in a soft, reflecting light and adding to the vibrant blue hue., dark room with a sleek, spherical gold decorations, This photograph captures a scene that is whimsically styled in a vibrant, reflective cyan sunglasses. The dog's expression is cheerful, metallic fabric sofa. The dog, soothing atmosphere.

Selfie of a woman in front of the eiffel tower, a man is standing next to her and giving a thumbs up

An image contains three motivational phrases, all in capitalized stylized text on a colorful background: 1. At the top: "PAIN HEALS" 2. In the middle, bold and slightly larger: "CHICKS DIG SCARS" 3. At the bottom: "GLORY LASTS FOREVER"

An illustration featuring a McDonald's on the moon. An anthropomorphic cat in a pink top and blue jeans is ordering McDonald's, while a zebra cashier stands behind the counter. The moon's surface is visible outside the windows, with craters and a distant view of Earth. The interior of the McDonald's is similar to those on Earth but adapted to the lunar environment, with vibrant colors and futuristic design elements. The overall scene is whimsical and imaginative, blending everyday life with a fantastical setting.

LibreFLUX and OpenFLUX have their strengths and weaknesses. OpenFLUX was de-distilled using the outputs of FLUX.1-schnell, which might explain why it's worse at text but also has the FLUX hyperaesthetics. Text-to-image models don't have any good metrics, so past a certain point of "soupiness" and single-digit FID you just need to look at the model's output and see if it fits your idea of nice pictures.

Both models appear to be terrible at making drawings. Because people are probably curious to see the non-cherry picks, I've included CFG sweep comparisons of both LibreFLUX and OpenFLUX.1 here. I'm not going to say this is the best model ever, but it might be a springboard for people wanting to finetune better models from.

Closing thoughts

If I had to do it again, I'd probably raise the learning rate more on the H100 run. There was a bug in SimpleTuner that caused me to not use the initialization trick when on the H100s, then timestep stratification ended up quieting down the gradient magnitudes even more and caused the model to learn very slowly at 1e-5. I realized this when looking at the results of EMA on the final FLUX.1-dev. The H100s really came out of nowhere as I just got an IP address to shell into late one night around 10PM and ended up staying up all night to get everything running, so in the future I'm sure I would be more prepared.

For de-distillation of schnell I think you probably need a lot more than 1,500 H100-equivalent hours. I am very tired of training FLUX and am looking forward to a better model with fewer parameters. The model learns new concepts slowly even when given piles of well-labeled data. Given the history of LLMs, we now have models like LLaMA 3.1 8B that trade blows with GPT3.5 175B and I am hopeful that the future holds smaller, faster models that look better.

As far as what I think of the FLUX "open source", many models being trained and released today are attempts at raising VC cash and I have noticed a mountain of them being promoted on Twitter. Since a16z poached the entire SD3 dev team from Stability.ai the field feels more toxic than ever, but I am hopeful for individuals and research labs to selflessly lead the path forward for open weights. I made zero dollars on this and have made zero dollars on ML to date, but I try to make contributions where I can.

I would like to thank RunWare for the H100 access.

Contacting me and grants

You can contact me by opening an issue on the discuss page of this model. If you want to speak privately about grants because you want me to continue training this or give me a means to conduct reproducible research, leave an email address too.

Citation

@misc{libreflux,
  author = {James Carter},
  title = {LibreFLUX: A free, de-distilled FLUX model},
  year = {2024},
  publisher = {Huggingface},
  journal = {Huggingface repository},
  howpublished = {\url{https://huggingface.co/datasets/jimmycarter/libreflux}},
}