Würstchen text-to-image fine-tuning

Running locally with PyTorch

Before running the scripts, make sure to install the library's training dependencies:

Important

To make sure you can successfully run the latest versions of the example scripts, we highly recommend installing from source and keeping the install up to date. To do this, execute the following steps in a new virtual environment:

git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

Then cd into the example folder and run

cd examples/wuerstchen/text_to_image
pip install -r requirements.txt

And initialize an 🤗Accelerate environment with:

accelerate config

For this example we want to store the trained weights directly on the Hub, so we need to be logged in and add the --push_to_hub flag to the training commands. To log in, run:

huggingface-cli login

Prior training

You can fine-tune the Würstchen prior model with the train_text_to_image_prior.py script. Note that --gradient_checkpointing is currently supported for prior model fine-tuning, so you can use the script in more GPU-memory-constrained setups.


export DATASET_NAME="lambdalabs/naruto-blip-captions"

accelerate launch train_text_to_image_prior.py \
  --mixed_precision="fp16" \
  --dataset_name=$DATASET_NAME \
  --resolution=768 \
  --train_batch_size=4 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --dataloader_num_workers=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --checkpoints_total_limit=3 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --validation_prompts="A robot naruto, 4k photo" \
  --report_to="wandb" \
  --push_to_hub \
  --output_dir="wuerstchen-prior-naruto-model"

Training with LoRA

Low-Rank Adaptation of Large Language Models (LoRA) was first introduced by Microsoft in LoRA: Low-Rank Adaptation of Large Language Models by Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.

In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition matrices to existing weights and only training those newly added weights. This has a couple of advantages:

  • The previously pretrained weights are kept frozen, so the model is not prone to catastrophic forgetting.
  • The rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable.
  • LoRA attention layers let you control the extent to which the model is adapted toward the new training images via a scale parameter, as sketched below.
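
To make these points concrete, here is a small, purely illustrative PyTorch sketch of a LoRA-augmented linear layer (the training script applies LoRA to the prior's attention layers through diffusers, not through this class): the pretrained weight stays frozen, only the rank-r pair A/B is trained, and scale = alpha / rank controls how strongly the adaptation is applied.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Illustrative LoRA wrapper: y = W x + scale * B(A(x)), with W frozen.
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the pretrained weights frozen

        # Rank-decomposition pair: far fewer parameters than the full weight matrix.
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.lora_A.weight, std=1e-3)
        nn.init.zeros_(self.lora_B.weight)  # adaptation starts as a no-op

        self.scale = alpha / rank  # how strongly the adaptation is applied

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

# Only the low-rank matrices are trainable.
layer = LoRALinear(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable}")  # 6,144 vs. ~590k frozen in the base layer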

Prior Training

First, set up your development environment as explained in the installation steps above. Make sure to set the DATASET_NAME environment variable. Here, we will use the Naruto captions dataset.

export DATASET_NAME="lambdalabs/naruto-blip-captions"

accelerate launch train_text_to_image_lora_prior.py \
  --mixed_precision="fp16" \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=768 \
  --train_batch_size=8 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --rank=4 \
  --validation_prompt="cute dragon creature" \
  --report_to="wandb" \
  --push_to_hub \
  --output_dir="wuerstchen-prior-naruto-lora"