|
# Stable Diffusion text-to-image fine-tuning |
|
This extended LoRA training script was authored by [haofanwang](https://github.com/haofanwang). |
|
This is an experimental LoRA extension of [this example](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py). It additionally supports adding LoRA layers to the text encoder.
|
|
|
## Training with LoRA |
|
|
|
Low-Rank Adaptation of Large Language Models was first introduced by Microsoft in [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*.
|
|
|
In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition matrices to existing weights and **only** training those newly added weights (a minimal sketch follows the list below). This has a few advantages:
|
|
|
- The previous pretrained weights are kept frozen, so the model is not as prone to [catastrophic forgetting](https://www.pnas.org/doi/10.1073/pnas.1611835114).
|
- Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable.
|
- LoRA attention layers allow you to control the extent to which the model is adapted toward new training images via a `scale` parameter.
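
To make the idea concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The class name `LoRALinear` and the defaults are illustrative only; the training script builds the actual layers for you (via PEFT when `--use_peft` is passed):

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # the pretrained weights stay frozen
            p.requires_grad = False
        # the rank-decomposition pair: project down to rank r, then back up
        self.lora_down = nn.Linear(base.in_features, r, bias=False)
        self.lora_up = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)  # so the update starts as a no-op
        self.scale = alpha / r  # the `scale` knob mentioned above

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_up(self.lora_down(x))
```

Only the two small `lora_down`/`lora_up` matrices receive gradients, which is why the resulting LoRA checkpoints are tiny and easy to share.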
|
|
|
[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository. |
|
|
|
With LoRA, it's possible to fine-tune Stable Diffusion on a custom image-caption dataset on consumer GPUs like the Tesla T4 or Tesla V100.
|
|
|
### Training |
|
|
|
First, you need to set up your development environment as explained in the [installation section](#installing-the-dependencies). Make sure to set the `MODEL_NAME` and `DATASET_NAME` environment variables. Here, we will use [Stable Diffusion v1-4](https://hf.co/CompVis/stable-diffusion-v1-4) and the [Naruto BLIP captions dataset](https://huggingface.co/datasets/lambdalabs/naruto-blip-captions).
|
|
|
**___Note: Change the `resolution` to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model.___** |
|
|
|
**___Note: It is quite useful to monitor the training progress by regularly generating sample images during training. [Weights and Biases](https://docs.wandb.ai/quickstart) is a nice solution to easily see generated images during training. All you need to do is to run `pip install wandb` before training to automatically log images.___**
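
In practice that means running something like the following before launching training (the `wandb login` step will ask for your W&B API key the first time):

```bash
pip install wandb
wandb login
```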
|
|
|
```bash |
|
export MODEL_NAME="CompVis/stable-diffusion-v1-4" |
|
export DATASET_NAME="lambdalabs/naruto-blip-captions" |
|
``` |
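
If you want to take a quick look at the dataset before training, you can load it with 🤗 Datasets. The `text` column printed below is the same one passed to `--caption_column` in the training command further down:

```python
from datasets import load_dataset

dataset = load_dataset("lambdalabs/naruto-blip-captions", split="train")
print(dataset)             # row count and column names
print(dataset[0]["text"])  # an example caption
```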
|
|
|
For this example we want to directly store the trained LoRA embeddings on the Hub, so we need to be logged in and add the `--push_to_hub` flag.
|
|
|
```bash |
|
huggingface-cli login |
|
``` |
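
Alternatively, if you're working from a notebook, you can log in from Python with `huggingface_hub` (use a token with write access so the weights can be pushed):

```python
from huggingface_hub import login

login()  # prompts for a Hugging Face access token
```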
|
|
|
Now we can start training! |
|
|
|
```bash |
|
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \ |
|
--pretrained_model_name_or_path=$MODEL_NAME \ |
|
--dataset_name=$DATASET_NAME --caption_column="text" \ |
|
--resolution=512 --random_flip \ |
|
--train_batch_size=1 \ |
|
--num_train_epochs=100 --checkpointing_steps=5000 \ |
|
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \ |
|
--seed=42 \ |
|
--output_dir="sd-naruto-model-lora" \ |
|
--validation_prompt="cute dragon creature" --report_to="wandb" \
--push_to_hub \
|
--use_peft \ |
|
--lora_r=4 --lora_alpha=32 \ |
|
--lora_text_encoder_r=4 --lora_text_encoder_alpha=32 |
|
``` |
|
|
|
The above command will also run inference as fine-tuning progresses and log the results to Weights and Biases. |
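
For context, when `--use_peft` is set, the `--lora_r`/`--lora_alpha` and `--lora_text_encoder_r`/`--lora_text_encoder_alpha` values are roughly what end up in the PEFT `LoraConfig` objects for the UNet and the text encoder. A simplified sketch; the exact `target_modules` used by the script may differ:

```python
from peft import LoraConfig

# corresponds to --lora_r=4 --lora_alpha=32
unet_lora_config = LoraConfig(
    r=4,
    lora_alpha=32,
    target_modules=["to_q", "to_v", "query", "value"],  # UNet attention projections (illustrative)
)

# corresponds to --lora_text_encoder_r=4 --lora_text_encoder_alpha=32
text_encoder_lora_config = LoraConfig(
    r=4,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # text encoder attention projections (illustrative)
)
```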
|
|
|
**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*. Also, by using LoRA, it's possible to run `train_text_to_image_lora.py` on consumer GPUs like the T4 or V100.___**
|
|
|
The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finetuned-lora-t4](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4). **___Note: [The final weights](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) are only 3 MB in size, which is orders of magnitude smaller than the original model.___**
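
To verify the size yourself, you can download just that file with `huggingface_hub` (the repo id and filename come from the links above):

```python
import os
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="sayakpaul/sd-model-finetuned-lora-t4",
    filename="pytorch_lora_weights.bin",
)
print(f"{os.path.getsize(weights_path) / 1e6:.1f} MB")  # ~3 MB vs. several GB for the full model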
|
|
|
You can check some inference samples that were logged during the course of the fine-tuning process [here](https://wandb.ai/sayakpaul/text2image-fine-tune/runs/q4lc0xsw). |
|
|
|
### Inference |
|
|
|
Once you have trained a model using the above command, you can run inference simply with the `StableDiffusionPipeline` after loading the trained LoRA weights. You need to pass the `output_dir` for loading the LoRA weights, which in this case is `sd-naruto-model-lora` (the snippet below loads weights that were pushed to the Hub instead).
|
|
|
```python |
|
from diffusers import StableDiffusionPipeline |
|
import torch |
|
|
|
model_path = "sayakpaul/sd-model-finetuned-lora-t4" |
|
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16) |
|
pipe.unet.load_attn_procs(model_path) |
|
pipe.to("cuda") |
|
|
|
prompt = "A naruto with green eyes and red legs." |
|
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0] |
|
image.save("naruto.png") |
|
``` |
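
If you trained locally and did not push to the Hub, point the loading call at your `output_dir` instead. Continuing from the snippet above, you can also dial the LoRA effect up or down at inference time via the `scale` entry of `cross_attention_kwargs` (1.0 applies the full LoRA update, 0.0 falls back to the base model):

```python
# load the locally saved weights instead of the Hub copy
pipe.unet.load_attn_procs("sd-naruto-model-lora")

# blend the LoRA update at half strength
image = pipe(
    "A naruto with green eyes and red legs.",
    num_inference_steps=30,
    guidance_scale=7.5,
    cross_attention_kwargs={"scale": 0.5},
).images[0]
image.save("naruto_scale_0.5.png")
```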