# T2I-Adapter

[T2I-Adapter](https://hf.co/papers/2302.08453) is a lightweight adapter model that provides an additional conditioning input image (line art, canny, sketch, depth, pose) to better control image generation. It is similar to a ControlNet, but it is a lot smaller (~77M parameters and ~300MB file size) because it only inserts weights into the UNet instead of copying and training it.

The T2I-Adapter is only available for training with the Stable Diffusion XL (SDXL) model.

This guide explores the [train_t2i_adapter_sdxl.py](https://github.com/huggingface/diffusers/blob/main/examples/t2i_adapter/train_t2i_adapter_sdxl.py) training script to help you become familiar with it and to show how you can adapt it for your own use case.

Before running the script, make sure you install the library from source:

```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

Then navigate to the example folder containing the training script and install the required dependencies for the script you're using:

```bash
cd examples/t2i_adapter
pip install -r requirements.txt
```

🤗 Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the 🤗 Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.

Initialize an 🤗 Accelerate environment:

```bash
accelerate config
```

To set up a default 🤗 Accelerate environment without choosing any configurations:

```bash
accelerate config default
```

Or if your environment doesn't support an interactive shell, like a notebook, you can use:

```py
from accelerate.utils import write_basic_config

write_basic_config()
```

Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
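As a rough illustration of what such a dataset contains, each training example pairs a target image, a conditioning image, and a caption. The sketch below checks plain dictionaries for those three fields. The column names are an assumption based on the `fusing/fill50k` dataset used later in this guide (`image`, `conditioning_image`, `text`), so verify them against your own data and the script's column parameters. `missing_columns` is a hypothetical helper written for this example and is not part of Diffusers.

```py
# Column names are an assumption modeled on the fusing/fill50k dataset.
REQUIRED_COLUMNS = ("image", "conditioning_image", "text")


def missing_columns(record, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent or empty in one record."""
    return [name for name in required if not record.get(name)]


# A complete example: target image, conditioning image, and caption
complete = {
    "image": "images/0.png",
    "conditioning_image": "conditioning_images/0.png",
    "text": "pale golden rod circle with old lace background",
}

# An incomplete example that lacks the conditioning image
incomplete = {"image": "images/1.png", "text": "red circle with blue background"}

print(missing_columns(complete))
print(missing_columns(incomplete))
```

Running a check like this over a few records before launching a long training run can catch a misnamed column early.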
The following sections highlight parts of the training script that are important for understanding how to modify it, but they don't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/t2i_adapter/train_t2i_adapter_sdxl.py) and let us know if you have any questions or concerns.

## Script parameters

The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L233) function. It provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.

For example, to activate gradient accumulation, add the `--gradient_accumulation_steps` parameter to the training command:

```bash
accelerate launch train_t2i_adapter_sdxl.py \
  --gradient_accumulation_steps=4
```

Many of the basic and important parameters are described in the [Text-to-image](text2image#script-parameters) training guide, so this guide just focuses on the relevant T2I-Adapter parameters:

- `--pretrained_vae_model_name_or_path`: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows you to specify a better [VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)
- `--crops_coords_top_left_h` and `--crops_coords_top_left_w`: height and width coordinates to include in SDXL's crop coordinate embeddings
- `--conditioning_image_column`: the column of the conditioning images in the dataset
- `--proportion_empty_prompts`: the proportion of image prompts to replace with empty strings

## Training script

As with the script parameters, a walkthrough of the training script is provided in the
[Text-to-image](text2image#training-script) training guide. Instead, this guide takes a look at the parts of the script that are relevant to the T2I-Adapter.

The training script begins by preparing the dataset. This includes [tokenizing](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L674) the prompt and [applying transforms](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L714) to the images and conditioning images.

```py
conditioning_image_transforms = transforms.Compose(
    [
        transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(args.resolution),
        transforms.ToTensor(),
    ]
)
```

Within the [`main()`](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L770) function, the T2I-Adapter is either loaded from a pretrained adapter or randomly initialized:

```py
if args.adapter_model_name_or_path:
    logger.info("Loading existing adapter weights.")
    t2iadapter = T2IAdapter.from_pretrained(args.adapter_model_name_or_path)
else:
    logger.info("Initializing t2iadapter weights.")
    t2iadapter = T2IAdapter(
        in_channels=3,
        channels=(320, 640, 1280, 1280),
        num_res_blocks=2,
        downscale_factor=16,
        adapter_type="full_adapter_xl",
    )
```

The [optimizer](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L952) is initialized for the T2I-Adapter parameters:

```py
params_to_optimize = t2iadapter.parameters()
optimizer = optimizer_class(
    params_to_optimize,
    lr=args.learning_rate,
    betas=(args.adam_beta1, args.adam_beta2),
    weight_decay=args.adam_weight_decay,
    eps=args.adam_epsilon,
)
```

Lastly, in the [training
loop](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L1086), the adapter conditioning image and the text embeddings are passed to the UNet to predict the noise residual:

```py
# Run the conditioning image through the adapter to get the extra down-block residuals
t2iadapter_image = batch["conditioning_pixel_values"].to(dtype=weight_dtype)
down_block_additional_residuals = t2iadapter(t2iadapter_image)
down_block_additional_residuals = [
    sample.to(dtype=weight_dtype) for sample in down_block_additional_residuals
]

# Predict the noise residual from the noisy latents, text embeddings, and adapter residuals
model_pred = unet(
    inp_noisy_latents,
    timesteps,
    encoder_hidden_states=batch["prompt_ids"],
    added_cond_kwargs=batch["unet_added_conditions"],
    down_block_additional_residuals=down_block_additional_residuals,
).sample
```

If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.

## Launch the script

Now you're ready to launch the training script! 🚀

For this example training, you'll use the [fusing/fill50k](https://huggingface.co/datasets/fusing/fill50k) dataset. You can also create and use your own dataset if you want (see the [Create a dataset for training](https://moon-ci-docs.huggingface.co/docs/diffusers/pr_5512/en/training/create_dataset) guide).

Set the environment variable `MODEL_DIR` to a model id on the Hub or a path to a local model, and set `OUTPUT_DIR` to where you want to save the model.

Download the following images to condition your training with:

```bash
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png
```

To monitor training progress with Weights & Biases, add the `--report_to=wandb` parameter to the training command.
You'll also need to add the `--validation_image`, `--validation_prompt`, and `--validation_steps` parameters to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results.

```bash
export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="path to save model"

accelerate launch train_t2i_adapter_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --dataset_name=fusing/fill50k \
  --mixed_precision="fp16" \
  --resolution=1024 \
  --learning_rate=1e-5 \
  --max_train_steps=15000 \
  --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
  --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
  --validation_steps=100 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --report_to="wandb" \
  --seed=42 \
  --push_to_hub
```

Once training is complete, you can use your T2I-Adapter for inference:

```py
from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter, EulerAncestralDiscreteScheduler
from diffusers.utils import load_image
import torch

# Load the trained adapter and attach it to the SDXL base pipeline
adapter = T2IAdapter.from_pretrained("path/to/adapter", torch_dtype=torch.float16)
pipeline = StableDiffusionXLAdapterPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", adapter=adapter, torch_dtype=torch.float16
)

pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config)
# Requires the xformers package to be installed
pipeline.enable_xformers_memory_efficient_attention()
pipeline.enable_model_cpu_offload()

control_image = load_image("./conditioning_image_1.png")
prompt = "pale golden rod circle with old lace background"

# Fix the seed so the result is reproducible
generator = torch.manual_seed(0)
image = pipeline(
    prompt, image=control_image, generator=generator
).images[0]
image.save("./output.png")
```

## Next steps

Congratulations on training a T2I-Adapter model!
🎉 To learn more:

- Read the [Efficient Controllable Generation for SDXL with T2I-Adapters](https://huggingface.co/blog/t2i-sdxl-adapters) blog post to learn more details about the experimental results from the T2I-Adapter team.
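If you plan to experiment with the `--proportion_empty_prompts` parameter described under Script parameters, it may help to see the idea in isolation. The sketch below replaces each caption with an empty string with a given probability, which is the general caption-dropout technique; it is not copied from the training script, and `drop_prompts` is a hypothetical helper written for this illustration. Training on some empty prompts is commonly done so the model also learns to generate without text conditioning.

```py
import random


def drop_prompts(captions, proportion_empty_prompts, rng):
    """Replace each caption with "" with probability `proportion_empty_prompts`."""
    return [
        "" if rng.random() < proportion_empty_prompts else caption
        for caption in captions
    ]


captions = [
    "red circle with blue background",
    "cyan circle with brown floral background",
]

# A seeded generator keeps the illustration reproducible
rng = random.Random(42)

# A proportion of 0.0 keeps every caption; 1.0 empties all of them
print(drop_prompts(captions, 0.0, rng))
print(drop_prompts(captions, 1.0, rng))

# With a proportion in between, roughly that fraction of captions is emptied
many = ["a caption"] * 10_000
dropped = drop_prompts(many, 0.5, rng)
print(sum(caption == "" for caption in dropped) / len(many))
```

The selection is random per caption, so the emptied fraction only approaches the requested proportion over many examples.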