
Hunyuan-DiT

Figure: examples of Chinese elements understanding.

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding from Tencent Hunyuan.

The abstract from the paper is:

We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models.

You can find the original codebase at Tencent/HunyuanDiT and all the available checkpoints at Tencent-Hunyuan.

Highlights: HunyuanDiT supports Chinese/English-to-image and multi-resolution generation.

HunyuanDiT has the following components:

  • It uses a diffusion transformer as the backbone
  • It combines two text encoders, a bilingual CLIP and a multilingual T5 encoder
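
For example, you can verify these components on a loaded pipeline. This is a minimal sketch assuming the standard diffusers attribute names, with the bilingual CLIP encoder exposed as text_encoder and the multilingual T5 encoder as text_encoder_2:

import torch
from diffusers import HunyuanDiTPipeline

# Load on the CPU just to inspect the components
pipeline = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
)
print(type(pipeline.transformer).__name__)     # diffusion transformer backbone
print(type(pipeline.text_encoder).__name__)    # bilingual CLIP text encoder
print(type(pipeline.text_encoder_2).__name__)  # multilingual T5 encoder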

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
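
Swapping schedulers follows the usual diffusers pattern. Continuing from the pipeline loaded above, a minimal sketch, assuming DPMSolverMultistepScheduler is a good fit here (verify output quality for your prompts):

from diffusers import DPMSolverMultistepScheduler

# Replace the default scheduler while reusing its existing configuration
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)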

Optimization

You can optimize the pipeline's runtime and memory consumption with torch.compile and feed-forward chunking. To learn about other optimization methods, check out the Speed up inference and Reduce memory usage guides.

Inference

Use torch.compile to reduce the inference latency.

First, load the pipeline:

from diffusers import HunyuanDiTPipeline
import torch 

pipeline = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
).to("cuda")

Then change the memory layout of the pipeline's transformer and vae components to torch.channels_last:

pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)

Finally, compile the components and run inference:

pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True)
pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)

image = pipeline(prompt="一个宇航员在骑马").images[0]  # "an astronaut riding a horse"
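
Because HunyuanDiT supports multi-resolution generation, you can also pass an explicit height and width to the same pipeline call; a minimal sketch, assuming 768x1280 is among the supported resolutions:

# Request a landscape resolution instead of the default square output;
# a new resolution may trigger another torch.compile compilation pass
image = pipeline(prompt="一个宇航员在骑马", height=768, width=1280).images[0]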

The benchmark results on an 80GB A100 machine are:

With torch.compile(): Average inference time: 12.470 seconds.
Without torch.compile(): Average inference time: 20.570 seconds.
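
As a hypothetical sketch of how such an average can be measured, time the pipeline after a warmup call (the first call after torch.compile is dominated by compilation):

import time

prompt = "一个宇航员在骑马"
pipeline(prompt=prompt)  # warmup: triggers compilation

times = []
for _ in range(5):
    start = time.perf_counter()
    pipeline(prompt=prompt)
    times.append(time.perf_counter() - start)
print(f"Average inference time: {sum(times) / len(times):.3f} seconds")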

Memory optimization

By loading the T5 text encoder in 8 bits, you can run the pipeline in just under 6 GB of GPU VRAM. Refer to this script for details.
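
A minimal sketch of the idea, assuming bitsandbytes is installed and the checkpoint stores the T5 encoder under the text_encoder_2 subfolder; exact device placement for mixed 8-bit/fp16 components depends on your accelerate and bitsandbytes versions, so treat the script mentioned above as the full recipe:

import torch
from transformers import BitsAndBytesConfig, T5EncoderModel
from diffusers import HunyuanDiTPipeline

# Load only the T5 text encoder in 8-bit precision
text_encoder_2 = T5EncoderModel.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers",
    subfolder="text_encoder_2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

pipeline = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers",
    text_encoder_2=text_encoder_2,
    torch_dtype=torch.float16,
)
# Offload idle components to the CPU to stay within the memory budget
pipeline.enable_model_cpu_offload()

image = pipeline(prompt="一个宇航员在骑马").images[0]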

Furthermore, you can use the [~HunyuanDiT2DModel.enable_forward_chunking] method to reduce memory usage. Feed-forward chunking runs the feed-forward layers in a transformer block in a loop instead of all at once. This gives you a trade-off between memory consumption and inference runtime.

pipeline.transformer.enable_forward_chunking(chunk_size=1, dim=1)

HunyuanDiTPipeline

[[autodoc]] HunyuanDiTPipeline
  - all
  - __call__