Add metadata, link to paper
This PR adds the pipeline tag, to help users find the model at https://huggingface.co/models?pipeline_tag=text-to-image, as well as a link to the paper page.

README.md CHANGED
---
license: mit
pipeline_tag: text-to-image
library_name: diffusers
---

<div align="center">

<h2>⚡Reconstruction <i>vs.</i> Generation:
Taming Optimization Dilemma in Latent Diffusion Models</h2>

**_FID=1.35 on ImageNet-256 & 21.8x faster training than DiT!_**

[Jingfeng Yao](https://github.com/JingfengYao), [Xinggang Wang](https://xwcv.github.io/index.htm)*

Huazhong University of Science and Technology (HUST)

*Corresponding author: xgwang@hust.edu.cn

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/reconstruction-vs-generation-taming/image-generation-on-imagenet-256x256)](https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=reconstruction-vs-generation-taming)
<!-- [![arXiv](https://img.shields.io/badge/arXiv-VA_VAE-b31b1b.svg)]()
[![arXiv](https://img.shields.io/badge/arXiv-FasterDiT-b31b1b.svg)](https://arxiv.org/abs/2410.10356) -->
[![license](https://img.shields.io/badge/license-MIT-blue)](LICENSE)
[![authors](https://img.shields.io/badge/by-hustvl-green)](https://github.com/hustvl)
[![paper](https://img.shields.io/badge/arXiv-VA_VAE-b31b1b.svg)](https://huggingface.co/papers/2501.01423)
[![arXiv](https://img.shields.io/badge/arXiv-FasterDiT-b31b1b.svg)](https://arxiv.org/abs/2410.10356)

</div>

<div align="center">
<img src="images/vis.png" alt="Visualization">
</div>

## ✨ Highlights

- Latent diffusion system with 0.28 rFID and **1.35 FID on ImageNet-256** generation, **surpassing all published state-of-the-art results**!

- **More than 21.8× faster** convergence with **VA-VAE** and **LightningDiT** than the original DiT!

- **Surpasses DiT, reaching FID=2.11, with only 8 GPUs in about 10 hours**. Let's make diffusion transformer research more affordable!

## 📰 News

- **[2025.01.02]** We have released the pre-trained weights.

- **[2025.01.01]** We release the code and paper for VA-VAE and LightningDiT! The weights and pre-extracted latents will be released soon.

## 📄 Introduction

Latent diffusion models (LDMs) with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an **optimization dilemma** in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs.

We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system demonstrates remarkable training efficiency by reaching FID=2.11 in just 64 epochs -- an over 21× convergence speedup over the original DiT implementations, while achieving state-of-the-art performance on ImageNet-256 image generation with FID=1.35.
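
To make the alignment idea concrete, here is a minimal sketch of a vision-foundation-model alignment term, assuming a learnable 1x1 projection and frozen foundation features (e.g. DINOv2) resized to the latent's spatial grid. It is illustrative only, not the repository's actual training code, and the names (`vf_alignment_loss`, `proj`) are hypothetical:

```
import torch
import torch.nn.functional as F

def vf_alignment_loss(vae_latent, foundation_feat, proj):
    # vae_latent:      (B, C_lat, H, W) latent from the tokenizer's encoder
    # foundation_feat: (B, C_vf, H, W) frozen vision-foundation-model features
    # proj:            learnable 1x1 conv mapping C_lat -> C_vf
    pred = proj(vae_latent)                                   # (B, C_vf, H, W)
    cos = F.cosine_similarity(pred, foundation_feat, dim=1)   # (B, H, W)
    return (1.0 - cos).mean()                                 # 0 when perfectly aligned

# Hypothetical usage inside tokenizer training:
# proj = torch.nn.Conv2d(32, 768, kernel_size=1)
# loss = recon_loss + kl_loss + lambda_vf * vf_alignment_loss(z, dino_feat, proj)
```

The point of such a term is that it only regularizes the latent space; the reconstruction objective itself is left unchanged.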

## 📝 Results

- State-of-the-art performance on ImageNet 256x256 with FID=1.35.
- Surpasses DiT within only 64 training epochs, achieving a 21.8x speedup.

<div align="center">
<img src="images/results.png" alt="Results">
</div>

## 🎯 How to Use

### Installation

```
conda create -n lightningdit python=3.10.12
conda activate lightningdit
pip install -r requirements.txt
```
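
Optionally, you can verify that PyTorch sees your GPUs before moving on (a quick sanity check, not part of the official setup):

```
import torch

# Confirm the PyTorch build and that CUDA devices are visible.
print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())
```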

### Inference with Pre-trained Models

- Download weights and data infos:

  - Download the pre-trained models:

    | Tokenizer | Generation Model | FID | FID (with cfg) |
    |:---------:|:----------------|:----:|:---:|
    | [VA-VAE](https://huggingface.co/hustvl/vavae-imagenet256-f16d32-dinov2/blob/main/vavae-imagenet256-f16d32-dinov2.pt) | [LightningDiT-XL-800ep](https://huggingface.co/hustvl/lightningdit-xl-imagenet256-800ep/blob/main/lightningdit-xl-imagenet256-800ep.pt) | 2.17 | 1.35 |
    | | [LightningDiT-XL-64ep](https://huggingface.co/hustvl/lightningdit-xl-imagenet256-64ep/blob/main/lightningdit-xl-imagenet256-64ep.pt) | 5.14 | 2.11 |
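
    If you prefer to script the downloads, one option is a sketch like the following using `huggingface_hub` (the repo IDs and filenames come from the table above; downloading manually works just as well):

    ```
    from huggingface_hub import hf_hub_download

    # Tokenizer (VA-VAE) and generator (LightningDiT-XL, 800-epoch) checkpoints.
    vae_ckpt = hf_hub_download(
        repo_id="hustvl/vavae-imagenet256-f16d32-dinov2",
        filename="vavae-imagenet256-f16d32-dinov2.pt",
    )
    dit_ckpt = hf_hub_download(
        repo_id="hustvl/lightningdit-xl-imagenet256-800ep",
        filename="lightningdit-xl-imagenet256-800ep.pt",
    )
    ```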

  - Download the [latent statistics](https://huggingface.co/hustvl/vavae-imagenet256-f16d32-dinov2/blob/main/latents_stats.pt). This file contains the channel-wise mean and standard deviation statistics.
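
    Such statistics are typically used to normalize latents channel-wise before diffusion training and to de-normalize them before decoding. A minimal sketch of that usage (the key names inside ``latents_stats.pt`` are assumptions; check the repository's data-loading code for the real ones):

    ```
    import torch

    stats = torch.load("latents_stats.pt", map_location="cpu")
    mean, std = stats["mean"], stats["std"]   # assumed keys, broadcastable to (B, C, H, W)

    def normalize(z):
        return (z - mean) / std               # latent fed to the diffusion model

    def denormalize(z):
        return z * std + mean                 # latent passed back to the VAE decoder
    ```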

- Modify the config file in ``configs/reproductions`` as required.

- Fast sample demo images:

  Run:
  ```
  bash run_fast_inference.sh ${config_path}
  ```
  Images will be saved to ``demo_images/demo_samples.png``, e.g. the following one:

  <div align="center">
  <img src="images/demo_samples.png" alt="Demo Samples" width="600">
  </div>

- Sample for FID-50k evaluation:

  Run:
  ```
  bash run_inference.sh ${config_path}
  ```
  NOTE: The FID reported by this script is only a reference value. The final FID-50k reported in the paper is evaluated with the ADM evaluation suite:

  ```
  git clone https://github.com/openai/guided-diffusion.git

  # save your npz file with tools/save_npz.py
  bash run_fid_eval.sh /path/to/your.npz
  ```
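
  For reference, the ADM evaluator expects an ``.npz`` file containing a single uint8 array of generated samples. A rough illustration of that layout is below; ``tools/save_npz.py`` in this repo is the supported way to produce it, and the ``samples/`` directory here is only a placeholder:

  ```
  import glob

  import numpy as np
  from PIL import Image

  # Stack the 50,000 generated PNGs into one (N, 256, 256, 3) uint8 array.
  files = sorted(glob.glob("samples/*.png"))
  arr = np.stack([np.array(Image.open(f).convert("RGB")) for f in files])
  assert arr.dtype == np.uint8 and arr.shape[1:] == (256, 256, 3)
  np.savez("samples_50k.npz", arr_0=arr)
  ```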

## 🎮 Train Your Own Models

- **We provide a 👆[detailed tutorial](docs/tutorial.md) for training your own models to a 2.1 FID score within only 64 epochs. It takes only about 10 hours with 8 x H800 GPUs. Let's make diffusion transformer research more affordable!**

## ❤️ Acknowledgements

This repo is mainly built on [DiT](https://github.com/facebookresearch/DiT), [FastDiT](https://github.com/chuanyangjin/fast-DiT) and [SiT](https://github.com/willisma/SiT). Our VA-VAE code is mainly built on [LDM](https://github.com/CompVis/latent-diffusion) and [MAR](https://github.com/LTH14/mar). Thanks for all these great works.

## 📝 Citation

If you find our work useful, please consider citing our related papers:

```
# arXiv preprint
@article{vavae,
  title={Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models},
  author={Yao, Jingfeng and Wang, Xinggang},
  journal={arXiv preprint arXiv:2501.01423},
  year={2025}
}

# NeurIPS 2024
@article{fasterdit,
  title={FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification},
  author={Yao, Jingfeng and Wang, Cheng and Liu, Wenyu and Wang, Xinggang},
  journal={arXiv preprint arXiv:2410.10356},
  year={2024}
}
```