Add metadata, link to paper

#1
by nielsr HF staff - opened
Files changed (1)
  1. README.md +150 -3
README.md CHANGED
@@ -1,3 +1,150 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ pipeline_tag: text-to-image
+ library_name: diffusers
+ ---
+
+ <div align="center">
+
+ <h2>⚡Reconstruction <i>vs.</i> Generation:
+
+ Taming Optimization Dilemma in Latent Diffusion Models</h2>
+
+ **_FID=1.35 on ImageNet-256 & 21.8x faster training than DiT!_**
+
+ [Jingfeng Yao](https://github.com/JingfengYao), [Xinggang Wang](https://xwcv.github.io/index.htm)*
+
+ Huazhong University of Science and Technology (HUST)
+
+ *Corresponding author: xgwang@hust.edu.cn
+
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/reconstruction-vs-generation-taming/image-generation-on-imagenet-256x256)](https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?p=reconstruction-vs-generation-taming)
+ [![license](https://img.shields.io/badge/license-MIT-blue)](LICENSE)
+ [![authors](https://img.shields.io/badge/by-hustvl-green)](https://github.com/hustvl)
+ [![paper](https://img.shields.io/badge/arXiv-VA_VAE-b31b1b.svg)](https://huggingface.co/papers/2501.01423)
+ [![arXiv](https://img.shields.io/badge/arXiv-FasterDiT-b31b1b.svg)](https://arxiv.org/abs/2410.10356)
+
+ </div>
+ <div align="center">
+ <img src="images/vis.png" alt="Visualization">
+ </div>
+
+ ## ✨ Highlights
+
+ - Latent diffusion system with 0.28 rFID and **1.35 FID on ImageNet-256** generation, **surpassing all published state-of-the-art results**!
+
+ - **More than 21.8× faster** convergence with **VA-VAE** and **LightningDiT** than the original DiT!
+
+ - **Surpasses DiT (FID=2.11) with only 8 GPUs in about 10 hours**. Let's make diffusion transformer research more affordable!
+
+ ## 📰 News
+
+ - **[2025.01.02]** We have released the pre-trained weights.
+
+ - **[2025.01.01]** We release the code and paper for VA-VAE and LightningDiT! The weights and pre-extracted latents will be released soon.
+
+ ## 📄 Introduction
+
+ Latent diffusion models (LDMs) with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an **optimization dilemma** in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance.
+ Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs.
+
+ We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces.
+ To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT.
+ The integrated system demonstrates remarkable training efficiency by reaching FID=2.11 in just 64 epochs -- an over 21× convergence speedup over the original DiT implementations, while achieving state-of-the-art performance on ImageNet-256 image generation with FID=1.35.
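+
+ The core idea of VA-VAE is an auxiliary alignment objective that ties the tokenizer's latents to features from a frozen vision foundation model (e.g. DINOv2). The Python snippet below is only a minimal, hypothetical sketch of such a loss; the names ``latents``, ``dino_feats``, and ``proj`` are illustrative and do not mirror the actual implementation in this repo.
+
+ ```
+ import torch
+ import torch.nn.functional as F
+
+ def alignment_loss(latents, dino_feats, proj):
+     """Hypothetical sketch: align VAE latents with frozen foundation-model features.
+
+     latents:    (B, C, H, W) latents from the VAE encoder
+     dino_feats: (B, N, D) frozen DINOv2 patch features, with N == H * W
+     proj:       learnable projection mapping C -> D
+     """
+     tokens = latents.flatten(2).transpose(1, 2)    # (B, H*W, C)
+     tokens = proj(tokens)                          # (B, H*W, D)
+     # Maximize per-token cosine similarity by minimizing (1 - cos).
+     cos = F.cosine_similarity(tokens, dino_feats, dim=-1)
+     return (1.0 - cos).mean()
+ ```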
+
+ ## 📝 Results
+
+ - State-of-the-art performance on ImageNet 256x256 with FID=1.35.
+ - Surpasses DiT within only 64 training epochs, achieving a 21.8x speedup.
+
+ <div align="center">
+ <img src="images/results.png" alt="Results">
+ </div>
+
+ ## 🎯 How to Use
+
+ ### Installation
+
+ ```
+ conda create -n lightningdit python=3.10.12
+ conda activate lightningdit
+ pip install -r requirements.txt
+ ```
+
+ ### Inference with Pre-trained Models
+
+ - Download the pre-trained weights and data statistics:
+
+ - Download the pre-trained models (the sketch after the table shows how to fetch them with ``huggingface_hub``):
+ | Tokenizer | Generation Model | FID (w/o CFG) | FID (w/ CFG) |
+ |:---------:|:----------------|:----:|:---:|
+ | [VA-VAE](https://huggingface.co/hustvl/vavae-imagenet256-f16d32-dinov2/blob/main/vavae-imagenet256-f16d32-dinov2.pt) | [LightningDiT-XL-800ep](https://huggingface.co/hustvl/lightningdit-xl-imagenet256-800ep/blob/main/lightningdit-xl-imagenet256-800ep.pt) | 2.17 | 1.35 |
+ | | [LightningDiT-XL-64ep](https://huggingface.co/hustvl/lightningdit-xl-imagenet256-64ep/blob/main/lightningdit-xl-imagenet256-64ep.pt) | 5.14 | 2.11 |
+
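+ If you prefer to fetch the checkpoints programmatically, the following sketch uses ``huggingface_hub`` with the repository and file names from the table above (shown here for the 800-epoch model):
+
+ ```
+ from huggingface_hub import hf_hub_download
+
+ # Tokenizer (VA-VAE) checkpoint
+ vae_ckpt = hf_hub_download(
+     repo_id="hustvl/vavae-imagenet256-f16d32-dinov2",
+     filename="vavae-imagenet256-f16d32-dinov2.pt",
+ )
+
+ # Generation model (LightningDiT-XL, 800 epochs) checkpoint
+ dit_ckpt = hf_hub_download(
+     repo_id="hustvl/lightningdit-xl-imagenet256-800ep",
+     filename="lightningdit-xl-imagenet256-800ep.pt",
+ )
+ ```
+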
+ - Download the [latent statistics](https://huggingface.co/hustvl/vavae-imagenet256-f16d32-dinov2/blob/main/latents_stats.pt). This file contains the channel-wise mean and standard deviation of the latents (see the sketch below).
+
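+ How these statistics are consumed depends on the training and sampling scripts; the sketch below only illustrates the typical channel-wise normalization they enable, and it assumes the file stores per-channel ``mean`` and ``std`` tensors under those keys (the actual keys may differ).
+
+ ```
+ import torch
+
+ # Hypothetical sketch: load channel-wise statistics and normalize latents with them.
+ stats = torch.load("latents_stats.pt", map_location="cpu")
+ mean, std = stats["mean"], stats["std"]   # assumed keys; check the file contents
+
+ def normalize_latents(z):
+     # z: (B, C, H, W) latents from the VA-VAE encoder
+     return (z - mean.view(1, -1, 1, 1)) / std.view(1, -1, 1, 1)
+
+ def denormalize_latents(z_norm):
+     return z_norm * std.view(1, -1, 1, 1) + mean.view(1, -1, 1, 1)
+ ```
+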
+ - Modify the config file in ``configs/reproductions`` as required.
+
+ - Fast sample demo images:
+
+ Run:
+ ```
+ bash run_fast_inference.sh ${config_path}
+ ```
+ Images will be saved to ``demo_images/demo_samples.png``, e.g. the following one:
+ <div align="center">
+ <img src="images/demo_samples.png" alt="Demo Samples" width="600">
+ </div>
+
+ - Sample for FID-50k evaluation:
+
+ Run:
+ ```
+ bash run_inference.sh ${config_path}
+ ```
+ NOTE: The FID reported by this script serves only as a reference value. The final FID-50k reported in the paper is computed with the ADM evaluation suite (a sketch of the expected npz format follows the commands):
+
+ ```
+ git clone https://github.com/openai/guided-diffusion.git
+
+ # save your npz file with tools/save_npz.py
+ bash run_fid_eval.sh /path/to/your.npz
+ ```
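+
+ ``tools/save_npz.py`` packs the generated images for the ADM evaluator and should be preferred for reproducing the paper's numbers. As a rough, hypothetical sketch of the format that evaluator expects (a single uint8 array of shape (N, 256, 256, 3), stored as the first array of the npz):
+
+ ```
+ import numpy as np
+ from pathlib import Path
+ from PIL import Image
+
+ # Rough sketch: collect generated PNGs into one npz for the ADM FID evaluation.
+ sample_dir = Path("samples")   # hypothetical folder of generated 256x256 images
+ images = [np.array(Image.open(p).convert("RGB")) for p in sorted(sample_dir.glob("*.png"))]
+ arr = np.stack(images).astype(np.uint8)   # (N, 256, 256, 3)
+ np.savez("samples_50k.npz", arr)          # stored as arr_0
+ ```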
+
+ ## 🎮 Train Your Own Models
+
+ - **We provide a 👆[detailed tutorial](docs/tutorial.md) for training your own models to a 2.1 FID score within only 64 epochs. It takes only about 10 hours with 8 x H800 GPUs.**
+
+ ## ❤️ Acknowledgements
+
+ This repo is mainly built on [DiT](https://github.com/facebookresearch/DiT), [fast-DiT](https://github.com/chuanyangjin/fast-DiT) and [SiT](https://github.com/willisma/SiT). Our VA-VAE code is mainly built on [LDM](https://github.com/CompVis/latent-diffusion) and [MAR](https://github.com/LTH14/mar). Thanks for all these great works.
+
+ ## 📝 Citation
+
+ If you find our work useful, please consider citing our related papers:
+
+ ```
+ # arXiv preprint
+ @article{vavae,
+   title={Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models},
+   author={Yao, Jingfeng and Wang, Xinggang},
+   journal={arXiv preprint arXiv:2501.01423},
+   year={2025}
+ }
+
+ # NeurIPS 2024
+ @article{fasterdit,
+   title={FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification},
+   author={Yao, Jingfeng and Wang, Cheng and Liu, Wenyu and Wang, Xinggang},
+   journal={arXiv preprint arXiv:2410.10356},
+   year={2024}
+ }
+ ```