InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis
Abstract
InfGen, a one-step generator replacing the VAE decoder, enables arbitrary high-resolution image generation from a fixed-size latent, significantly reducing computational complexity and generation time.
Arbitrary-resolution image generation provides a consistent visual experience across devices and has extensive applications for both producers and consumers. In current diffusion models, computational demand grows quadratically with resolution, so generating a 4K image takes more than 100 seconds. To address this, we explore a second generation stage built on latent diffusion models: the fixed-size latent produced by the diffusion model is treated as the content representation, and a one-step generator decodes this compact latent into an image at arbitrary resolution. We present InfGen, which replaces the VAE decoder with the new generator and produces images at any resolution from a fixed-size latent without retraining the diffusion model. This simplifies the process, reduces computational complexity, and applies to any model that uses the same latent space. Experiments show that InfGen brings many models into the arbitrary high-resolution era while cutting 4K image generation time to under 10 seconds.
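To make the scaling argument concrete, here is a rough back-of-the-envelope sketch, not taken from the paper. It assumes a VAE that downsamples by 8x, a transformer patch size of 2, and a self-attention cost proportional to the square of the token count; the function names and default values are hypothetical and chosen only for illustration.

```python
def latent_tokens(height, width, vae_factor=8, patch=2):
    """Tokens a latent diffusion transformer would process for one image.

    vae_factor and patch are illustrative assumptions, not values from the paper.
    """
    stride = vae_factor * patch
    return (height // stride) * (width // stride)


def relative_attention_cost(height, width, base=512):
    """Self-attention cost relative to a base x base image.

    Assumes cost grows with the square of the token count.
    """
    ratio = latent_tokens(height, width) / latent_tokens(base, base)
    return ratio ** 2


if __name__ == "__main__":
    # Diffusing directly at higher resolution: the token count grows with area.
    for side in (512, 1024, 2048, 4096):
        print(side, latent_tokens(side, side), relative_attention_cost(side, side))

    # With a fixed-size latent, the diffusion stage always sees the base token
    # count, whatever the output resolution of the decoder that follows.
    print("fixed latent tokens:", latent_tokens(512, 512))
```

Under these assumptions, the cost of the diffusion stage stays constant when the latent size is fixed, and only the one-step decoder has to scale with the output resolution.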
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PixNerd: Pixel Neural Field Diffusion (2025)
- CineScale: Free Lunch in High-Resolution Cinematic Visual Generation (2025)
- GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation (2025)
- APT: Improving Diffusion Models for High Resolution Image Generation with Adaptive Path Tracing (2025)
- HiMat: DiT-based Ultra-High Resolution SVBRDF Generation (2025)
- Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis (2025)
- Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression (2025)
Very interesting paper. I do wonder if this method can be used for native low-resolution image generation too, such as pixel art. The lower end of the 'reliable exploration' is 256, but I'm wondering if sub-256 went unexplored because of an assumption that low-res images aren't desirable.
True arbitrary resolution should also generalize to the extreme low end, right?