One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
Abstract
FAE, a framework built around a feature auto-encoder with dual decoders, adapts pre-trained visual representations into a generation-friendly latent space, achieving strong performance on image generation tasks.
Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains near state-of-the-art FIDs of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE achieves state-of-the-art FIDs of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.
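To make the encoder-decoder coupling concrete, below is a minimal PyTorch-style sketch under assumed shapes and module names (`FeatureAutoEncoder`, `feat_dim=768`, `latent_dim=16`, and the decoder depth are illustrative choices, not values from the paper). A single transformer block stands in for the paper's single attention layer, a deeper stack reconstructs the original feature space, and a separate pixel decoder (omitted here) would render images from the reconstructed features while the diffusion or flow model is trained on the low-dimensional latents.

```python
# Minimal sketch of the FAE coupling described in the abstract.
# Shapes, dimensions, and module names are assumptions for illustration;
# this is not the authors' implementation.
import torch
import torch.nn as nn

class FeatureAutoEncoder(nn.Module):
    """Compress frozen ViT features (e.g., DINO/SigLIP tokens) into a
    low-dimensional latent with a single attention block, and map the
    latent back to the original feature space with a deeper decoder."""

    def __init__(self, feat_dim=768, latent_dim=16, dec_depth=6, heads=12):
        super().__init__()
        # Encoder: one transformer block (self-attention + FFN) followed by
        # a linear projection to the low-dimensional latent.
        self.enc_attn = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=heads, batch_first=True)
        self.to_latent = nn.Linear(feat_dim, latent_dim)
        # Decoder 1: a deep stack that reconstructs the original feature space.
        self.from_latent = nn.Linear(latent_dim, feat_dim)
        self.feat_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=heads,
                                       batch_first=True),
            num_layers=dec_depth)

    def encode(self, feats):
        # feats: (B, N, feat_dim) tokens from a frozen pre-trained encoder.
        return self.to_latent(self.enc_attn(feats))          # (B, N, latent_dim)

    def decode(self, latents):
        # latents: (B, N, latent_dim) -> reconstructed features (B, N, feat_dim),
        # which a separate pixel decoder would turn back into an image.
        return self.feat_decoder(self.from_latent(latents))


# Usage sketch: the generative model (diffusion or normalizing flow) is
# trained on the low-dimensional latents z produced by encode().
fae = FeatureAutoEncoder()
vit_tokens = torch.randn(2, 256, 768)   # frozen encoder output (B, N, D)
z = fae.encode(vit_tokens)              # generation-friendly latent
recon_feats = fae.decode(z)             # input to the pixel decoder
```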
Community
We propose FAE, which adapts pretrained ViTs as the latent space for visual generative models.
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- Adapting Self-Supervised Representations as a Latent Space for Efficient Generation (2025)
- Latent Diffusion Model without Variational Autoencoder (2025)
- Diffusion Transformers with Representation Autoencoders (2025)
- UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation (2025)
- Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models (2025)
- There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-training (2025)
- DINO-Tok: Adapting DINO for Visual Tokenizers (2025)