Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
Abstract
In the rapidly advancing field of image generation, Visual Auto-Regressive (VAR) modeling has garnered considerable attention for its innovative next-scale prediction approach. This paradigm offers substantial improvements in efficiency, scalability, and zero-shot generalization. Yet, the inherently coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to prohibitive memory consumption and computational redundancies. To address these bottlenecks, we propose Collaborative Decoding (CoDe), a novel efficient decoding strategy tailored for the VAR framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. Based on these insights, we partition the multi-scale inference process into a seamless collaboration between a large model and a small model. The large model serves as the 'drafter', specializing in generating low-frequency content at smaller scales, while the smaller model serves as the 'refiner', solely focusing on predicting high-frequency details at larger scales. This collaboration yields remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98. When drafting steps are further decreased, CoDe can achieve an impressive 2.9x acceleration ratio, reaching 41 images/s at 256x256 resolution on a single NVIDIA 4090 GPU, while preserving a commendable FID of 2.27. The code is available at https://github.com/czg1225/CoDe.
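To make the drafter/refiner split concrete, the sketch below shows how such a collaborative decoding loop could look in PyTorch-style pseudocode. It is a minimal illustration under assumptions, not the released CoDe implementation: the model objects, the `predict_scale` method, and the `sample_tokens` helper are hypothetical names standing in for VAR's per-scale autoregressive step and codebook sampling.

```python
import torch

def sample_tokens(logits, temperature=1.0):
    """Categorical sampling over the VQ codebook for one scale's token map.
    (Hypothetical helper; stands in for VAR's actual sampling logic.)"""
    probs = torch.softmax(logits / temperature, dim=-1)
    flat = probs.reshape(-1, probs.shape[-1])
    idx = torch.multinomial(flat, num_samples=1)
    return idx.reshape(probs.shape[:-1])

@torch.no_grad()
def collaborative_decode(draft_model, refine_model, scales, num_draft_steps, cond):
    """Sketch of CoDe-style inference: a large 'drafter' predicts the first
    `num_draft_steps` (small) scales, then a small 'refiner' completes the
    remaining (large) scales, conditioned on the shared token-map prefix."""
    token_maps = []
    # 1) Drafting: the large model generates low-frequency content at small scales.
    for k in range(num_draft_steps):
        logits = draft_model.predict_scale(token_maps, scale=scales[k], cond=cond)
        token_maps.append(sample_tokens(logits))
    # 2) Refining: the small model fills in high-frequency details at large scales.
    for k in range(num_draft_steps, len(scales)):
        logits = refine_model.predict_scale(token_maps, scale=scales[k], cond=cond)
        token_maps.append(sample_tokens(logits))
    return token_maps  # decoded to an image by the shared multi-scale VQ decoder
```

The intuition, per the abstract, is that the largest scales contain most of the tokens but demand far fewer parameters, so routing them to a small refiner is where the speedup and memory savings come from.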
Community
GitHub: https://github.com/czg1225/CoDe
Project Page: https://czg1225.github.io/CoDe_page/
Model Weights: https://huggingface.co/Zigeng/VAR_CoDe
Related papers (recommended via the Semantic Scholar API):
- Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding (2024)
- LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding (2024)
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer (2024)
- CAR: Controllable Autoregressive Modeling for Visual Generation (2024)
- Denoising with a Joint-Embedding Predictive Architecture (2024)
- Randomized Autoregressive Visual Generation (2024)
- DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation (2024)