X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
Abstract
In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in auto-regressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large vision-language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.
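To make the idea of "a few examples as context" concrete, here is a minimal sketch of how an in-context prompt for an auto-regressive model over discrete text and image tokens might be assembled. It is an illustration only, not the X-Prompt implementation: the abstract does not specify the sequence layout, so the `Example` class, the `<image>` / `</image>` marker tokens, the whitespace tokenization, and the function names are all assumptions, and the feature-compression design mentioned above is not modeled.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical marker tokens (assumption; the real vocabulary is not given in the abstract).
BOI, EOI = "<image>", "</image>"


@dataclass
class Example:
    """One in-context demonstration: an instruction plus an input/output image pair."""
    instruction: str          # text describing the task
    source_tokens: List[str]  # discrete tokens standing in for the input image
    target_tokens: List[str]  # discrete tokens standing in for the output image


def wrap_image(tokens: List[str]) -> List[str]:
    """Surround image tokens with begin/end markers."""
    return [BOI, *tokens, EOI]


def build_in_context_prompt(
    examples: List[Example],
    query_instruction: str,
    query_tokens: List[str],
) -> List[str]:
    """Concatenate demonstrations and the query into one flat token sequence.

    The sequence ends with an open image marker, so a next-token predictor
    would continue by emitting the tokens of the query's output image.
    """
    seq: List[str] = []
    for ex in examples:
        seq += ex.instruction.split()       # toy whitespace tokenization (assumption)
        seq += wrap_image(ex.source_tokens)
        seq += wrap_image(ex.target_tokens)
    seq += query_instruction.split()
    seq += wrap_image(query_tokens)
    seq.append(BOI)                         # generation of the answer image starts here
    return seq


demo = Example("to depth", ["a1", "a2"], ["b1", "b2"])
prompt = build_in_context_prompt([demo], "to depth", ["c1", "c2"])
```

In this toy layout every extra demonstration adds its full input and output image to the sequence, so the context grows quickly with the number of examples. That growth is the kind of cost the abstract's compression of in-context features is described as addressing.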
Community
The first pure multimodal in-context image generation model based on Chameleon!
https://github.com/SunzeY/X-Prompt