Abstract
In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) has propelled advances in artificial general intelligence through their versatile capabilities, these models still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared with previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any nature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, and instructional image editing.
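To make "a mixture of discrete tokens across four modalities using causal multimodal modeling" concrete, here is a minimal sketch of the general idea: non-text modalities are quantized into discrete codes, folded into one unified vocabulary alongside text, and the interleaved sequence is trained with ordinary next-token prediction. The vocabulary sizes, boundary tokens, and toy model below are illustrative placeholders, not MIO's actual tokenizers or architecture.

```python
# Illustrative sketch only: discrete-in, discrete-out causal modeling.
# Vocabulary sizes, boundary tokens, and the toy "model" are placeholders.
import torch
import torch.nn.functional as F

TEXT_VOCAB = 32000      # base text vocabulary (placeholder size)
IMG_CODES = 8192        # discrete image codebook (placeholder size)
SPEECH_CODES = 4096     # discrete speech codebook (placeholder size)
BOI = TEXT_VOCAB + IMG_CODES + SPEECH_CODES      # begin-of-image marker
EOI = BOI + 1                                    # end-of-image marker
VOCAB = EOI + 1                                  # one unified vocabulary

def to_unified_ids(text_ids, image_codes):
    """Interleave text tokens and shifted image codes into one causal sequence."""
    image_ids = [TEXT_VOCAB + c for c in image_codes]   # shift codes into the image range
    return text_ids + [BOI] + image_ids + [EOI]

# Toy example: a short caption followed by the image it describes.
seq = torch.tensor([to_unified_ids([17, 942, 5, 88], [3, 1027, 44, 512])])

# Any decoder-only LM over the unified vocabulary could be plugged in here;
# an embedding plus a linear head stands in for the transformer stack.
embed = torch.nn.Embedding(VOCAB, 64)
lm_head = torch.nn.Linear(64, VOCAB)
logits = lm_head(embed(seq))

# Standard next-token prediction: text and image tokens are all targets.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
print(float(loss))
```

Because every modality lives in the same discrete vocabulary, the same loss covers understanding and generation for all of them, which is what makes the any-to-any setting possible in a single autoregressive model.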
Community
MIO is a foundation model that integrates multimodal understanding and generation. It supports four modalities: image, video (frame sequences), speech, and text. MIO natively supports multimodal interleaved output and context-aware image generation (in contrast to purely descriptive image generation). It is also enhanced for interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, and instructional image editing, among other tasks.
Models | Emu1 | Emu2 | SEED-LLaMA | AnyGPT | CM3Leon, Chameleon | Gemini | Transfusion | MIO (ours) |
---|---|---|---|---|---|---|---|---|
I/O Consistency | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ |
Unified Bidirectional SFT | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ |
Multi-Task SFT | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ✔️ | ❌ | ✔️ |
Speech Input/Output | ❌/❌ | ❌/❌ | ❌/❌ | ✔️/✔️ | ❌/❌ | ✔️/❌ | ❌/❌ | ✔️/✔️ |
Video Input/Output | ✔️/✔️ | ✔️/✔️ | ✔️/✔️ | ❌/❌ | ❌/❌ | ✔️/❌ | ❌/❌ | ✔️/✔️ |
Voice Output | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
Multimodal Interleaved Output | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ✔️ |
Modeling | CICO | CICO | DIDO | DIDO | DIDO | CIDO | AR+Diff | DIDO |

Modeling abbreviations: CICO = continuous-in, continuous-out; CIDO = continuous-in, discrete-out; DIDO = discrete-in, discrete-out; AR+Diff = autoregression combined with diffusion.
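As an illustration of the "Multimodal Interleaved Output" row above: a model that emits one unified token stream needs a post-processing step that routes image-token spans to an image detokenizer and keeps the remaining tokens as text. The sketch below shows one simple way to perform that split; the boundary ids and chunk handling are hypothetical placeholders, not MIO's released interface.

```python
# Hypothetical post-processing of an interleaved token stream.
# BOI/EOI mark image spans; everything else is treated as text.
from typing import List, Tuple

BOI, EOI = -1, -2   # placeholder ids for image-span boundaries

def split_interleaved(ids: List[int]) -> List[Tuple[str, List[int]]]:
    """Split a generated id stream into ('text', ids) and ('image', ids) chunks."""
    chunks, buf, mode = [], [], "text"
    for tok in ids:
        if tok == BOI:                 # image span begins
            if buf:
                chunks.append((mode, buf))
            buf, mode = [], "image"
        elif tok == EOI:               # image span ends
            chunks.append((mode, buf))
            buf, mode = [], "text"
        else:
            buf.append(tok)
    if buf:
        chunks.append((mode, buf))
    return chunks

# Example: "text text [image tokens] text" -> three chunks.
stream = [11, 12, BOI, 901, 902, 903, EOI, 13]
for kind, toks in split_interleaved(stream):
    print(kind, toks)   # an image chunk would be sent to an image detokenizer
```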
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (2024)
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (2024)
- LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal! (2024)
- SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs (2024)
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (2024)