Abstract
In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) has propelled advances in artificial general intelligence through their versatile capabilities, these models still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared with previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any nature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, and instructional image editing.
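To make "a mixture of discrete tokens across four modalities using causal multimodal modeling" concrete, here is a minimal sketch of the general idea: non-text modalities are quantized into discrete codes, folded into one unified vocabulary alongside text, and the interleaved sequence is trained with ordinary next-token prediction. The vocabulary sizes, boundary tokens, and toy model below are illustrative placeholders, not MIO's actual tokenizers or architecture.

```python
# Illustrative sketch only: discrete-in, discrete-out causal modeling.
# Vocabulary sizes, boundary tokens, and the toy "model" are placeholders.
import torch
import torch.nn.functional as F

TEXT_VOCAB = 32000      # base text vocabulary (placeholder size)
IMG_CODES = 8192        # discrete image codebook (placeholder size)
SPEECH_CODES = 4096     # discrete speech codebook (placeholder size)
BOI = TEXT_VOCAB + IMG_CODES + SPEECH_CODES      # begin-of-image marker
EOI = BOI + 1                                    # end-of-image marker
VOCAB = EOI + 1                                  # one unified vocabulary

def to_unified_ids(text_ids, image_codes):
    """Interleave text tokens and shifted image codes into one causal sequence."""
    image_ids = [TEXT_VOCAB + c for c in image_codes]   # shift codes into the image range
    return text_ids + [BOI] + image_ids + [EOI]

# Toy example: a short caption followed by the image it describes.
seq = torch.tensor([to_unified_ids([17, 942, 5, 88], [3, 1027, 44, 512])])

# Any decoder-only LM over the unified vocabulary could be plugged in here;
# an embedding plus a linear head stands in for the transformer stack.
embed = torch.nn.Embedding(VOCAB, 64)
lm_head = torch.nn.Linear(64, VOCAB)
logits = lm_head(embed(seq))

# Standard next-token prediction: text and image tokens are all targets.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
print(float(loss))
```

Because every modality lives in the same discrete vocabulary, the same loss covers understanding and generation for all of them, which is what makes the any-to-any setting possible in a single autoregressive model.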
Community
MIO is a foundation model that integrates multimodal understanding and generation. It supports four modalities: image, video (frame sequences), speech, and text. MIO natively supports multimodal interleaved output and context-aware image generation (in contrast to purely descriptive image generation). It is also enhanced for interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, and instructional image editing, among other tasks.
Models | Emu1 | Emu2 | SEED-LLaMA | AnyGPT | CM3Leon, Chameleon | Gemini | Transfusion | MIO (ours) |
---|---|---|---|---|---|---|---|---|
I/O Consistency | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ |
Unified Bidirectional SFT | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ |
Multi-Task SFT | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ✔️ | ❌ | ✔️ |
Speech Input/Output | ❌/❌ | ❌/❌ | ❌/❌ | ✔️/✔️ | ❌/❌ | ✔️/❌ | ❌/❌ | ✔️/✔️ |
Video Input/Output | ✔️/✔️ | ✔️/✔️ | ✔️/✔️ | ❌/❌ | ❌/❌ | ✔️/❌ | ❌/❌ | ✔️/✔️ |
Voice Output | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✔️ |
Multimodal Interleaved Output | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ✔️ |
Modeling | CICO | CICO | DIDO | DIDO | DIDO | CIDO | AR+Diff | DIDO |

Modeling abbreviations: CICO = continuous-in, continuous-out; CIDO = continuous-in, discrete-out; DIDO = discrete-in, discrete-out; AR+Diff = autoregression combined with diffusion.
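As an illustration of the "Multimodal Interleaved Output" row above: a model that emits one unified token stream needs a post-processing step that routes image-token spans to an image detokenizer and keeps the remaining tokens as text. The sketch below shows one simple way to perform that split; the boundary ids and chunk handling are hypothetical placeholders, not MIO's released interface.

```python
# Hypothetical post-processing of an interleaved token stream.
# BOI/EOI mark image spans; everything else is treated as text.
from typing import List, Tuple

BOI, EOI = -1, -2   # placeholder ids for image-span boundaries

def split_interleaved(ids: List[int]) -> List[Tuple[str, List[int]]]:
    """Split a generated id stream into ('text', ids) and ('image', ids) chunks."""
    chunks, buf, mode = [], [], "text"
    for tok in ids:
        if tok == BOI:                 # image span begins
            if buf:
                chunks.append((mode, buf))
            buf, mode = [], "image"
        elif tok == EOI:               # image span ends
            chunks.append((mode, buf))
            buf, mode = [], "text"
        else:
            buf.append(tok)
    if buf:
        chunks.append((mode, buf))
    return chunks

# Example: "text text [image tokens] text" -> three chunks.
stream = [11, 12, BOI, 901, 902, 903, EOI, 13]
for kind, toks in split_interleaved(stream):
    print(kind, toks)   # an image chunk would be sent to an image detokenizer
```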
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (2024)
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (2024)
- LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal! (2024)
- SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs (2024)
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (2024)