arXiv:2510.06590

Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Published on Oct 8
· Submitted by zheng on Oct 9
#2 Paper of the day
Abstract

AI-generated summary

MingTok, a continuous latent space visual tokenizer, unifies vision-language understanding and generation within an autoregressive framework, achieving state-of-the-art performance across both domains.

Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where quantization errors can limit semantic expressiveness and degrade vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. To reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation, and editing. Empirically, we find that a unified continuous visual representation reconciles the competing requirements that understanding and generation tasks place on the tokenizer, leading to state-of-the-art performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.
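
The three-stage design described in the abstract (low-level encoding, semantic expansion, visual reconstruction) and the continuous next-token objective can be pictured with a minimal PyTorch sketch. Every module name, dimension, and the toy regression loss below are illustrative assumptions, not the paper's actual architecture; see the released code for the real implementation.

```python
# Hedged sketch of a three-stage continuous tokenizer in the spirit of
# MingTok. All module names, dims, and patch sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowLevelEncoder(nn.Module):
    """Stage 1: compress image patches into compact continuous latents
    (the low-level codes that generation prefers)."""
    def __init__(self, patch=16, low_dim=32):
        super().__init__()
        self.proj = nn.Conv2d(3, low_dim, kernel_size=patch, stride=patch)

    def forward(self, img):                    # img: (B, 3, H, W)
        z = self.proj(img)                     # (B, low_dim, H/p, W/p)
        return z.flatten(2).transpose(1, 2)    # (B, N, low_dim) token sequence

class SemanticExpander(nn.Module):
    """Stage 2: expand compact latents into high-dimensional semantic
    features (the discriminative features understanding prefers)."""
    def __init__(self, low_dim=32, sem_dim=512, depth=2):
        super().__init__()
        self.inp = nn.Linear(low_dim, sem_dim)
        layer = nn.TransformerEncoderLayer(sem_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, z):                      # z: (B, N, low_dim)
        return self.blocks(self.inp(z))        # (B, N, sem_dim)

class VisualReconstructor(nn.Module):
    """Stage 3: decode semantic features back into pixels."""
    def __init__(self, sem_dim=512, patch=16):
        super().__init__()
        self.patch = patch
        self.out = nn.Linear(sem_dim, 3 * patch * patch)

    def forward(self, feats, out_hw):          # feats: (B, N, sem_dim)
        x = self.out(feats).transpose(1, 2)    # (B, 3*p*p, N)
        return F.fold(x, out_hw, kernel_size=self.patch, stride=self.patch)

img = torch.randn(1, 3, 64, 64)
encode, expand, recon = LowLevelEncoder(), SemanticExpander(), VisualReconstructor()
z = encode(img)                    # compact latents: generation operates here
feats = expand(z)                  # semantic features: understanding operates here
rebuilt = recon(feats, (64, 64))   # pixels reconstructed from the features

# With continuous latents, "next-token prediction" becomes regression rather
# than classification over a discrete codebook; this linear head and MSE loss
# are toy stand-ins for whatever head/objective the paper actually uses.
head = nn.Linear(32, 32)
loss = F.mse_loss(head(z[:, :-1]), z[:, 1:])
print(z.shape, feats.shape, rebuilt.shape, loss.item())
```

The key point of the sketch is the ordering: generation can stop at the compact stage-1 latents, while understanding reads the expanded stage-2 features, so one tokenizer serves both without quantization.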

Community

Paper author · Paper submitter

Introducing Ming-UniVision & MingTok: the first autoregressive model to natively unify vision understanding and generation in a unified continuous representation space.

Code: https://github.com/inclusionAI/Ming-UniVision
Blog: https://inclusionai.github.io/blog/mingtok/
ModelScope: https://www.modelscope.cn/models/inclusionAI/Ming-UniVision-16B-A3B
Hugging Face: https://huggingface.co/inclusionAI/Ming-UniVision-16B-A3B
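
For anyone who just wants the released weights locally, here is a minimal sketch using huggingface_hub's snapshot_download; the repo id is taken from the Hugging Face link above. How to load and run the checkpoint is defined by the GitHub repository and is not assumed here.

```python
# Download the released Ming-UniVision checkpoint from the Hugging Face Hub.
# snapshot_download is a standard huggingface_hub call; loading and inference
# code lives in the GitHub repo linked above and is not sketched here.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="inclusionAI/Ming-UniVision-16B-A3B")
print("Checkpoint downloaded to:", local_dir)
```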


Models citing this paper: 2
Collections including this paper: 9