Abstract
Heptapod, an image autoregressive model using causal attention and next 2D distribution prediction, achieves superior performance on ImageNet generation by combining sequential modeling with holistic self-supervised learning.
We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs causal attention, eliminates reliance on classifier-free guidance (CFG), and eschews the trend of semantic tokenizers. Our key innovation is next 2D distribution prediction: a causal Transformer with a reconstruction-focused visual tokenizer learns to predict the distribution over the entire 2D spatial grid of the image at each timestep. This objective unifies the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics through generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of 2.70, significantly outperforming previous causal autoregressive approaches. We hope this work inspires a principled rethinking of language modeling on visual signals and beyond.
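The next 2D distribution prediction objective described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's implementation: the class and function names (`Next2DDistributionHead`, `next_2d_loss`) are hypothetical, and the head design simply projects each causal timestep's hidden state to a distribution over token ids for every cell of the H×W grid, averaging cross-entropy over the full grid rather than over a single next token.

```python
import torch
import torch.nn as nn

class Next2DDistributionHead(nn.Module):
    """Hypothetical sketch: at each causal timestep, output a distribution
    over token ids for EVERY cell of the H x W token grid (an assumption
    about how the next-2D-distribution objective could be realized)."""

    def __init__(self, d_model: int, grid_h: int, grid_w: int, vocab_size: int):
        super().__init__()
        self.grid = grid_h * grid_w
        self.vocab = vocab_size
        # One linear head producing logits for all grid cells at once.
        self.proj = nn.Linear(d_model, self.grid * vocab_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, d_model) — per-timestep states of a causal Transformer.
        B, T, _ = hidden.shape
        # logits: (B, T, H*W, vocab) — a full-grid distribution per timestep.
        return self.proj(hidden).view(B, T, self.grid, self.vocab)

def next_2d_loss(logits: torch.Tensor, grid_targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the whole 2D grid at every timestep.

    logits:       (B, T, H*W, vocab)
    grid_targets: (B, H*W) — the full grid of ground-truth token ids.
    """
    B, T, HW, V = logits.shape
    # Each timestep is supervised against the entire grid of targets.
    targets = grid_targets.unsqueeze(1).expand(B, T, HW)
    return nn.functional.cross_entropy(
        logits.reshape(-1, V), targets.reshape(-1)
    )
```

In this sketch the same full-grid targets supervise every timestep, which is what ties the autoregressive (sequential) view to the masked-autoencoding (holistic) view: later timesteps see more context yet must still account for the whole image.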
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation (2025)
- Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer (2025)
- OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation (2025)
- REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization (2025)
- NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale (2025)
- Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models (2025)
- A Survey on Diffusion Language Models (2025)