UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
Paper: arXiv:2601.03193
While Unified Multimodal Models (UMMs) excel at comprehension, they often suffer from Conduction Aphasia: the inability to translate internal knowledge into faithful generation.
UniCorn is a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. It partitions a single UMM into three collaborative roles (Proposer, Solver, and Judge) to distill latent understanding into explicit generative signals via self-play.
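The three roles compose into a self-play loop roughly like the minimal sketch below. The helper names (`propose_prompt`, `generate_image`, `judge_alignment`) and the score-threshold filter are illustrative assumptions, not the paper's exact interface.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class SelfPlaySample:
    prompt: str       # Proposer's self-generated prompt
    image: Any        # Solver's rendered image
    score: float      # Judge's prompt-image faithfulness score


def self_play_round(umm: Any, topics: List[str], threshold: float = 0.7) -> List[SelfPlaySample]:
    """One round of self-generated supervision: a single UMM plays all three roles."""
    supervision = []
    for topic in topics:
        # Proposer: the UMM writes a prompt probing its own internal knowledge.
        prompt = umm.propose_prompt(topic)           # hypothetical helper
        # Solver: the same UMM renders the prompt as an image.
        image = umm.generate_image(prompt)           # hypothetical helper
        # Judge: the UMM's comprehension side scores prompt-image faithfulness.
        score = umm.judge_alignment(prompt, image)   # hypothetical helper
        # Keep only pairs the model itself judges faithful; these become the
        # explicit generative training signal distilled from latent understanding.
        if score >= threshold:
            supervision.append(SelfPlaySample(prompt, image, score))
    return supervision
```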
To optimize generation quality and avoid common pitfalls like blurriness, follow these hyperparameter guidelines (a minimal inference sketch appears after the results table):

- `cfg_text_scale`: Use 4.0–8.0 for balanced prompt following.
- `cfg_renorm_type`: Use `global` for general text-to-image tasks.
- `timestep_shift`: Higher values for better layout; lower values for finer details.
- `num_timesteps`: Standard setting is 50.

UniCorn achieves substantial gains over the base model (e.g., +6.5 on OneIG, +5.0 on WISE).
| Model | TIIF (Short/Long) | WISE (Overall) | OneIG-EN (Overall) | CompBench (Overall) | DPG (Score) | Geneval (Score) |
|---|---|---|---|---|---|---|
| BAGEL | 71.0 / 71.8 | 50.0 | 36.1 | 82.2 | 84.0 | 78.0 |
| UniCorn | 74.7 / 72.9 | 55.0 | 42.6 | 88.5 | 86.8 | 82.0 |
| $\Delta$ (vs. BAGEL) | +3.7 / +1.1 | +5.0 | +6.5 | +6.3 | +2.8 | +4.0 |
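For reference, the hyperparameter guidance above maps onto an inference configuration roughly like the following sketch. The dictionary keys follow the parameters listed earlier, while the commented-out `generate_image` call and the concrete `timestep_shift` value are assumptions rather than the released API; check the official inference code for the actual entry point.

```python
# Hedged sketch of text-to-image settings using the recommended values.
generation_config = {
    "cfg_text_scale": 6.0,        # within the recommended 4.0-8.0 range for prompt following
    "cfg_renorm_type": "global",  # recommended for general text-to-image tasks
    "timestep_shift": 3.0,        # illustrative: higher favors layout, lower favors fine detail
    "num_timesteps": 50,          # standard setting
}

# Hypothetical call; the model object and method name are assumptions.
# image = model.generate_image(prompt="a corgi surfing a wave at sunset", **generation_config)
```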
@article{han2026unicorn,
title={UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision},
author={Han, Ruiyan and Fang, Zhen and Sun, Xinyu and Ma, Yuchen and Wang, Ziheng and Zeng, Yu and Chen, Zehui and Chen, Lin and Huang, Wenxuan and Xu, Wei-Jie and others},
journal={arXiv preprint arXiv:2601.03193},
year={2026}
}
This project is licensed under the Apache 2.0 License.