---
license: apache-2.0
datasets:
- Alex11556666/Reason_Tuning
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: text-to-image
---

# 💡 DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
<p align="left">
<a href="http://arxiv.org/abs/2602.12205">
<img
src="https://img.shields.io/badge/DeepGen 1.0-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
alt="DeepGen 1.0 Paper on arXiv"
/>
</a>
<a href="https://github.com/deepgenteam/deepgen" target="_blank" style="margin: 2px;">
<img
src="https://img.shields.io/badge/DeepGen 1.0-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
alt="DeepGen 1.0 Codebase"
/>
</a>
<a href="https://deepgenteam.github.io/" target="_blank" style="margin: 2px;">
<img
src="https://img.shields.io/badge/Website-project page-orange" style="display: inline-block; vertical-align: middle;"
alt="DeepGen 1.0 page"
/>
</a>
</p>
DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities (general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering) within a single model. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.
<p align="left"><img src="bubble_chart.png" width="80%"></p>

## 🧠 Method
Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts.
To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.
We further design a data-centric training strategy spanning three progressive stages: (1) **Alignment Pre-training** on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) **Joint Supervised Fine-tuning** on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) **Reinforcement Learning with MR-GRPO**, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts.

<p align="left"><img src="arch.png" width="80%"></p>

## 📊 Benchmarks

### 1. General Image Generation
| Model                 | Params      | Geneval ↑   | DPGBench ↑   | UniGenBench ↑ |
| --------------------- | ----------- | ----------- | ------------ | ------------- |
| OmniGen2              | 3B + 4B     | 0.80        | 83.57        | 63.09         |
| BAGEL                 | 14B         | 0.82        | 85.10        | 61.53         |
| X-Omni                | 7B + 12B    | 0.83        | 87.65 🥉     | 53.77         |
| Lumina-DiMOO          | 8B          | 0.88 🥇     | 86.04        | 71.12         |
| Hunyuan-Image-3.0     | 80B         | 0.72        | 86.10        | -             |
| Qwen-Image            | 7B + 20B    | 0.87 🥈     | 88.32 🥇     | 78.81 🥇      |
| LongCat-Image         | 7B + 6B     | 0.87 🥈     | 86.80        | -             |
| Z-Image-Turbo         | 4B + 6B     | 0.84        | 85.15        | 71.40         |
| GLM-Image             | 9B + 7B     | -           | 84.78        | -             |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.86 🥉     | 87.05        | 74.18 🥉      |
| **DeepGen 1.0 (RL)**  | **3B + 2B** | 0.87 🥈     | 87.90 🥈     | 75.74 🥈      |

### 2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| :--- | :--- | :--- | :--- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 7.12 | 4.09 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 7.17 🥉 | 4.14 🥉 |

### 3. Reasoning Image Generation
| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 🥉 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
| Z-Image-Turbo | 4B + 6B | - | 43.7 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.72 🥈 | 45.7 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.73 🥇 | 46.5 🥈 |

### 4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | - | 43.4 |
| BAGEL | 14B | 11.9 🥈 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 13.3 🥇 | 77.5 🥇 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 🥉 | 75.7 🥈 |

## 🎨 Quantitative results
<p align="left"><img src="teaser.png" width="80%"></p>

## 🛠️ Usage

### Merge ZIP Files
To use the DeepGen checkpoints, first merge the sharded ZIP files into a single archive. We release Pre-training, Supervised Fine-Tuning, and Reinforcement Learning checkpoints.

```bash
# Merge the ZIP parts into a single archive
cat DeepGen_CKPT.zip.part-* > DeepGen_CKPT.zip
# Unzip the DeepGen checkpoints
unzip DeepGen_CKPT.zip
```
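
The `cat` command above silently produces an empty or truncated archive if the parts are missing or incomplete. The following is a minimal sketch of a slightly more defensive merge, assuming the parts follow the `<archive>.part-*` naming shown above and that their suffixes sort correctly in glob order; `merge_parts` is a hypothetical helper name, not part of the DeepGen codebase.

```bash
#!/usr/bin/env bash
# Sketch only: merge split archive parts with a basic existence check.
# Assumption: parts are named <archive>.part-* and sort correctly in glob order.

merge_parts() {
    local archive="$1"                 # e.g. DeepGen_CKPT.zip
    local parts=("$archive".part-*)    # expands to the matching part files

    # If the glob matched nothing, bash keeps the literal pattern,
    # which does not exist as a file.
    if [ ! -e "${parts[0]:-}" ]; then
        echo "no parts found for $archive" >&2
        return 1
    fi

    # Concatenate the parts in glob (lexicographic) order.
    cat "${parts[@]}" > "$archive"
    echo "merged ${#parts[@]} part(s) into $archive"
}
```

One possible usage is `merge_parts DeepGen_CKPT.zip && unzip -t DeepGen_CKPT.zip && unzip DeepGen_CKPT.zip`, where `unzip -t` tests the archive's integrity before anything is extracted.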

After unzipping, the checkpoints are laid out as follows:

```text
checkpoints/
└── DeepGen_CKPT
    ├── Pretrain
    │   └── iter_200000.pth
    ├── SFT
    │   └── iter_400000.pth
    └── RL
        └── MR-GDPO_final.pt
```
If you only need the final model weights, use `model.pt` directly; it is identical to `MR-GDPO_final.pt`.
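
If you have downloaded both files and want to confirm they match, a byte-wise comparison is enough. This is a minimal sketch; `same_checkpoint` is a hypothetical helper name, and the paths you pass to it depend on where you placed the files.

```bash
# Sketch only: report whether two checkpoint files are byte-identical.
same_checkpoint() {
    local a="$1" b="$2"
    # cmp -s is silent and exits 0 only when the files are identical.
    if cmp -s "$a" "$b"; then
        echo "identical"
    else
        echo "different or unreadable"
        return 1
    fi
}
```

For example, `same_checkpoint model.pt checkpoints/DeepGen_CKPT/RL/MR-GDPO_final.pt`, assuming those are the locations of the two files on your machine.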

## ⭐ Citation
```bibtex
@article{wang2026deepgen,
title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
journal={arXiv preprint arXiv:2602.12205},
year={2026}
}
```