---
base_model:
- wusize/Harmon-1_5B
datasets:
- jackyhate/text-to-image-2M
- BLIP3o/BLIP3o-60k
language:
- en
- zh
license: apache-2.0
pipeline_tag: text-to-image
library_name: diffusers
---
# Harmon-1.5B-RecA-plus
The model was presented in the paper [Reconstruction Alignment Improves Unified Multimodal Models](https://huggingface.co/papers/2509.07295).
> A self-supervised training framework that aligns understanding and generation with modest compute, yielding large **zero-shot** gains in generation and editing capability.
This repository hosts the model weights for **Harmon-1.5B-RecA-plus**. For installation, usage instructions, and further documentation, please visit this project's [GitHub repository](https://github.com/HorizonWind2004/reconstruction-alignment).
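As a minimal starting point before following the GitHub instructions, the weights can be fetched with `huggingface_hub`. This is only a sketch: the repo id below is an assumption based on this card's name and collection, and may differ from the actual repository.

```python
# Minimal download sketch -- the repo id is assumed from the model name, not verified.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="sanaka87/Harmon-1.5B-RecA-plus",  # assumed repo id
    local_dir="./Harmon-1.5B-RecA-plus",
)
print(f"Weights saved to {local_dir}")
```

See the [GitHub repository](https://github.com/HorizonWind2004/reconstruction-alignment) for the supported loading and inference code.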
## Abstract
Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\rightarrow$0.90) and DPGBench (80.93$\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\rightarrow$3.75, GEdit 6.94$\rightarrow$7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
## 🧠 Method
[📄 Paper (PDF)](https://arxiv.org/pdf/2509.07295) · [arXiv](https://arxiv.org/abs/2509.07295) · [💻 GitHub](https://github.com/HorizonWind2004/reconstruction-alignment) · [🤗 Model Collection](https://huggingface.co/collections/sanaka87/realign-68ad2176380355a3dcedc068) · [🤗 Demo](https://huggingface.co/spaces/sanaka87/BAGEL-ReAlign) · [🌐 Project Page](https://reconstruction-alignment.github.io/)
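As described in the abstract, RecA conditions the model on its own visual understanding embeddings (used as a dense "text prompt") and trains it to reconstruct the input image. The sketch below illustrates one such post-training step. It is a minimal illustration under assumed names: the UMM interface (`understanding_encoder`, `generate_from_condition`) is hypothetical, and pixel MSE stands in for the backbone's native generation objective (e.g. a diffusion or masked-token loss).

```python
import torch
import torch.nn.functional as F

def reca_step(umm, image, optimizer):
    """One illustrative RecA post-training step (hypothetical interface)."""
    # 1. Encode the input image with the model's own visual understanding
    #    encoder; these dense embeddings replace a sparse text caption.
    with torch.no_grad():
        cond = umm.understanding_encoder(image)

    # 2. Generate a reconstruction conditioned on those embeddings,
    #    exactly where caption/text embeddings would normally go.
    recon = umm.generate_from_condition(cond)

    # 3. Self-supervised reconstruction loss against the original image
    #    (pixel MSE here as a stand-in for the backbone's native loss).
    loss = F.mse_loss(recon, image)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the supervision signal comes from the image itself, no captions are required, which is what keeps the post-training budget as low as the 27 GPU-hours reported in the paper.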
## 📊 Benchmarks
### 1. Visual Understanding
Visual understanding performance is unchanged after RecA post-training.
### 2. Text-to-Image Generation
Text-to-image generation is evaluated at 1024×1024 resolution.
| Model | GenEval ↑ | DPGBench ↑ | WISE ↑ |
| ------------ | --------- | --------- | --------- |
| **Harmon-1.5B** | 0.73 | 80.93 | 0.50 |
| **Harmon-1.5B-RecA-plus** | **0.90** | **88.15** | **0.52** |
## License
Harmon-1.5B-RecA-plus is licensed under the Apache 2.0 license.
## ✍️ Citation
If you find our work inspiring or use our codebase in your research, please consider giving the repository a star ⭐ and citing our paper:
```bibtex
@misc{xie2025reconstructionalignmentimprovesunified,
      title={Reconstruction Alignment Improves Unified Multimodal Models},
      author={Ji Xie and Trevor Darrell and Luke Zettlemoyer and XuDong Wang},
      year={2025},
      eprint={2509.07295},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.07295},
}
```