arxiv:2508.04825

Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off

Published on Aug 6

Submitted by RyanL22 on Aug 11

#2 Paper of the day


Authors: Seungyong Lee, Jeong-gi Kwak

Abstract

Voost, a unified diffusion transformer framework, jointly learns virtual try-on and try-off, enhancing garment-body correspondence and achieving state-of-the-art results across benchmarks.

AI-generated summary

Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost, a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling, for robustness to resolution or mask variation, and self-corrective sampling, which leverages bidirectional consistency between the two tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
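The abstract names attention temperature scaling but does not give its formula here. As a rough illustration of the general idea only, the sketch below adds a temperature to standard scaled dot-product attention and shows that lowering it sharpens the attention weights. The function names, the parameter `tau`, and the array shapes are assumptions made for this example; this is not the paper's implementation.

```python
import numpy as np

def attention(q, k, v, tau=1.0):
    """Scaled dot-product attention with an extra temperature.

    `tau` is a hypothetical parameter for illustration: tau < 1 sharpens
    the attention distribution, tau > 1 flattens it.
    """
    d = q.shape[-1]
    logits = q @ k.T / (np.sqrt(d) * tau)
    # Subtract the row-wise max for numerical stability before exponentiating.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def mean_entropy(weights):
    """Average entropy of the attention rows (higher means more diffuse)."""
    return float(-(weights * np.log(weights + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16))    # 4 query tokens
k = rng.normal(size=(64, 16))   # 64 key tokens
v = rng.normal(size=(64, 16))

_, w_base = attention(q, k, v, tau=1.0)
_, w_sharp = attention(q, k, v, tau=0.5)
print(mean_entropy(w_base), mean_entropy(w_sharp))
```

With more key tokens (for example, a higher resolution or a larger mask), softmax attention tends to spread over more positions. A temperature of this kind is one plausible way to counteract that at inference time without retraining.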


Community

RyanL22 (paper author and submitter):

By learning virtual try-on and try-off jointly, we enable scalable training and significantly enhance garment–body correspondence. Voost achieves this without any task-specific architectural changes or loss modifications, unlike prior approaches that rely on separate networks or additional labels.
As a result, Voost delivers state-of-the-art performance on both try-on and try-off benchmarks. Notably, it also works robustly on in-the-wild images with diverse poses, backgrounds, lighting conditions, and garment categories.

📄 arXiv paper: https://arxiv.org/abs/2508.04825
🌐 Project page: https://nxnai.github.io/Voost/
💻 Public demo: https://huggingface.co/spaces/NXN-Labs/Voost