Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
Abstract
Voost, a unified diffusion transformer framework, jointly learns virtual try-on and try-off, enhancing garment-body correspondence and achieving state-of-the-art results across benchmarks.
Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
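The abstract names attention temperature scaling as one of the inference-time techniques. Below is a minimal, hedged sketch of what such scaling could look like in code; the function name, the temperature heuristic, and the tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import math

import torch
import torch.nn.functional as F


def attention_with_temperature(q, k, v, temperature=1.0):
    """Scaled dot-product attention with an extra temperature factor.

    A minimal sketch of attention temperature scaling; how Voost actually
    chooses `temperature` from resolution or mask size is not shown here
    and is treated as an assumption.
    """
    d = q.size(-1)
    # Standard 1/sqrt(d) scaling, additionally divided by the temperature:
    # temperature < 1 sharpens the attention distribution, > 1 softens it.
    logits = (q @ k.transpose(-2, -1)) / (math.sqrt(d) * temperature)
    weights = F.softmax(logits, dim=-1)
    return weights @ v


# Hypothetical heuristic: sharpen attention when sampling with more tokens
# than at training time, keeping attention entropy roughly comparable across
# resolutions (illustrative only; the paper's rule may differ).
train_tokens, test_tokens = 1024, 4096
temperature = 1.0 / math.sqrt(math.log(test_tokens) / math.log(train_tokens))

q = k = v = torch.randn(1, 8, test_tokens, 64)  # (batch, heads, tokens, head_dim)
out = attention_with_temperature(q, k, v, temperature=temperature)
print(out.shape)  # torch.Size([1, 8, 4096, 64])
```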
Community
By learning the two tasks jointly, we enable scalable training and significantly enhance garment-body correspondence. Voost achieves this without any task-specific architectural changes or loss modifications, unlike prior approaches that rely on separate networks or additional labels.
As a result, Voost delivers state-of-the-art performance on both try-on and try-off benchmarks, and notably, it also works robustly on in-the-wild images with diverse poses, backgrounds, lighting conditions, and garment categories.
arXiv paper: https://arxiv.org/abs/2508.04825
Project page: https://nxnai.github.io/Voost/
Public demo: https://huggingface.co/spaces/NXN-Labs/Voost
Congratulations on the great work, @RyanL22! Do you also plan to release the training code?
For interested readers, here is a curated list of all works on VTOFF (virtual try-off):
https://github.com/rizavelioglu/awesome-virtual-try-off/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On (2025)
- One Model For All: Partial Diffusion for Unified Try-On and Try-Off in Any Pose (2025)
- OmniVTON: Training-Free Universal Virtual Try-On (2025)
- DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework (2025)
- IC-Custom: Diverse Image Customization via In-Context Learning (2025)
- Two-Way Garment Transfer: Unified Diffusion Framework for Dressing and Undressing Synthesis (2025)
- FW-VTON: Flattening-and-Warping for Person-to-Person Virtual Try-on (2025)
Voost Demo is Now Live!
Try it yourself here: https://huggingface.co/spaces/NXN-Labs/Voost