Papers
arxiv:2508.04825

Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off

Published on Aug 6 · Submitted by RyanL22 on Aug 11
#2 Paper of the day

Abstract

Voost, a unified diffusion transformer framework, jointly learns virtual try-on and try-off, enhancing garment-body correspondence and achieving state-of-the-art results across benchmarks.

AI-generated summary

Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
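
The abstract names attention temperature scaling but does not give its formula here. To illustrate the general idea only, the following is a minimal sketch in plain Python of scaled dot-product attention with a temperature applied to the logits. The function names and the log-ratio rule in `length_aware_temperature` are assumptions made for this example (a commonly used length-aware heuristic), not Voost's actual method.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax of scores divided by a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values, temperature=1.0):
    """Single-query scaled dot-product attention with a temperature.

    Returns the attended output vector and the attention weights.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores, temperature)
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return output, weights


def length_aware_temperature(num_tokens, train_tokens):
    """Hypothetical rule (an assumption, not taken from the paper).

    Gives a temperature below 1 when inference uses more tokens than
    training, which sharpens attention that would otherwise be diluted.
    Both arguments must be greater than 1.
    """
    return math.log(train_tokens) / math.log(num_tokens)
```

With a temperature below 1 the weights concentrate on the best-matching keys, and with a temperature above 1 they spread out. Under the assumed rule, running at a higher resolution than training (more tokens) lowers the temperature, while matching the training token count leaves attention unchanged.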

Community

Paper author Paper submitter
•
edited 17 days ago

By learning the two tasks jointly, we enable scalable training and significantly enhance garment–body correspondence. Voost achieves this without any task-specific architectural changes or loss modifications, unlike prior approaches that rely on separate networks or additional labels.
As a result, Voost delivers state-of-the-art performance on both try-on and try-off benchmarks. Notably, it also works robustly on in-the-wild images with diverse poses, backgrounds, lighting conditions, and garment categories.

📄 arXiv paper: https://arxiv.org/abs/2508.04825
🌐 Project page: https://nxnai.github.io/Voost/
💻 Public demo: https://huggingface.co/spaces/NXN-Labs/Voost

Congratulations on the great work, @RyanL22 !
Do you also plan to release the training code?

For readers interested, here is a curated list of all works on VTOFF:
https://github.com/rizavelioglu/awesome-virtual-try-off/

Paper author
•
edited 16 days ago

🎉🎉🎉 Voost Demo is Now Live!

🔥 Try it yourself here: https://huggingface.co/spaces/NXN-Labs/Voost
