OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
Abstract
OmniInsert addresses challenges in mask-free video insertion using a novel data pipeline, feature injection, progressive training, and context-aware rephrasing, outperforming commercial solutions.
Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals yet still struggle with subject consistency, limiting their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion (MVI) and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address data scarcity, we propose InsertPipe, a new data pipeline that automatically constructs diverse cross-pair data. Building upon our data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism to distinctly inject multi-source conditions, and propose a novel Progressive Training strategy that enables the model to balance feature injection from the subjects and the source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology that optimizes the model by simulating human preferences, and incorporate a Context-Aware Rephraser module during inference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates that OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.
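The abstract does not spell out the form of the Subject-Focused Loss, but one plausible reading is a region-weighted denoising objective that up-weights reconstruction errors inside the subject region. The PyTorch-style sketch below illustrates that idea only; the function name, the `subject_mask` input, and the `lambda_subject` weight are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only: the paper does not publish its loss formula. This shows
# one plausible "subject-focused" objective -- a standard denoising loss whose
# per-element error is up-weighted inside the subject region.
# All names (subject_focused_loss, subject_mask, lambda_subject) are hypothetical.
import torch


def subject_focused_loss(pred_noise: torch.Tensor,
                         target_noise: torch.Tensor,
                         subject_mask: torch.Tensor,
                         lambda_subject: float = 2.0) -> torch.Tensor:
    """Weighted MSE over video latents of shape (B, C, T, H, W).

    subject_mask: (B, 1, T, H, W) binary mask marking where the reference
    subject appears; errors there are weighted by (1 + lambda_subject).
    """
    per_element = (pred_noise - target_noise) ** 2                 # plain denoising error
    weight = (1.0 + lambda_subject * subject_mask).expand_as(per_element)
    return (weight * per_element).sum() / weight.sum()             # weighted mean
```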
Community
🔥Make Video Insertion Easy Now🔥
We present OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references.
Highlights:
Technology. 1) We develop InsertPipe, a systematic data curation framework featuring multiple data pipelines to automatically generate high-quality and diverse data; 2) We propose OmniInsert, a unified mask-free architecture capable of seamlessly inserting both single and multiple reference subjects into videos (see the conditioning sketch after this list); 3) We introduce InsertBench, a comprehensive benchmark tailored to the MVI task.
Significance. 1) OmniInsert demonstrates superior generation quality, bridging the gap between academic research and commercial-grade applications; 2) We present a comprehensive study of the MVI task, covering data, model, and benchmark, all of which will be publicly released to support future research and development.
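On the unified architecture highlighted above: one possible way a diffusion transformer could inject multi-source conditions "distinctly", in the spirit of the Condition-Specific Feature Injection described in the abstract, is to tag each token stream (noisy video latents, source video, reference subjects) with its own learned source embedding before concatenating them into one sequence. The sketch below is a minimal illustration under that assumption; every class name, variable name, and shape choice is hypothetical rather than the paper's implementation.

```python
# Minimal sketch, not the paper's implementation: tag each condition stream with a
# learned source embedding, then concatenate along the sequence axis so the DiT
# backbone can attend over all of them jointly. All names here are hypothetical.
import torch
import torch.nn as nn


class ConditionSpecificInjection(nn.Module):
    def __init__(self, dim: int, num_sources: int = 3):
        super().__init__()
        # One learned embedding per condition source:
        # 0 = noisy video latents, 1 = source video, 2 = reference subject images.
        self.source_embed = nn.Embedding(num_sources, dim)

    def forward(self,
                noisy_tokens: torch.Tensor,    # (B, N_v, D)
                source_tokens: torch.Tensor,   # (B, N_s, D)
                subject_tokens: torch.Tensor   # (B, N_r, D)
                ) -> torch.Tensor:
        streams = [noisy_tokens, source_tokens, subject_tokens]
        tagged = [t + self.source_embed.weight[i] for i, t in enumerate(streams)]
        # The transformer backbone then attends over the joint sequence.
        return torch.cat(tagged, dim=1)        # (B, N_v + N_s + N_r, D)
```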
[code and demo will be released🚀]
Project page: https://phantom-video.github.io/OmniInsert/
Code: https://github.com/Phantom-video/OmniInsert
Paper: https://arxiv.org/abs/2509.17627
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework (2025)
- DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing (2025)
- SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation (2025)
- Lynx: Towards High-Fidelity Personalized Video Generation (2025)
- LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation (2025)
- PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation (2025)
- ROSE: Remove Objects with Side Effects in Videos (2025)