Unified Video Editing with Temporal Reasoner
Abstract
VideoCoF, a Chain-of-Frames approach, improves video editing precision and instruction-to-region mapping by using reasoning tokens without requiring user-provided masks.
Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors such as masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weights, and data are available at https://github.com/knightyxp/VideoCoF.
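The abstract names the mechanism but does not spell it out. Below is a minimal sketch of how the "see, reason, then edit" token ordering and a RoPE alignment based on shared temporal indices could be wired up; all function names, tensor shapes, and the index-sharing scheme are illustrative assumptions, not the paper's released implementation.

```python
import torch

def build_cof_sequence(src_tokens: torch.Tensor,
                       noisy_reason: torch.Tensor,
                       noisy_target: torch.Tensor) -> torch.Tensor:
    """Chain-of-Frames ordering (hypothetical): concatenate clean source
    latents ("see"), noised edit-region reasoning latents ("reason"), and
    noised target-video latents ("edit") along the token axis, so the
    denoiser predicts reasoning tokens before the edited video tokens.
    Each tensor: (batch, frames * patches_per_frame, dim)."""
    return torch.cat([src_tokens, noisy_reason, noisy_target], dim=1)

def aligned_temporal_ids(num_frames: int, patches_per_frame: int) -> torch.Tensor:
    """Assumed RoPE-alignment scheme: give all three segments the SAME
    per-frame temporal indices, so frame t of the target attends to frame t
    of the source and reasoning segments. Reusing indices rather than
    extending the position range is one plausible way such alignment could
    permit inference on clips longer than those seen in training."""
    frame_ids = torch.arange(num_frames).repeat_interleave(patches_per_frame)
    return frame_ids.repeat(3)  # one copy each for source, reason, target
```

Under this sketch, the reasoning segment acts as an explicit intermediate prediction target, which is what stands in for a user-provided mask at inference time.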
Community
A Chain-of-Frames video editing method that enables temporal reasoning and 4× video length extrapolation with just 50k training pairs!
Page: videocof.github.io/
Paper: arxiv.org/abs/2512.07469
Code: github.com/knightyxp/VideoCoF
Dear @XiangpengYang, in the attached video there is a missing "S" on the front cover, in the word "Reasoner".
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization (2025)
- Are Image-to-Video Models Good Zero-Shot Image Editors? (2025)
- In-Context Learning with Unpaired Clips for Instruction-based Video Editing (2025)
- Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation (2025)
- iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation (2025)
- Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer (2025)
- Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset (2025)