Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Abstract
The Ditto framework addresses data scarcity in instruction-based video editing by generating a large synthetic dataset and training Editto with a curriculum learning strategy, achieving superior instruction-following ability.
Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.
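The abstract mentions training Editto with a curriculum learning strategy but does not specify it. As a rough illustration only (the function name, difficulty scoring, and linear schedule below are all assumptions, not the paper's method), a curriculum sampler typically sorts examples by a difficulty score and gradually widens the pool of samples visible to the model over training:

```python
import random

def curriculum_batches(examples, difficulty, epochs, batch_size, seed=0):
    """Yield batches whose difficulty ceiling grows linearly per epoch.

    `examples` is any list of training items; `difficulty` maps an item
    to a float (hypothetical score, e.g. edit complexity). Early epochs
    see only the easiest items; the final epoch sees the full dataset.
    """
    rng = random.Random(seed)
    ranked = sorted(examples, key=difficulty)  # easiest first
    for epoch in range(epochs):
        # Fraction of the difficulty-sorted pool visible this epoch.
        frac = (epoch + 1) / epochs
        pool = ranked[: max(batch_size, int(len(ranked) * frac))]
        rng.shuffle(pool)  # shuffle within the allowed pool
        for i in range(0, len(pool) - batch_size + 1, batch_size):
            yield pool[i : i + batch_size]

# Hypothetical usage: 100 items whose value doubles as their difficulty.
data = list(range(100))
batches = list(curriculum_batches(data, lambda x: x, epochs=4, batch_size=10))
```

In the first epoch only the easiest quarter of the data is sampled, so early batches contain no hard examples; by the last epoch the whole dataset is in play.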
Community
Project page: https://editto.net
Code: https://github.com/EzioBy/Ditto
Dataset: https://huggingface.co/datasets/QingyanBai/Ditto-1M
Similar papers recommended by the Semantic Scholar API:
- In-Context Learning with Unpaired Clips for Instruction-based Video Editing (2025)
- ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation (2025)
- MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks (2025)
- EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection (2025)
- PickStyle: Video-to-Video Style Transfer with Context-Style Adapters (2025)
- UniVideo: Unified Understanding, Generation, and Editing for Videos (2025)
- Factuality Matters: When Image Generation and Editing Meet Structured Visuals (2025)