Diffusion Adversarial Post-Training for One-Step Video Generation
Abstract
Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. To improve training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our adversarial post-trained model, Seaweed-APT, can generate 2-second, 1280x720, 24fps videos in real time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
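To illustrate how an approximated R1 objective can sidestep the higher-order gradients of the standard R1 penalty, here is a minimal PyTorch sketch. It assumes a finite-difference style approximation that compares the discriminator's output on real samples against its output on slightly perturbed copies; the `discriminator` callable, `sigma`, and `lambda_r1` values are illustrative placeholders and not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def approximated_r1_penalty(discriminator, real_samples, condition, sigma=0.01):
    """Approximate the R1 gradient penalty without second-order gradients.

    Instead of penalizing ||grad_x D(x)||^2 directly (standard R1), compare the
    discriminator output on real samples to its output on Gaussian-perturbed
    copies. `sigma` controls the perturbation scale (illustrative value).
    """
    perturbed = real_samples + sigma * torch.randn_like(real_samples)
    d_real = discriminator(real_samples, condition)
    d_perturbed = discriminator(perturbed, condition)
    return ((d_real - d_perturbed) ** 2).mean()

def discriminator_loss(discriminator, fake_samples, real_samples, condition,
                       lambda_r1=100.0, sigma=0.01):
    """Non-saturating GAN discriminator loss plus the approximated R1 term."""
    d_real = discriminator(real_samples, condition)
    d_fake = discriminator(fake_samples.detach(), condition)
    adv_loss = (F.softplus(-d_real) + F.softplus(d_fake)).mean()
    r1_loss = approximated_r1_penalty(discriminator, real_samples, condition, sigma)
    return adv_loss + lambda_r1 * r1_loss
```

Because the penalty only requires a standard forward/backward pass, it avoids the double backpropagation of the exact R1 term, which is what makes this style of regularizer attractive for large video discriminators.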
Community
Very cool!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Accelerating Video Diffusion Models via Distribution Matching (2024)
- Brick-Diffusion: Generating Long Videos with Brick-to-Wall Denoising (2025)
- CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis with Latent Diffusion Models (2025)
- DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization (2024)
- SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training (2024)
- CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training (2024)
- Optical-Flow Guided Prompt Optimization for Coherent Video Generation (2024)
A couple of years ago I was advising someone building AI-content detection techniques, and one of the most consistent approaches was to look at heavy bokeh (the out-of-focus areas in a photo taken with a wide aperture/shallow depth of field). It's wild seeing each new generation or development still refuse to even attempt to learn one of the core high-end photography concepts.
Here's a short (but not exhaustive) list; I'm sure a lot of my photography and writing has been taken and used in datasets for free, and I hate doing free work for y'all.
The bokeh you're generating isn't transitioning properly. If something contrasty, or a set of distinct objects (flowers, lights, people, a car, etc.), sits 15ft from a 50mm lens at the center of the frame, the frame captures roughly 15-20ft of width at that distance. But as that subject line runs horizontally toward the edges, the bokeh should not stay consistent: those points are gaining distance from the lens, which produces more blended, soupy bokeh, and your generations don't reproduce that transition.
The absolute dead giveaway that a portrait is made with AI: it has the bokeh rendition and vignetting of a lens used at f/1.2 - f/1.8, and you're starting to learn about mechanical vignetting (proud of you guys!! Only took the better part of a decade!), but, hilariously, the compressed, lemon-shaped bokeh typical toward the edge of the image does not change shape consistently with its position in the frame. Sometimes that's because there's a lot of a certain hue of bokeh in the center of the frame, so the model assumes those hues must suffer less mechanical/optical vignetting.
Yeah, get back to work, code monkeys, your toy is still a mess.