How to generate a video in 4 seconds?
I tried to use this pipeline, but the output videos are at most 2 seconds long. I changed the frame count, fps, and number of steps, but I can't increase the length. The documentation describes a callback_on_step_end (Callable, optional) parameter: A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
But it is not clear how to use it. How can I generate a 4 second long video?
To generate a 4 second long video (which is what I'm guessing you mean), change the frame rate parameter (fps) in the "export_to_video" function call.
The model will generate 25 frames (by default -- and what it's fine-tuned to do). If you use fps=25 as the parameter for your model call, and 25 fps as the parameter for the call to "export_to_video", the video produced will have a 1 second duration.
Alternatively, if you keep fps=25 as the parameter for the model call but export to video using an fps of 6.25 (i.e. 25/4), the resulting video will have a duration of 4 seconds. However, the output will almost certainly be 'choppy', due to large perceptual jumps between frames (humans need roughly 20 fps to perceive smooth video).
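To make the frames-to-duration arithmetic concrete, here is a minimal sketch. The duration helper is runnable; the diffusers calls are shown only as comments, since they need a GPU and downloaded model weights, and the exact pipeline/argument names depend on which pipeline you use:

```python
def video_duration_seconds(num_frames: int, export_fps: float) -> float:
    """Playback duration is determined by the frame count and the fps
    passed to export_to_video, not by the fps passed to the model call."""
    return num_frames / export_fps

# With the default 25 generated frames:
one_second = video_duration_seconds(25, 25)     # exported at 25 fps -> 1.0 s
four_seconds = video_duration_seconds(25, 6.25) # exported at 6.25 fps -> 4.0 s

# In a diffusers-style workflow this would look roughly like
# (hypothetical sketch, not runnable as-is):
#   frames = pipe(image, fps=25, num_frames=25).frames[0]
#   export_to_video(frames, "out.mp4", fps=6.25)
```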
Possible solutions:
One solution is to chain together multiple 1 second videos (use the final frame of each preceding clip as the starting frame of the next). However, since the model has no insight into the motion of the preceding clips, the output may not be perfectly coherent.
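The chaining loop can be sketched as follows. This is a hedged sketch: generate_clip is a stand-in for the actual pipeline call (something like pipe(image=seed, ...).frames[0]), and it assumes the pipeline returns the conditioning frame as the first output frame, so that frame is dropped at each join:

```python
from typing import Callable, List, TypeVar

Frame = TypeVar("Frame")

def chain_clips(generate_clip: Callable[[Frame], List[Frame]],
                start_frame: Frame, num_clips: int) -> List[Frame]:
    """Generate num_clips clips back to back, seeding each clip with the
    final frame of the previous one."""
    frames: List[Frame] = []
    seed = start_frame
    for _ in range(num_clips):
        clip = generate_clip(seed)  # stand-in for the real pipeline call
        if frames:
            clip = clip[1:]         # assumed: first frame duplicates the seed
        frames.extend(clip)
        seed = frames[-1]
    return frames
```

For example, with a dummy generator that returns three consecutive "frames", chain_clips(lambda f: [f, f + 1, f + 2], 0, 2) yields [0, 1, 2, 3, 4].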
Instead, this model could be fine-tuned to produce videos given a more robust context than a single start frame, similar to the two-frame input discussed in https://arxiv.org/abs/2304.08818.
Using this pretrained model without fine-tuning, you might try passing additional end frames: encode them to latent space and prevent alteration of these 'seed frames' during diffusion, extending the duration of the video while improving consistency of motion between frame groups. (I've tried this; it can produce decent 4-5 second clips, but not much more than that, due to compounding error.)
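This is where the callback_on_step_end parameter you quoted comes in. A minimal sketch of the idea, with NumPy arrays standing in for the torch latents and the frame axis assumed to be axis 1 (both of these are assumptions; the real pipeline uses torch tensors with its own layout):

```python
import numpy as np

def make_pin_frames_callback(seed_latents, seed_indices):
    """Build a callback_on_step_end-style function that overwrites the
    latents of the 'seed frames' after every denoising step, so diffusion
    never alters those frames."""
    def callback(pipe, step, timestep, callback_kwargs):
        latents = callback_kwargs["latents"]
        # Re-inject the fixed seed-frame latents (frame axis assumed to be 1).
        latents[:, seed_indices] = seed_latents[:, seed_indices]
        callback_kwargs["latents"] = latents
        return callback_kwargs
    return callback

# Toy demonstration with random arrays in place of real latents:
rng = np.random.default_rng(0)
seed = rng.normal(size=(1, 25, 4, 8, 8))      # (batch, frames, channels, h, w)
cb = make_pin_frames_callback(seed, [0, 24])  # pin first and last frame

noisy = rng.normal(size=seed.shape)
out = cb(None, 0, 999, {"latents": noisy.copy()})["latents"]
```

In an actual run you would pass the callback to the pipeline, e.g. pipe(..., callback_on_step_end=cb, callback_on_step_end_tensor_inputs=["latents"]), so that the latents tensor appears in callback_kwargs at each step.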