Could Image Editing Create Animations?
I was so impressed by the model's capabilities that I started wondering what else it could do. So here is a suggestion, if anyone is willing to listen: it should be possible to use the input image as the initial frame (T0) of a sequence, with the output "edited image" being a "keyframe grid", e.g. a 4x4 grid holding predicted low-resolution keyframes T+1, T+2, etc., laid out in rows and columns. We would then split this grid into a sequence of frames and apply upscaling and frame interpolation to get a video. At least, I think that should be possible.
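To make the post-processing half of the idea concrete, here is a rough sketch of the "split, upscale, interpolate" step in Python with NumPy. Everything in it is my own assumption for illustration: the function names are made up, the grid is read in row-major order, nearest-neighbour repetition stands in for a real upscaler, and a plain linear crossfade stands in for a real frame-interpolation model. It also assumes the first frame already has the same size as the upscaled tiles.

```python
import numpy as np

def split_grid(grid, rows=4, cols=4):
    """Cut an (H, W, C) grid image into rows*cols tiles, in row-major order."""
    h, w = grid.shape[0] // rows, grid.shape[1] // cols
    return [grid[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)]

def upscale_nearest(frame, factor):
    """Placeholder upscaler: repeat every pixel `factor` times along both axes."""
    return frame.repeat(factor, axis=0).repeat(factor, axis=1)

def crossfade(a, b, steps):
    """Placeholder interpolation: `steps` linear blends strictly between a and b."""
    a32, b32 = a.astype(np.float32), b.astype(np.float32)
    frames = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)
        frames.append(np.round((1 - t) * a32 + t * b32).astype(np.uint8))
    return frames

def grid_to_video(first_frame, grid, rows=4, cols=4, factor=4, steps=1):
    """Build a frame list: T0, then each upscaled keyframe, with blends in between."""
    keyframes = [first_frame]
    keyframes += [upscale_nearest(t, factor) for t in split_grid(grid, rows, cols)]
    video = []
    for a, b in zip(keyframes, keyframes[1:]):
        video.append(a)
        video.extend(crossfade(a, b, steps))
    video.append(keyframes[-1])
    return video
```

With a 4x4 grid and one blended frame between each pair of keyframes, this turns 1 input frame plus 16 predicted keyframes into 33 frames. The hard part, of course, is getting the model to produce a temporally consistent grid in the first place; this only covers what happens afterwards.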
Actually, we could add audio to the mix as well. After all, audio can be encoded into the image/grid. Even if the encoding is visible, that's fine for an intermediate representation: it just gets separated out as part of post-processing, which should be comparable to "background removal" in a sense.
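As a toy illustration of "audio encoded into an image", here is one possible scheme, again just a sketch under my own assumptions: mono samples in the range [-1, 1] are quantized to 8-bit grey values and written row by row into a strip of pixels that could sit next to the keyframes. The function names and the layout are hypothetical, and whether a model could generate such a strip accurately enough to sound good is an open question.

```python
import numpy as np

def audio_to_strip(samples, width):
    """Quantize float samples in [-1, 1] to 8-bit grey pixels, laid out row by row."""
    clipped = np.clip(np.asarray(samples, dtype=np.float32), -1.0, 1.0)
    pixels = np.round((clipped + 1.0) * 127.5).astype(np.uint8)
    pad = (-len(pixels)) % width  # fill the last row with near-silence (128)
    pixels = np.pad(pixels, (0, pad), constant_values=128)
    return pixels.reshape(-1, width)

def strip_to_audio(strip, n_samples):
    """Inverse of audio_to_strip: read pixels back into floats, dropping the padding."""
    flat = strip.reshape(-1)[:n_samples].astype(np.float32)
    return flat / 127.5 - 1.0
```

The round trip loses only the 8-bit quantization error, so "separating" the audio in post-processing amounts to cropping the strip out of the image and decoding it.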