---
license: openrail++
---
# CascadeV | An Implementation of the Würstchen Architecture for High-Resolution Video Generation
## News
**[2024.07.17]** We release the [code](https://github.com/bytedance/CascadeV) and pretrained [weights](https://huggingface.co/ByteDance/CascadeV) of a DiT-based video VAE, which supports video reconstruction with a high compression factor (1x32x32=1024). The T2V model is still on the way.
## Introduction
CascadeV is a video generation pipeline built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture. By using a highly compressed latent representation, we can generate longer videos with higher resolution.
## Video VAE
Comparison of Our Cascade Approach with Other VAEs (on Latent Space of Shape 8x32x32)
<img src="https://code.byted.org/data/CascadeV/raw/master/docs/compare.png" />
Video Reconstruction: Original (left) vs. Reconstructed (right) | *Click to view the videos*
<table class="center">
<tr>
<td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/1.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/1.jpg" /></a></td>
<td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/2.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/2.jpg" /></a></td>
</tr>
<tr>
<td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/3.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/3.jpg" /></a></td>
<td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/4.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/4.jpg" /></a></td>
</tr>
</table>
### 1. Model Architecture
<img src="https://code.byted.org/data/CascadeV/raw/master/docs/arch.jpg" />
#### 1.1 DiT
We use [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma) as our base model with the following modifications:
* Replace the original VAE (the one from [SDXL](https://arxiv.org/abs/2307.01952)) with the VAE from [Stable Video Diffusion](https://github.com/Stability-AI/generative-models).
* Use the semantic compressor from [StableCascade](https://github.com/Stability-AI/StableCascade) to provide the low-resolution latent input.
* Remove the text encoder and all multi-head cross-attention layers, since we do not use text conditioning.
* Replace all 2D attention layers with 3D attention. We find that 3D attention outperforms 2+1D attention (i.e., alternating spatial and temporal attention), especially in temporal consistency; a toy sketch of both attention patterns follows the comparison below.
Comparison of 2+1D Attention (left) vs. 3D Attention (right)
<img src="https://code.byted.org/data/CascadeV/raw/master/docs/2d1d_vs_3d.gif" />
#### 1.2 Grid Attention
Using 3D attention requires far more computation than 2D or 2+1D attention, especially at higher resolutions. As a compromise, we replace some of the 3D attention layers with alternating spatial and temporal grid attention (see the sketch after the figure below).
<img src="https://code.byted.org/data/CascadeV/raw/master/docs/grid.jpg" />
### 2. Evaluation
Dataset: We perform a quantitative comparison with other baselines on [Inter4K](https://alexandrosstergiou.github.io/datasets/Inter4K/index.html), sampling its first 200 videos to build an evaluation set of 1024x1024 videos at 30 FPS.
Metrics: We use PSNR, SSIM and LPIPS to evaluate per-frame quality (i.e., how similar the reconstructed video is to the original), and [VBench](https://github.com/Vchitect/VBench) to evaluate video quality independently of the reference.
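For reference, here is a minimal sketch of how such per-frame scores can be computed. This is not the evaluation script of this repository; it assumes 8-bit RGB frames and the `lpips` and `scikit-image` (>= 0.19) packages.
```python
import numpy as np
import torch
import lpips                                         # pip install lpips
from skimage.metrics import structural_similarity    # scikit-image >= 0.19

lpips_fn = lpips.LPIPS(net='alex')                   # perceptual metric, expects inputs in [-1, 1]

def frame_metrics(orig, recon):
    """orig, recon: uint8 RGB frames of shape (H, W, 3).  Returns (PSNR, SSIM, LPIPS)."""
    o = orig.astype(np.float64)
    r = recon.astype(np.float64)
    mse = np.mean((o - r) ** 2)
    psnr = 10 * np.log10(255.0 ** 2 / mse) if mse > 0 else float('inf')
    ssim = structural_similarity(orig, recon, channel_axis=-1, data_range=255)
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_fn(to_tensor(orig), to_tensor(recon)).item()
    return psnr, ssim, lp

# Video-level scores are the per-frame scores averaged over each clip.
```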
#### 2.1 PSNR/SSIM/LPIPS
Diffusion-based VAEs (such as StableCascade and our model) perform worse on reconstruction metrics: they tend to produce videos with more fine-grained detail, but that detail is less similar to the original content.
| Model/Compression Factor | PSNR↑ | SSIM↑ | LPIPS↓ |
| -- | -- | -- | -- |
| Open-Sora-Plan v1.1/4x8x8=256 | 25.7282 | 0.8000 | 0.1030 |
| EasyAnimate v3/4x8x8=256 | **28.8666** | **0.8505** | **0.0818** |
| StableCascade/1x32x32=1024 | 24.3336 | 0.6896 | 0.1395 |
| Ours/1x32x32=1024 | 23.7320 | 0.6742 | 0.1786 |
#### 2.2 VBench
Our approach achieves performance comparable to previous VAEs in both frame-wise and temporal quality, even with a much larger compression factor.
| Model/Compression Factor | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Imaging Quality | Aesthetic Quality |
| -- | -- | -- | -- | -- | -- | -- |
| Open-Sora-Plan v1.1/4x8x8=256 | 0.9519 | 0.9618 | 0.9573 | 0.9789 | 0.6791 | 0.5450 |
| EasyAnimate v3/4x8x8=256 | 0.9578 | **0.9695** | 0.9615 | **0.9845** | 0.6735 | 0.5535 |
| StableCascade/1x32x32=1024 | 0.9490 | 0.9517 | 0.9430 | 0.9639 | **0.6811** | **0.5675** |
| Ours/1x32x32=1024 | **0.9601** | 0.9679 | **0.9626** | 0.9837 | 0.6747 | 0.5579 |
### 3. Usage
#### 3.1 Installation
We recommend using Conda:
```bash
conda create -n cascadev python==3.9.0
conda activate cascadev
```
Install [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma):
```bash
bash install.sh
```
#### 3.2 Download Pretrained Weights
```bash
bash pretrained/download.sh
```
#### 3.3 Video Reconstruction
A sample script for video reconstruction with a compression factor of 32:
```bash
bash recon.sh
```
Results of Video Reconstruction: w/o LDM (left) vs. w/ LDM (right)
<img src="https://code.byted.org/data/CascadeV/raw/master/docs/w_vs_wo_ldm.png" />
*It takes almost 1 minute to reconstruct a video of shape 8x1024x1024 on a single NVIDIA A800.*
#### 3.4 Train VAE
* Replace "video_list" in configs/s1024.effn-f32.py with your own video datasets
* Then run
```
bash train_vae.sh
```
## Acknowledgement
* [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma): The **main codebase** we built upon.
* [StableCascade](https://github.com/Stability-AI/StableCascade): The Würstchen architecture we build upon.
* Thanks to [Stable Video Diffusion](https://github.com/Stability-AI/generative-models) for its amazing video VAE.