---
license: openrail++
---
# CascadeV | An Implementation of the Würstchen Architecture for High-Resolution Video Generation

## News

**[2024.07.17]** We release the [code](https://github.com/bytedance/CascadeV) and pretrained [weights](https://huggingface.co/ByteDance/CascadeV) of a DiT-based video VAE, which supports video reconstruction with a high compression factor (1x32x32=1024). The T2V model is still on the way.

## Introduction

CascadeV is a video generation pipeline built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture. By using a highly compressed latent representation, we can generate longer videos with higher resolution.
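To make the 1x32x32 (temporal x height x width) compression factor concrete, here is a small illustrative shape calculation; the numbers are an example, not a statement about the model's latent channel count or internals.

```python
# Illustrative shape arithmetic for a 1x32x32 compression factor
# (temporal x height x width); values are an example, not the model config.
frames, height, width = 8, 1024, 1024              # input video: 8 RGB frames at 1024x1024
t_factor, s_factor = 1, 32                          # no temporal compression, 32x spatial

latent_frames = frames // t_factor                  # 8
latent_h, latent_w = height // s_factor, width // s_factor  # 32 x 32

pixels_per_latent = t_factor * s_factor * s_factor
print(latent_frames, latent_h, latent_w)            # 8 32 32
print(pixels_per_latent)                            # 1024 pixels mapped to each latent position
```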

## Video VAE

Comparison of Our Cascade Approach with Other VAEs (on a Latent Space of Shape 8x32x32)

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/compare.png" />

Video Reconstruction: Original (left) vs. Reconstructed (right) | *Click to view the videos*

<table class="center">
<tr>
  <td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/1.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/1.jpg" /></a></td>
  <td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/2.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/2.jpg" /></a></td>
</tr>
<tr>
  <td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/3.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/3.jpg" /></a></td>
  <td><a href='https://code.byted.org/data/CascadeV/raw/master/docs/4.mp4'><img src="https://code.byted.org/data/CascadeV/raw/master/docs/4.jpg" /></a></td>
</tr>
</table>

### 1. Model Architecture

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/arch.jpg" />

#### 1.1 DiT

We use [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma) as our base model with the following modifications:

* Replace the original VAE (from [SDXL](https://arxiv.org/abs/2307.01952)) with the one from [Stable Video Diffusion](https://github.com/Stability-AI/generative-models).
* Use the semantic compressor from [StableCascade](https://github.com/Stability-AI/StableCascade) to provide the low-resolution latent input.
* Remove the text encoder and all multi-head cross-attention layers, since we do not use text conditioning.
* Replace all 2D attention layers with 3D ones. We find that 3D attention outperforms 2+1D (i.e., alternating spatial and temporal attention), especially in temporal consistency (a rough sketch of the two attention patterns follows the comparison below).

Comparison of 2+1D Attention (left) vs. 3D Attention (right)

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/2d1d_vs_3d.gif" />
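To make the difference between the two patterns concrete, here is a minimal PyTorch sketch of how the token layout differs; the shapes and the plain `nn.MultiheadAttention` module are illustrative assumptions, not the actual CascadeV layer implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: shapes and the plain nn.MultiheadAttention module
# are assumptions, not the actual CascadeV layers.
B, T, H, W, C = 2, 8, 16, 16, 64                   # batch, frames, latent height/width, channels
x = torch.randn(B, T, H, W, C)
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)

# 2+1D: spatial attention within each frame, then temporal attention per spatial location.
spatial = x.reshape(B * T, H * W, C)               # each frame attends over its own H*W tokens
spatial, _ = attn(spatial, spatial, spatial)
temporal = spatial.reshape(B, T, H * W, C).permute(0, 2, 1, 3).reshape(B * H * W, T, C)
temporal, _ = attn(temporal, temporal, temporal)   # each location attends over T frames

# 3D: every token attends over the full T*H*W sequence (costlier, better temporal consistency).
full = x.reshape(B, T * H * W, C)
full, _ = attn(full, full, full)
print(temporal.shape, full.shape)                  # torch.Size([512, 8, 64]) torch.Size([2, 2048, 64])
```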

#### 1.2 Grid Attention

Using 3D attention requires far more computation than 2D/2+1D attention, especially at higher resolutions. As a compromise, we replace some 3D attention layers with alternating spatial and temporal grid attention.

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/grid.jpg" />
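The exact grid partitioning is not spelled out here, so the following is only a plausible sketch of spatial grid attention, in which tokens sharing the same offset within each grid cell attend to each other across cells; the grid size and this particular reshaping are assumptions, and a temporal variant would be analogous.

```python
import torch
import torch.nn as nn

# Plausible sketch of spatial grid attention: tokens that share the same offset
# inside each g x g cell attend to each other across cells. The grid size and
# this partitioning are assumptions, not CascadeV's exact scheme.
B, T, H, W, C, g = 2, 8, 16, 16, 64, 4
x = torch.randn(B, T, H, W, C)
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)

# Split H and W into (cells, offsets) and group by offset, so each attention call
# sees only (H//g) * (W//g) tokens instead of the full H * W.
x = x.reshape(B, T, H // g, g, W // g, g, C)
x = x.permute(0, 1, 3, 5, 2, 4, 6)                 # (B, T, off_h, off_w, cell_h, cell_w, C)
tokens = x.reshape(B * T * g * g, (H // g) * (W // g), C)
out, _ = attn(tokens, tokens, tokens)
print(out.shape)                                    # torch.Size([256, 16, 64])
```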

### 2. Evaluation

Dataset: We compare with other baselines on the [Inter4K](https://alexandrosstergiou.github.io/datasets/Inter4K/index.html) dataset, sampling the first 200 videos to create an evaluation set with a resolution of 1024x1024 at 30 FPS.

Metrics: We use PSNR, SSIM, and LPIPS to evaluate per-frame quality (and the similarity between the original and reconstructed videos), and [VBench](https://github.com/Vchitect/VBench) to evaluate video quality independently of the reference.
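For reference, a minimal sketch of how the per-frame metrics could be computed with off-the-shelf libraries (scikit-image and the `lpips` package); the repository's actual evaluation script may differ.

```python
import numpy as np
import torch
import lpips                                        # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Sketch of per-frame PSNR/SSIM/LPIPS between an original and a reconstructed
# frame; loading and averaging over videos are left out and assumed elsewhere.
lpips_fn = lpips.LPIPS(net='alex')

def frame_metrics(orig: np.ndarray, recon: np.ndarray):
    """orig/recon: uint8 HxWx3 frames of the same size."""
    psnr = peak_signal_noise_ratio(orig, recon, data_range=255)
    ssim = structural_similarity(orig, recon, channel_axis=-1, data_range=255)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_t(orig), to_t(recon)).item()
    return psnr, ssim, lp
```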

#### 2.1 PSNR/SSIM/LPIPS

Diffusion-based VAEs (like StableCascade and our model) perform poorly on reconstruction metrics because they tend to produce videos with more fine-grained detail that is less similar to the original.

| Model/Compression Factor | PSNR↑ | SSIM↑ | LPIPS↓ |
| -- | -- | -- | -- |
| Open-Sora-Plan v1.1/4x8x8=256 | 25.7282 | 0.8000 | 0.1030 |
| EasyAnimate v3/4x8x8=256 | **28.8666** | **0.8505** | **0.0818** |
| StableCascade/1x32x32=1024 | 24.3336 | 0.6896 | 0.1395 |
| Ours/1x32x32=1024 | 23.7320 | 0.6742 | 0.1786 |

#### 2.2 VBench

Our approach achieves performance comparable to the previous VAEs in both frame-wise and temporal quality, even with a much larger compression factor.

| Model/Compression Factor | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Imaging Quality | Aesthetic Quality |
| -- | -- | -- | -- | -- | -- | -- |
| Open-Sora-Plan v1.1/4x8x8=256 | 0.9519 | 0.9618 | 0.9573 | 0.9789 | 0.6791 | 0.5450 |
| EasyAnimate v3/4x8x8=256 | 0.9578 | **0.9695** | 0.9615 | **0.9845** | 0.6735 | 0.5535 |
| StableCascade/1x32x32=1024 | 0.9490 | 0.9517 | 0.9430 | 0.9639 | **0.6811** | **0.5675** |
| Ours/1x32x32=1024 | **0.9601** | 0.9679 | **0.9626** | 0.9837 | 0.6747 | 0.5579 |

### 3. Usage

#### 3.1 Installation

We recommend using Conda:

```
conda create -n cascadev python==3.9.0
conda activate cascadev
```

Install [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma)

```
bash install.sh
```

#### 3.2 Download Pretrained Weights

```
bash pretrained/download.sh
```

#### 3.3 Video Reconstruction

A sample script for video reconstruction with a compression factor of 32:

```
bash recon.sh
```

Results of Video Reconstruction: w/o LDM (left) vs. w/ LDM (right)

<img src="https://code.byted.org/data/CascadeV/raw/master/docs/w_vs_wo_ldm.png" />

*It takes about 1 minute to reconstruct a video of shape 8x1024x1024 on one NVIDIA A800.*

#### 3.4 Train VAE

* Replace "video_list" in configs/s1024.effn-f32.py with your own video datasets
* Then run

```
bash train_vae.sh
```
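The expected format of `video_list` is not documented here; as a purely hypothetical example, it might simply enumerate video file paths, which the sketch below collects from a directory of `.mp4` files. Check the config before use.

```python
from pathlib import Path

# Hypothetical helper: collect video paths into a list file. The actual format
# expected by configs/s1024.effn-f32.py may differ; verify against the config.
video_dir = Path("/path/to/your/videos")            # placeholder path
paths = sorted(str(p) for p in video_dir.glob("*.mp4"))
Path("my_video_list.txt").write_text("\n".join(paths) + "\n")
print(f"Wrote {len(paths)} video paths to my_video_list.txt")
```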

## Acknowledgement
* [PixArt-Σ](https://github.com/PixArt-alpha/PixArt-sigma): The **main codebase** we built upon.
* [StableCascade](https://github.com/Stability-AI/StableCascade): The Würstchen architecture we adopted.
* Thanks to [Stable Video Diffusion](https://github.com/Stability-AI/generative-models) for its amazing video VAE.