# AsymmetricMagVitV2

A lightweight, open-source reproduction of MagVitV2 that is functionally aligned with the paper. It supports joint image and video encoding and decoding, and handles videos of arbitrary length and resolution.

* All spatio-temporal operators are implemented with causal 3D convolutions, avoiding the video instability (sudden FVD increases) caused by 2D+1D factorization; a minimal sketch of the causal padding idea follows this list.
* The Encoder and Decoder support arbitrary resolutions and autoregressive inference over arbitrary durations.
* Training mixes multiple resolutions and dynamic durations, so the model can decode videos with an arbitrary odd number of frames (as long as GPU memory permits), demonstrating temporal extrapolation.
* The model closely follows MagVitV2 but with fewer parameters, particularly in the lightweight Encoder, which reduces the burden of caching VAE features.
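
The causal 3D design pads only on the past side of the time axis, so each output frame depends only on the current and previous input frames. Below is a minimal PyTorch sketch of that padding idea; the class name and shapes are illustrative and are not the modules used in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution padded only toward the past, so frame t never sees frames after t."""
    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel_size
        # F.pad order for 5D input: (W_left, W_right, H_top, H_bottom, T_past, T_future)
        self.pad = (kw // 2, kw // 2, kh // 2, kh // 2, kt - 1, 0)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.conv(F.pad(x, self.pad))

x = torch.randn(1, 3, 17, 64, 64)       # a 17-frame clip
print(CausalConv3d(3, 16)(x).shape)      # torch.Size([1, 16, 17, 64, 64])
```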

## Demo

### 16-channel VAE video reconstruction

<div style="display: flex; justify-content: space-between;">
<div style="flex: 1; padding-right: 5px;">
<a href="https://upos-sz-mirrorbd.bilivideo.com/upgcxcode/49/26/500001606242649/500001606242649-1-192.mp4?e=ig8euxZM2rNcNbRVhwdVhwdlhWdVhwdVhoNvNC8BqJIzNbfq9rVEuxTEnE8L5F6VnEsSTx0vkX8fqJeYTj_lta53NCM=&uipk=5&nbs=1&deadline=1720214786&gen=playurlv2&os=bdbv&oi=2584261250&trid=bbf4c7694b334e96b66f466568155cfbO&mid=0&platform=html5&og=hw&upsig=1fb8442f381fb7d42fce1f234299118e&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,mid,platform,og&bvc=vod&nettype=1&orderid=0,3&buvid=&build=7330300&f=O_0_0&bw=71316&logo=80000000">
<img src="data/show/gif/vae_16z_bf16_sw_17_wukong.gif" alt="60s 3840x2160" style="width: 100%; height: auto;">
</a>
</div>
<div style="flex: 1; padding-left: 5px;">
<a href="https://upos-sz-mirrorcos.bilivideo.com/upgcxcode/20/29/500001606242920/500001606242920-1-192.mp4?e=ig8euxZM2rNcNbRghwdVhwdlhWNVhwdVhoNvNC8BqJIzNbfq9rVEuxTEnE8L5F6VnEsSTx0vkX8fqJeYTj_lta53NCM=&uipk=5&nbs=1&deadline=1720214870&gen=playurlv2&os=cosbv&oi=2584261250&trid=19d089cf9c3c402c83d4608511c50f60O&mid=0&platform=html5&og=cos&upsig=faeda47e09a67f389cf8ec67c7e0c17c&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,mid,platform,og&bvc=vod&nettype=1&orderid=0,3&buvid=&build=7330300&f=O_0_0&bw=120172&logo=80000000">
<img src="data/show/gif/vae_16z_bf16_sw_17_tokyo_walk_h264_16s.gif" alt="60s 1920x1080" style="width: 100%; height: auto;">
</a>
</div>
</div>

* Converting MP4 to GIF can lose detail, introduce pixelation, and truncate the duration. For the best experience, watch the original videos linked below.

###### 60s 3840x2160

[bilibili: Black Myth: Wukong, ULR 16z VAE](https://www.bilibili.com/video/BV1ULaPecEga/?spm_id_from=333.999.0.0&vd_source=681432e843390b0f7192d64fa4ed9613)

###### 60s 1920x1080

[bilibili: tokyo_walk, ULR 16z VAE](https://www.bilibili.com/video/BV1mLaPecEXP/?spm_id_from=333.999.0.0&vd_source=681432e843390b0f7192d64fa4ed9613)

##### Image reconstruction

<table>
<tr>
<td><img src="data/show/images/16z/mj_16z_1.png" alt="1" style="width:100%;"></td>
<td><img src="data/show/images/16z/mj_16z_2.png" alt="2" style="width:100%;"></td>
<td><img src="data/show/images/16z/mj_16z_3.png" alt="3" style="width:100%;"></td>
</tr>
<tr>
<td><img src="data/show/images/16z/mj_16z_4.png" alt="4" style="width:100%;"></td>
<td><img src="data/show/images/16z/mj_16z_5.png" alt="5" style="width:100%;"></td>
<td><img src="data/show/images/16z/mj_16z_6.png" alt="6" style="width:100%;"></td>
</tr>
<tr>
<td><img src="data/show/images/16z/mj_16z_7.png" alt="7" style="width:100%;"></td>
<td><img src="data/show/images/16z/mj_16z_8.png" alt="8" style="width:100%;"></td>
<td><img src="data/show/images/16z/mj_16z_9.png" alt="9" style="width:100%;"></td>
</tr>
</table>

## Contents

- [Installation](#installation)
- [Model Weights](#model-weights)
- [Metric](#metric)
- [Inference](#inference)
- [TODO List](#1)
- [Contact Us](#2)
- [Reference](#3)

### Installation

<a name="installation"></a>

#### 1. Clone the repo

```shell
git clone https://github.com/bornfly-detachment/AsymmetricMagVitV2.git
cd AsymmetricMagVitV2
```

#### 2. Setting up the virtualenv

This assumes you have navigated to the `AsymmetricMagVitV2` root after cloning it.

```shell
# install required packages from PyPI
python3 -m venv .pt2
source .pt2/bin/activate
pip3 install -r requirements/pt2.txt
```

### Model Weights

<details>
<summary>View more</summary>

| model | downsample (THW) | Encoder size | Decoder size |
|--------|--------|------|------|
| SVD 2D VAE | 1x8x8 | 34M | 64M |
| AsymmetricMagVitV2 | 4x8x8 | 100M | 159M |

| model | data | #iterations | URL |
|---|---|---|---|
| AsymmetricMagVitV2_4z | 20M Intervid | 2 nodes, 1200k | [AsymmetricMagVitV2_4z](https://huggingface.co/BornFlyReborn/AsymmetricMagVitV2_4z) |
| AsymmetricMagVitV2_16z | 20M Intervid | 4 nodes, 860k | [AsymmetricMagVitV2_16z](https://huggingface.co/BornFlyReborn/AsymmetricMagVitV2_16z) |

</details>

### Metric

<a name="metric"></a>

| model | temporal frames | FVD | FID | PSNR | SSIM |
|-----|----|-----|----|----|----|
| SVD VAE | 1 | 190.6 | 1.8 | 28.2 | 1.0 |
| openSoraPlan | 1 | 249.8 | 1.04 | 29.6 | 0.99 |
| openSoraPlan | 17 | 725.4 | 3.17 | 23.4 | 0.89 |
| openSoraPlan | 33 | 756.8 | 3.5 | 23 | 0.89 |
| AsymmetricMagVitV2_16z | 1 | 106.7 | 0.2 | 36.3 | 1.0 |
| AsymmetricMagVitV2_16z | 17 | 131.4 | 0.8 | 30.7 | 1.0 |
| AsymmetricMagVitV2_16z | 33 | 208.2 | 1.4 | 28.9 | 1.0 |

Note:

1. The test video is data/videos/tokyo_walk.mp4 at its original scale. Preprocessing with resize + CenterCrop to 256 resolution was previously tested on a larger test set and showed the same trends, but high-resolution, original-size video turns out to be the most challenging case for a 3D VAE, so only this single video is reported here, sampled at 8 fps and evaluated over the first 10 seconds.
2. The evaluation code is in models/evaluation.py. It has not been run for a while and the inference code has since been modified; FID and FVD scores depend on the model, the image preprocessing, the inference hyperparameters, and the randomness of sampling from the encoder's KL posterior, so the exact numbers may not be reproducible. They can still serve as a reference for designing your own benchmark.
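
For reference when building such a benchmark, below is a minimal per-frame PSNR computation. It assumes clips are float tensors in [0, 1] with shape (T, C, H, W); it is a starting point, not the repository's models/evaluation.py.

```python
import torch

def psnr(original: torch.Tensor, reconstructed: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Per-frame PSNR in dB for clips shaped (T, C, H, W)."""
    mse = torch.mean((original - reconstructed) ** 2, dim=(1, 2, 3))
    return 10.0 * torch.log10(max_val ** 2 / mse)

ref = torch.rand(17, 3, 256, 256)
rec = (ref + 0.01 * torch.randn_like(ref)).clamp(0, 1)
print(psnr(ref, rec).mean())  # roughly 40 dB for this noise level
```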

### Inference

#### Use AsymmetricMagVitV2 in your own code

```python
from models.vae import AsymmetricMagVitV2Pipline
import torch
from models.utils.image_op import imdenormalize, imnormalize, read_video, read_image
import torchvision.transforms as transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16
encoder_init_window = 17
input_path = "data/videos/tokyo_walk.mp4"

# normalize frames to [-1, 1]
img_transform = transforms.Compose([transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
# read the first encoder_init_window frames at 8 fps
input, last_frame_id = read_video(input_path, encoder_init_window, sample_fps=8, img_transform=img_transform, start=0)

model = AsymmetricMagVitV2Pipline.from_pretrained("BornFly/AsymmetricMagVitV2_16z").to(device, dtype).eval()
# encode the clip, then reconstruct it from the latent
init_z, reg_log = model.encode(input, encoder_init_window, is_init_image=True, return_reg_log=True, unregularized=False)
init_samples = model.decode(init_z.to(device, dtype), decode_batch_size=1, is_init_image=True)
```
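
Continuing from the snippet above, the decoded tensor can be mapped back to pixel space and written out. This is a hedged sketch: it assumes `init_samples` is returned in [-1, 1] with shape (T, C, H, W); adjust the permute if the pipeline returns a batch dimension or a different layout.

```python
import torch
from torchvision.io import write_video

# map [-1, 1] back to uint8 pixels and reorder to (T, H, W, C) as write_video expects
frames = ((init_samples.float().clamp(-1, 1) + 1) / 2 * 255).to(torch.uint8)
frames = frames.permute(0, 2, 3, 1).cpu()
write_video("reconstructed.mp4", frames, fps=8)
```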

#### High-resolution video encoding and decoding (above 720p, spatial-temporal slicing)

##### Encoder hyperparameter configuration

* Spatial slicing of frames: --max_size --min_size
* Temporal slicing of the video: --encoder_init_window --encoder_window (see the sketch below)

If GPU VRAM is insufficient, these sizes can be lowered, e.g. to a maximum between 256 and 512.
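
For illustration, here is one way a long clip could be split into an init window plus fixed-size follow-up windows before encoding. The splitting scheme and the window size of 16 are assumptions for this sketch, not the repository's implementation.

```python
def temporal_windows(num_frames: int, init_window: int = 17, window: int = 16):
    """Yield (start, end) frame ranges: one init window, then fixed-size windows."""
    yield 0, min(init_window, num_frames)
    start = init_window
    while start < num_frames:
        yield start, min(start + window, num_frames)
        start += window

print(list(temporal_windows(49)))  # [(0, 17), (17, 33), (33, 49)]
```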

##### Decoder hyperparameter configuration

* Spatial slicing of latents: --min_latent_size --max_latent_size

  With the default settings, GPU VRAM needs to exceed 28 GB. If VRAM is insufficient, these values can be lowered to a maximum between 32 (256p/8) and 64 (512p/8).

* Temporal slicing of latents: --decoder_init_window

  5 latent frames correspond to 17 frames of the original video. The relationship is latent_T_dim = (frame_T_dim - 1) / temporal_downsample_num + 1; in this model, temporal_downsample_num = 4 (a worked example follows below).
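
As a quick check of this relationship (the extra latent comes from the initial frame), a small helper; the function name is illustrative.

```python
def latent_frames(video_frames: int, temporal_downsample: int = 4) -> int:
    """Number of latent frames for a clip with 1 + 4k video frames."""
    assert (video_frames - 1) % temporal_downsample == 0, "expects an odd frame count of the form 1 + 4k"
    return (video_frames - 1) // temporal_downsample + 1

print(latent_frames(17))  # 5
print(latent_frames(33))  # 9
```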

##### 1. Encode & decode a video

```shell
python infer_vae.py --input_path data/videos/tokyo_walk.mp4 --model_path vae_16z_bf16_hf --output_folder vae_eval_out/vae_4z_bf16_hf_videos > infer_vae_video.log 2>&1
```

##### 2. Encode & decode images

```shell
python infer_vae.py --input_path data/images --model_path vae_16z_bf16_hf --output_folder vae_eval_out/vae_4z_bf16_hf_images > infer_vae_image.log 2>&1
```

### TODO List

<p id="1"></p>

* Reproduce Sora: integrate the 16-channel VAE with SD3. Due to limited computational resources, the focus is on generating 1K high-definition dynamic wallpapers.
* Reproduce VideoPoet, supporting joint multimodal representation. Due to limited computational resources, the focus is on generating music videos.

### Contact Us

<p id="2"></p>

1. For code-related questions, feel free to contact me by email: bornflyborntochange@outlook.com.
2. Scan the QR code below to join the WeChat group; if it has expired, add the contact shown as a friend first and they will invite you.
<img src="data/assets/mmqrcode1720196270375.png" alt="ding group" width="30%"/>

### Reference

<p id="3"></p>

- Open-Sora-Plan: https://github.com/PKU-YuanGroup/Open-Sora-Plan
- Open-Sora: https://github.com/hpcaitech/Open-Sora
- SVD: https://github.com/Stability-AI/generative-models