BornFly committed 4fe46e1 (parent: f99893f): Update README.md

Files changed (1): README.md (+227 -3), replacing the previous content, which was only the YAML front matter (`license: mit`), with the README below.

# AsymmetricMagVitV2

A lightweight open-source reproduction of MagVitV2, fully aligned with the paper's functionality. It supports joint image and video encoding and decoding, as well as videos of arbitrary length and resolution.

* All spatio-temporal operators are implemented as causal 3D convolutions, which avoids the video instability caused by 2D+1D factorization and keeps the FVD from increasing suddenly (see the sketch after this list).
* The encoder and decoder support arbitrary resolutions and auto-regressive inference over arbitrary durations.
* Training mixes multiple resolutions and dynamic durations, so videos with any odd number of frames can be decoded as long as GPU memory permits, demonstrating temporal extrapolation.
* The model is closely aligned with MagVitV2 but has a reduced parameter count, particularly in the lightweight encoder, which lowers the cost of caching VAE features.
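
As a rough illustration of the causal 3D operators in the first bullet, here is a minimal causal 3D convolution in PyTorch; the class name, kernel size, and (batch, channels, time, height, width) layout are assumptions for this sketch, not the repository's actual module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Sketch of a causal 3D conv: pad only the past side of the time axis,
    so frame t never sees frame t+1. This is what allows auto-regressive
    inference over arbitrary durations."""
    def __init__(self, in_ch: int, out_ch: int, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        # (w_left, w_right, h_top, h_bottom, t_front, t_back): symmetric in
        # space, but all temporal padding goes in front of the clip.
        self.pad = (kw // 2, kw // 2, kh // 2, kh // 2, kt - 1, 0)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(F.pad(x, self.pad))

x = torch.randn(1, 3, 17, 64, 64)    # 17 frames of 64x64 RGB
print(CausalConv3d(3, 8)(x).shape)   # torch.Size([1, 8, 17, 64, 64])
```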

## Demo

### 16-channel VAE video reconstruction

<div style="display: flex; justify-content: space-between;">
  <div style="flex: 1; padding-right: 5px;">
    <a href="https://upos-sz-mirrorbd.bilivideo.com/upgcxcode/49/26/500001606242649/500001606242649-1-192.mp4?e=ig8euxZM2rNcNbRVhwdVhwdlhWdVhwdVhoNvNC8BqJIzNbfq9rVEuxTEnE8L5F6VnEsSTx0vkX8fqJeYTj_lta53NCM=&uipk=5&nbs=1&deadline=1720214786&gen=playurlv2&os=bdbv&oi=2584261250&trid=bbf4c7694b334e96b66f466568155cfbO&mid=0&platform=html5&og=hw&upsig=1fb8442f381fb7d42fce1f234299118e&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,mid,platform,og&bvc=vod&nettype=1&orderid=0,3&buvid=&build=7330300&f=O_0_0&bw=71316&logo=80000000">
      <img src="data/show/gif/vae_16z_bf16_sw_17_wukong.gif" alt="60s 3840x2160" style="width: 100%; height: auto;">
    </a>
  </div>
  <div style="flex: 1; padding-left: 5px;">
    <a href="https://upos-sz-mirrorcos.bilivideo.com/upgcxcode/20/29/500001606242920/500001606242920-1-192.mp4?e=ig8euxZM2rNcNbRghwdVhwdlhWNVhwdVhoNvNC8BqJIzNbfq9rVEuxTEnE8L5F6VnEsSTx0vkX8fqJeYTj_lta53NCM=&uipk=5&nbs=1&deadline=1720214870&gen=playurlv2&os=cosbv&oi=2584261250&trid=19d089cf9c3c402c83d4608511c50f60O&mid=0&platform=html5&og=cos&upsig=faeda47e09a67f389cf8ec67c7e0c17c&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,mid,platform,og&bvc=vod&nettype=1&orderid=0,3&buvid=&build=7330300&f=O_0_0&bw=120172&logo=80000000">
      <img src="data/show/gif/vae_16z_bf16_sw_17_tokyo_walk_h264_16s.gif" alt="60s 1920x1080" style="width: 100%; height: auto;">
    </a>
  </div>
</div>

* Converting MP4 to GIF may cause loss of detail, pixelation, and truncated duration. For the best experience, watch the original videos:

###### 60s 3840x2160

[bilibili: Black Myth: Wu Kong, ULR 16z VAE](https://www.bilibili.com/video/BV1ULaPecEga/?spm_id_from=333.999.0.0&vd_source=681432e843390b0f7192d64fa4ed9613)

###### 60s 1920x1080

[bilibili: tokyo_walk, ULR 16z VAE](https://www.bilibili.com/video/BV1mLaPecEXP/?spm_id_from=333.999.0.0&vd_source=681432e843390b0f7192d64fa4ed9613)

##### Image reconstruction

<table>
  <tr>
    <td><img src="data/show/images/16z/mj_16z_1.png" alt="1" style="width:100%;"></td>
    <td><img src="data/show/images/16z/mj_16z_2.png" alt="2" style="width:100%;"></td>
    <td><img src="data/show/images/16z/mj_16z_3.png" alt="3" style="width:100%;"></td>
  </tr>
  <tr>
    <td><img src="data/show/images/16z/mj_16z_4.png" alt="4" style="width:100%;"></td>
    <td><img src="data/show/images/16z/mj_16z_5.png" alt="5" style="width:100%;"></td>
    <td><img src="data/show/images/16z/mj_16z_6.png" alt="6" style="width:100%;"></td>
  </tr>
  <tr>
    <td><img src="data/show/images/16z/mj_16z_7.png" alt="7" style="width:100%;"></td>
    <td><img src="data/show/images/16z/mj_16z_8.png" alt="8" style="width:100%;"></td>
    <td><img src="data/show/images/16z/mj_16z_9.png" alt="9" style="width:100%;"></td>
  </tr>
</table>

## Contents

- [Installation](#installation)
- [Model Weights](#model-weights)
- [Metric](#metric)
- [Inference](#inference)
- [TODO List](#1)
- [Contact Us](#2)
- [Reference](#3)

### Installation

<a name="installation"></a>

#### 1. Clone the repo

```shell
git clone https://github.com/bornfly-detachment/AsymmetricMagVitV2.git
cd AsymmetricMagVitV2
```

#### 2. Setting up the virtualenv

This assumes you have navigated to the `AsymmetricMagVitV2` root after cloning it.

```shell
# create a virtualenv and install the required packages from PyPI
python3 -m venv .pt2
source .pt2/bin/activate
pip3 install -r requirements/pt2.txt
```

### Model Weights

<details>
<summary>View more</summary>

| model | downsample (T x H x W) | Encoder size | Decoder size |
|--------------------|-------|------|------|
| SVD 2D VAE         | 1x8x8 | 34M  | 64M  |
| AsymmetricMagVitV2 | 4x8x8 | 100M | 159M |

| model | data | #iterations | URL |
|------------------------|--------------|----------------|-----|
| AsymmetricMagVitV2_4z  | 20M Intervid | 2 nodes, 1200k | [AsymmetricMagVitV2_4z](https://huggingface.co/BornFlyReborn/AsymmetricMagVitV2_4z) |
| AsymmetricMagVitV2_16z | 20M Intervid | 4 nodes, 860k  | [AsymmetricMagVitV2_16z](https://huggingface.co/BornFlyReborn/AsymmetricMagVitV2_16z) |

</details>

### Metric

<a name="metric"></a>

| model | temporal frames | FVD ↓ | FID ↓ | PSNR ↑ | SSIM ↑ |
|------------------------|----|-------|------|------|------|
| SVD VAE                | 1  | 190.6 | 1.8  | 28.2 | 1.0  |
| openSoraPlan           | 1  | 249.8 | 1.04 | 29.6 | 0.99 |
| openSoraPlan           | 17 | 725.4 | 3.17 | 23.4 | 0.89 |
| openSoraPlan           | 33 | 756.8 | 3.5  | 23   | 0.89 |
| AsymmetricMagVitV2_16z | 1  | 106.7 | 0.2  | 36.3 | 1.0  |
| AsymmetricMagVitV2_16z | 17 | 131.4 | 0.8  | 30.7 | 1.0  |
| AsymmetricMagVitV2_16z | 33 | 208.2 | 1.4  | 28.9 | 1.0  |

Note:
1. The test video is data/videos/tokyo_walk.mp4 at its original scale. Preprocessing with resize + CenterCrop to 256 resolution was previously tested on a larger test set as well, with consistent trends. High-resolution, original-size videos have since proven to be the most challenging case for a 3D VAE, so only this one video is evaluated here, sampled at 8 fps over the first 10 seconds.
2. The evaluation code is in models/evaluation.py, though it has not been run for a while and the inference code has been modified since. FID and FVD scores depend on the model, the original image preprocessing, the inference hyperparameters, and the randomness introduced by sampling from the encoder's KL posterior, so the scores above cannot be reproduced exactly. They can nonetheless serve as a reference for designing one's own benchmark.
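
For example, per-frame PSNR can be computed as below; this is a minimal sketch assuming uint8 RGB frames as NumPy arrays, not the actual code in models/evaluation.py:

```python
# Minimal PSNR sketch for benchmarking reconstructions, assuming uint8 RGB
# frames. PSNR = 10 * log10(MAX^2 / MSE), with MAX = 255 for 8-bit data.
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray) -> float:
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(255.0 ** 2 / mse)

# Score a clip by averaging over frame pairs (dummy data for illustration).
ref = np.random.randint(0, 256, (17, 256, 256, 3), dtype=np.uint8)
rec = np.clip(ref + np.random.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print(np.mean([psnr(a, b) for a, b in zip(ref, rec)]))
```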

### Inference

#### Use AsymmetricMagVitV2 in your own code

```python
import torch
import torchvision.transforms as transforms

from models.vae import AsymmetricMagVitV2Pipline
from models.utils.image_op import imdenormalize, imnormalize, read_video, read_image

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16
encoder_init_window = 17  # frames consumed by the initial (causal) encode
input_path = "data/videos/tokyo_walk.mp4"
img_transform = transforms.Compose([transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# img_transform is passed by keyword here (assuming the parameter shares the
# variable's name); the original snippet passed it positionally after
# sample_fps=8, which is a Python syntax error.
video, last_frame_id = read_video(input_path, encoder_init_window, sample_fps=8,
                                  img_transform=img_transform, start=0)

model = AsymmetricMagVitV2Pipline.from_pretrained("BornFly/AsymmetricMagVitV2_16z").to(device, dtype).eval()
init_z, reg_log = model.encode(video, encoder_init_window, is_init_image=True,
                               return_reg_log=True, unregularized=False)
init_samples = model.decode(init_z.to(device, dtype), decode_batch_size=1, is_init_image=True)
```
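
To eyeball the reconstruction, the decoded tensor can be mapped back to uint8 frames. This is only a sketch: it assumes init_samples comes back normalized to [-1, 1] in a (T, C, H, W) layout, which should be verified against the actual pipeline output.

```python
# Sketch only: undo the (0.5, 0.5) normalization and save the first frame,
# assuming init_samples has shape (T, C, H, W) with values in [-1, 1].
import torch
from PIL import Image

frames = (init_samples.float().clamp(-1, 1) + 1.0) * 127.5          # -> [0, 255]
frames = frames.to(torch.uint8).permute(0, 2, 3, 1).cpu().numpy()   # (T, H, W, C)
Image.fromarray(frames[0]).save("recon_frame0.png")
```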

#### High-resolution video encoding and decoding, greater than 720p (spatio-temporal slicing)

##### About Encoder hyperparameter configuration

* slice frames spatially using: --max_size --min_size
* slice the video temporally using: --encoder_init_window --encoder_window

If GPU VRAM is insufficient, these sizes can be reduced, to a maximum of between 256 and 512.
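
As a rough illustration of what spatial slicing does (the helper below is hypothetical; tile selection and boundary blending in infer_vae.py may differ), a frame larger than max_size can be split into tiles before encoding:

```python
import torch

def split_spatial(x: torch.Tensor, max_size: int):
    """Split (..., H, W) into ((top, left), tile) pairs, each tile at most
    max_size on a side. Illustrative only."""
    h, w = x.shape[-2], x.shape[-1]
    return [((top, left), x[..., top:top + max_size, left:left + max_size])
            for top in range(0, h, max_size)
            for left in range(0, w, max_size)]

frame = torch.randn(3, 1080, 1920)     # a 1080p frame
print(len(split_spatial(frame, 512)))  # 3 x 4 = 12 tiles
```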

##### About Decoder hyperparameter configuration

* slice the latent spatially using: --min_latent_size --max_latent_size

  (By default this needs more than 28GB of GPU VRAM. If GPU VRAM is insufficient, the maximum can be reduced to between 32 = 256p/8 and 64 = 512p/8.)

* slice the latent temporally using: --decoder_init_window

5 frames of latent space correspond to 17 frames of the original video. The formula is: latent_T_dim = (frame_T_dim - 1) / temporal_downsample_num + 1. In this model temporal_downsample_num = 4, so (17 - 1) / 4 + 1 = 5.
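
A quick sketch of that mapping (the helper name is illustrative):

```python
def latent_t_dim(frame_t_dim: int, temporal_downsample_num: int = 4) -> int:
    # The first frame maps to one latent frame; every later group of
    # temporal_downsample_num frames maps to one more, hence the "+ 1".
    assert (frame_t_dim - 1) % temporal_downsample_num == 0, "use 1 + 4k frames"
    return (frame_t_dim - 1) // temporal_downsample_num + 1

for t in (1, 17, 33):
    print(t, "->", latent_t_dim(t))  # 1 -> 1, 17 -> 5, 33 -> 9
```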

##### 1. encode & decode video

```shell
python infer_vae.py --input_path data/videos/tokyo_walk.mp4 --model_path vae_16z_bf16_hf --output_folder vae_eval_out/vae_16z_bf16_hf_videos > infer_vae_video.log 2>&1
```

##### 2. encode & decode image

```shell
python infer_vae.py --input_path data/images --model_path vae_16z_bf16_hf --output_folder vae_eval_out/vae_16z_bf16_hf_images > infer_vae_image.log 2>&1
```

### TODO List

<p id="1"></p>

* Reproduce Sora: integrate the 16-channel VAE with SD3. Due to limited computational resources, the focus is on generating 1K high-definition dynamic wallpapers.

* Reproduce VideoPoet, supporting joint multimodal representation. Due to limited computational resources, the focus is on generating music videos.

### Contact Us

<p id="2"></p>

1. For code-related questions, feel free to contact me by email: bornflyborntochange@outlook.com.
2. Scan the QR code below to join the WeChat group; if it has expired, add me as a friend first and I will invite you.

<img src="data/assets/mmqrcode1720196270375.png" alt="ding group" width="30%"/>

### Reference

<p id="3"></p>

- Open-Sora-Plan: https://github.com/PKU-YuanGroup/Open-Sora-Plan
- Open-Sora: https://github.com/hpcaitech/Open-Sora
- SVD: https://github.com/Stability-AI/generative-models