zR committed on
Commit
17ab7f9
1 Parent(s): 99eaacc
Files changed (2)
  1. README.md +48 -45
  2. README_zh.md +17 -19
README.md CHANGED
@@ -28,8 +28,9 @@ inference: false
28
 
29
  ## Model Introduction
30
 
31
- CogVideoX is an open-source video generation model that shares the same origins as [清影](https://chatglm.cn/video).
32
- The table below provides a list of the video generation models we currently offer, along with their basic information.
 
33
 
34
  <table style="border-collapse: collapse; width: 100%;">
35
  <tr>
@@ -38,41 +39,41 @@ The table below provides a list of the video generation models we currently offe
38
  <th style="text-align: center;">CogVideoX-5B (Current Repository)</th>
39
  </tr>
40
  <tr>
41
- <td style="text-align: center;">Model Introduction</td>
42
- <td style="text-align: center;">An entry-level model with good compatibility. Low cost for running and secondary development.</td>
43
- <td style="text-align: center;">A larger model with higher video generation quality and better visual effects.</td>
44
- </tr>
45
- <tr>
46
- <td style="text-align: center;">Inference Precision</td>
47
- <td style="text-align: center;">FP16, FP32<br><b>NOT support BF16</b> </td>
48
- <td style="text-align: center;">BF16, FP32<br><b>NOT support FP16</b> </td>
49
  </tr>
50
  <tr>
51
  <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
52
  <td style="text-align: center;">FP16: ~90* s</td>
53
- <td style="text-align: center;">BF16: ~200* s</td>
 
 
 
 
 
54
  </tr>
55
  <tr>
56
- <td style="text-align: center;">Single GPU Memory Consumption</td>
57
- <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>12GB* using diffusers</b><br></td>
58
- <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>21GB* using diffusers</b><br></td>
59
  </tr>
60
  <tr>
61
- <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
62
- <td style="text-align: center;"><b>10GB* using diffusers</b><br></td>
63
- <td style="text-align: center;"><b>15GB* using diffusers</b><br></td>
64
  </tr>
65
  <tr>
66
- <td style="text-align: center;">Fine-Tuning Memory Consumption (Per GPU)</td>
67
- <td style="text-align: center;">47 GB (bs=1, LORA)<br>61 GB (bs=2, LORA)<br>62GB (bs=1, SFT)</td>
68
- <td style="text-align: center;">63 GB (bs=1, LORA)<br>80 GB (bs=2, LORA)<br>75GB (bs=1, SFT)<br></td>
69
  </tr>
70
  <tr>
71
  <td style="text-align: center;">Prompt Language</td>
72
  <td colspan="2" style="text-align: center;">English*</td>
73
  </tr>
74
  <tr>
75
- <td style="text-align: center;">Maximum Prompt Length</td>
76
  <td colspan="2" style="text-align: center;">226 Tokens</td>
77
  </tr>
78
  <tr>
@@ -81,13 +82,13 @@ The table below provides a list of the video generation models we currently offe
81
  </tr>
82
  <tr>
83
  <td style="text-align: center;">Frame Rate</td>
84
- <td colspan="2" style="text-align: center;">8 frames per second</td>
85
  </tr>
86
  <tr>
87
  <td style="text-align: center;">Video Resolution</td>
88
- <td colspan="2" style="text-align: center;">720 x 480, does not support other resolutions (including fine-tuning)</td>
89
  </tr>
90
- <tr>
91
  <td style="text-align: center;">Positional Encoding</td>
92
  <td style="text-align: center;">3d_sincos_pos_embed</td>
93
  <td style="text-align: center;">3d_rope_pos_embed<br></td>
@@ -96,17 +97,21 @@ The table below provides a list of the video generation models we currently offe
96
 
97
  **Data Explanation**
98
 
99
- + When testing with the diffusers library, the `enable_model_cpu_offload()` and `pipe.vae.enable_tiling()` options were
100
- enabled. This configuration was not tested on non-**NVIDIA A100 / H100** devices, but it should generally work on all
101
- **NVIDIA Ampere architecture** and above. Disabling these optimizations will significantly increase memory usage, with
102
- peak usage approximately 3 times the values shown in the table.
103
- + For multi-GPU inference, `enable_model_cpu_offload()` must be disabled.
104
- + Inference speed tests used the above memory optimization options. Without these optimizations, inference speed
105
- increases by around 10%.
106
- + The model supports only English input. For other languages, translation to English is recommended during large model
107
- processing.
108
-
109
- + **Note** Using [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT version
 
 
 
 
110
  models. Feel free to visit our GitHub for more information.
111
 
112
  ## Quick Start 🤗
@@ -119,13 +124,16 @@ optimizations and conversions to get a better experience.**
119
  1. Install the required dependencies
120
 
121
  ```shell
122
- pip install --upgrade opencv-python transformers diffusers
 
 
 
 
123
  ```
124
 
125
  2. Run the code
126
 
127
  ```python
128
- import gc
129
  import torch
130
  from diffusers import CogVideoXPipeline
131
  from diffusers.utils import export_to_video
@@ -138,11 +146,6 @@ pipe = CogVideoXPipeline.from_pretrained(
138
  )
139
 
140
  pipe.enable_model_cpu_offload()
141
-
142
- gc.collect()
143
- torch.cuda.empty_cache()
144
- torch.cuda.reset_accumulated_memory_stats()
145
- torch.cuda.reset_peak_memory_stats()
146
  pipe.vae.enable_tiling()
147
 
148
  video = pipe(
@@ -157,9 +160,6 @@ video = pipe(
157
  export_to_video(video, "output.mp4", fps=8)
158
  ```
159
 
160
- If the generated model appears “all green” and not viewable in the default MAC player, it is a normal phenomenon (due to
161
- OpenCV saving video issues). Simply use a different player to view the video.
162
-
163
  ## Explore the Model
164
 
165
  Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
@@ -169,6 +169,7 @@ Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
169
  3. Inference and fine-tuning of SAT version models, and even pre-release.
170
  4. Project updates and changelog, plus more opportunities to interact.
171
  5. CogVideoX toolchain to help you better use the model.
 
172
 
173
  ## Model License
174
 
@@ -184,3 +185,5 @@ This model is released under the [CogVideoX LICENSE](LICENSE).
184
  year={2024}
185
  }
186
  ```
 
 
 
28
 
29
  ## Model Introduction
30
 
31
+ CogVideoX is an open-source version of the video generation model originating from [QingYing](https://chatglm.cn/video).
32
+ The table below displays the list of video generation models we currently offer, along with their foundational
33
+ information.
34
 
35
  <table style="border-collapse: collapse; width: 100%;">
36
  <tr>
 
39
  <th style="text-align: center;">CogVideoX-5B (Current Repository)</th>
40
  </tr>
41
  <tr>
42
+ <td style="text-align: center;">Model Description</td>
43
+ <td style="text-align: center;">Entry-level model that balances compatibility with low cost for running and secondary development.</td>
44
+ <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
 
 
 
 
 
45
  </tr>
46
  <tr>
47
  <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
48
  <td style="text-align: center;">FP16: ~90* s</td>
49
+ <td style="text-align: center;">BF16: ~180* s</td>
50
+ </tr>
51
+ <tr>
52
+ <td style="text-align: center;">Inference Precision</td>
53
+ <td style="text-align: center;"><b>FP16* (recommended)</b>, BF16, FP32, INT8; INT4 is not supported</td>
54
+ <td style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, INT8; INT4 is not supported</td>
55
  </tr>
56
  <tr>
57
+ <td style="text-align: center;">Single GPU Memory Usage<br></td>
58
+ <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
59
+ <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
60
  </tr>
61
  <tr>
62
+ <td style="text-align: center;">Multi-GPU Memory Usage</td>
63
+ <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
64
+ <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
65
  </tr>
66
  <tr>
67
+ <td style="text-align: center;">Fine-tuning Memory Usage (per GPU)</td>
68
+ <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
69
+ <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
70
  </tr>
71
  <tr>
72
  <td style="text-align: center;">Prompt Language</td>
73
  <td colspan="2" style="text-align: center;">English*</td>
74
  </tr>
75
  <tr>
76
+ <td style="text-align: center;">Max Prompt Length</td>
77
  <td colspan="2" style="text-align: center;">226 Tokens</td>
78
  </tr>
79
  <tr>
 
82
  </tr>
83
  <tr>
84
  <td style="text-align: center;">Frame Rate</td>
85
+ <td colspan="2" style="text-align: center;">8 frames / second </td>
86
  </tr>
87
  <tr>
88
  <td style="text-align: center;">Video Resolution</td>
89
+ <td colspan="2" style="text-align: center;">720 * 480, no support for other resolutions (including fine-tuning)</td>
90
  </tr>
91
+ <tr>
92
  <td style="text-align: center;">Positional Encoding</td>
93
  <td style="text-align: center;">3d_sincos_pos_embed</td>
94
  <td style="text-align: center;">3d_rope_pos_embed<br></td>
 
97
 
98
  **Data Explanation**
99
 
100
+ + When testing with the diffusers library, the `enable_model_cpu_offload()` option and `pipe.vae.enable_tiling()`
101
+ optimization were enabled. This configuration has not been tested on devices other than **NVIDIA A100 / H100**. It
102
+ should generally work on all devices with the **NVIDIA Ampere architecture** or newer. If these optimizations are disabled,
103
+ memory usage increases significantly, with peak memory about 3 times the table values.
104
+ + The CogVideoX-2B model was trained using `FP16` precision, so it is recommended to use `FP16` for inference.
105
+ + For multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
106
+ + Using the INT8 model significantly reduces inference speed. The trade-off lets GPUs with less memory run inference
107
+ with minimal loss in video quality.
108
+ + Inference speed tests also used the memory optimizations mentioned above. Without them, inference is approximately
109
+ 10% faster. Only the `diffusers` version of the model supports quantization.
110
+ + The model supports only English input; prompts in other languages can be translated into English while being refined by a large language model.
111
+
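The notes above can be turned into a rough planning helper. The following is a minimal sketch, not part of the repository: it copies the diffusers memory figures and recommended precisions from the table, and applies the "about 3 times" factor for runs without CPU offload and VAE tiling. The function names are hypothetical, and applying the 3x factor to the INT8 figures is an assumption, since the README only states it for the table values in general.

```python
# Peak memory (GB) from the table above: diffusers, single GPU,
# with enable_model_cpu_offload() and pipe.vae.enable_tiling() enabled.
TABLE_MEMORY_GB = {
    ("CogVideoX-2B", "FP16"): 12.5,
    ("CogVideoX-2B", "INT8"): 7.8,
    ("CogVideoX-5B", "BF16"): 20.7,
    ("CogVideoX-5B", "INT8"): 11.4,
}

# Recommended inference precision per model, as listed in the table.
RECOMMENDED_PRECISION = {"CogVideoX-2B": "FP16", "CogVideoX-5B": "BF16"}

# "peak memory about 3 times the table values" when the optimizations are off.
# Assumption: the same factor is used for every precision, including INT8.
UNOPTIMIZED_FACTOR = 3.0


def estimate_peak_memory_gb(model, precision=None, optimized=True):
    """Return the approximate peak GPU memory in GB for one model and precision."""
    precision = precision or RECOMMENDED_PRECISION[model]
    base = TABLE_MEMORY_GB[(model, precision)]
    return base if optimized else base * UNOPTIMIZED_FACTOR


def fits_on_gpu(model, gpu_memory_gb, precision=None, optimized=True):
    """Check whether the estimate fits within the given GPU memory budget."""
    return estimate_peak_memory_gb(model, precision, optimized) <= gpu_memory_gb


if __name__ == "__main__":
    for name in RECOMMENDED_PRECISION:
        print(name, RECOMMENDED_PRECISION[name], estimate_peak_memory_gb(name), "GB")
```

These numbers are estimates taken from A100 / H100 measurements, so treat the result as a first filter rather than a guarantee.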
112
+ **Note**
113
+
114
+ + Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT version
115
  models. Feel free to visit our GitHub for more information.
116
 
117
  ## Quick Start 🤗
 
124
  1. Install the required dependencies
125
 
126
  ```shell
127
+ # diffusers>=0.30.1
128
+ # transformers>=4.44.0
129
+ # accelerate>=0.33.0 (installing from source is suggested)
130
+ # imageio-ffmpeg>=0.5.1
131
+ pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
132
  ```
133
 
134
  2. Run the code
135
 
136
  ```python
 
137
  import torch
138
  from diffusers import CogVideoXPipeline
139
  from diffusers.utils import export_to_video
 
146
  )
147
 
148
  pipe.enable_model_cpu_offload()
 
 
 
 
 
149
  pipe.vae.enable_tiling()
150
 
151
  video = pipe(
 
160
  export_to_video(video, "output.mp4", fps=8)
161
  ```
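Before running the pipeline, it can help to confirm that the installed packages meet the minimum versions listed in the install step. The sketch below uses only the standard library's `importlib.metadata`. The helper names are hypothetical, and the `transformers` floor is written as 4.44.0 on the assumption that the `0.44.0` in the install comment is a typo. The version parser is deliberately simple and only compares leading numeric components.

```python
from importlib import metadata

# Minimum versions from the install step above.
# Assumption: the transformers floor is 4.44.0, not the 0.44.0 shown in the comment.
MINIMUM_VERSIONS = {
    "diffusers": "0.30.1",
    "transformers": "4.44.0",
    "accelerate": "0.33.0",
    "imageio-ffmpeg": "0.5.1",
}


def parse_version(text):
    """Convert '0.30.1' or '0.34.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need >= {minimum})")
            continue
        if parse_version(installed) < parse_version(minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems


if __name__ == "__main__":
    for line in check_requirements() or ["All listed dependencies meet the minimum versions."]:
        print(line)
```

Running the script prints one line per missing or outdated package, which is usually quicker to act on than an import error raised partway through model loading.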
162
 
 
 
 
163
  ## Explore the Model
164
 
165
  Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
 
169
  3. Inference and fine-tuning of SAT version models, and even pre-release.
170
  4. Project updates and changelog, plus more opportunities to interact.
171
  5. CogVideoX toolchain to help you better use the model.
172
+ 6. INT8 model inference code support.
173
 
174
  ## Model License
175
 
 
185
  year={2024}
186
  }
187
  ```
188
+
189
+
README_zh.md CHANGED
@@ -6,7 +6,7 @@
6
  </div>
7
  <p align="center">
8
  <a href="https://huggingface.co/THUDM/CogVideoX-5b/blob/main/README.md">📄 Read in English</a> |
9
- <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B">🤗 Huggingface Space</a> |
10
  <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
11
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
12
  </p>
@@ -28,21 +28,21 @@ CogVideoX是 [清影](https://chatglm.cn/video) 同源的开源版本视频生
28
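  <td style="text-align: center;">Entry-level model that balances compatibility with low cost for running and secondary development.</td>
29
  <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>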
30
  </tr>
31
- <tr>
32
- <td style="text-align: center;">Inference Precision</td>
33
- <td style="text-align: center;">FP16, FP32<br><b>BF16 NOT supported</b> </td>
34
- <td style="text-align: center;">BF16, FP32<br><b>FP16 NOT supported</b> </td>
35
- </tr>
36
  <tr>
37
  <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
38
  <td style="text-align: center;">FP16: ~90* s</td>
39
- <td style="text-align: center;">BF16: ~200* s</td>
40
  </tr>
41
  <tr>
42
  <td style="text-align: center;">Single GPU Memory Usage<br></td>
43
  <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>12GB* using diffusers</b><br></td>
44
  <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>21GB* using diffusers</b><br></td>
45
  </tr>
 
 
 
 
 
46
  <tr>
47
  <td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
48
  <td style="text-align: center;"><b>10GB* using diffusers</b><br></td>
@@ -83,10 +83,11 @@ CogVideoX是 [清影](https://chatglm.cn/video) 同源的开源版本视频生
83
  **Data Explanation**
84
 
85
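  + When testing with the diffusers library, the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()` optimization were enabled. Actual memory usage
86
- has not been tested on devices other than **NVIDIA A100 / H100**. Generally, this configuration works on all devices with the **NVIDIA Ampere architecture**
87
  or newer. If the optimizations are disabled, memory usage increases severalfold, with peak memory about 3 times the table values.
88
  + For multi-GPU inference, the `enable_model_cpu_offload()` optimization must be disabled.
89
- + Inference speed tests also used the memory optimizations above. Without them, inference is about 10% faster.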
 
90
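  + The model supports only English input; prompts in other languages can be translated into English while being refined by a large language model.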
91
 
92
  **Note**
@@ -102,13 +103,16 @@ CogVideoX是 [清影](https://chatglm.cn/video) 同源的开源版本视频生
102
  1. Install the required dependencies
103
 
104
  ```shell
105
- pip install --upgrade opencv-python transformers accelerate diffusers
 
 
 
 
106
  ```
107
 
108
- 2. Run the code
109
 
110
  ```python
111
- import gc
112
  import torch
113
  from diffusers import CogVideoXPipeline
114
  from diffusers.utils import export_to_video
@@ -121,11 +125,6 @@ pipe = CogVideoXPipeline.from_pretrained(
121
  )
122
 
123
  pipe.enable_model_cpu_offload()
124
-
125
- gc.collect()
126
- torch.cuda.empty_cache()
127
- torch.cuda.reset_accumulated_memory_stats()
128
- torch.cuda.reset_peak_memory_stats()
129
  pipe.vae.enable_tiling()
130
 
131
  video = pipe(
@@ -140,8 +139,6 @@ video = pipe(
140
  export_to_video(video, "output.mp4", fps=8)
141
  ```
142
 
143
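- If the generated video appears "all green" and cannot be viewed in the default macOS player, this is expected (an issue with how OpenCV saves video); simply use a different player.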
144
-
145
  ## Explore the Model
146
 
147
  Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
@@ -151,6 +148,7 @@ export_to_video(video, "output.mp4", fps=8)
151
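  3. Inference and fine-tuning of SAT version models, and even pre-release.
152
  4. Project updates and changelog, plus more opportunities to interact.
153
  5. CogVideoX toolchain to help you better use the model.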
 
154
 
155
  ## Model License
156
 
 
6
  </div>
7
  <p align="center">
8
  <a href="https://huggingface.co/THUDM/CogVideoX-5b/blob/main/README.md">📄 Read in English</a> |
9
+ <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B">🤗 Huggingface Space</a> |
10
  <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> |
11
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
12
  </p>
 
28
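  <td style="text-align: center;">Entry-level model that balances compatibility with low cost for running and secondary development.</td>
29
  <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>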
30
  </tr>
 
 
 
 
 
31
  <tr>
32
  <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
33
  <td style="text-align: center;">FP16: ~90* s</td>
34
+ <td style="text-align: center;">BF16: ~180* s</td>
35
  </tr>
36
  <tr>
37
  <td style="text-align: center;">Single GPU Memory Usage<br></td>
38
  <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>12GB* using diffusers</b><br></td>
39
  <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>21GB* using diffusers</b><br></td>
40
  </tr>
41
+ <tr>
42
+ <td style="text-align: center;">Single GPU Quantized Inference Memory Usage (diffusers)<br></td>
43
+ <td style="text-align: center;"><b>INT8: 7.8GB* </b><br>INT4: not supported</td>
44
+ <td style="text-align: center;"><b>INT8: 11.5GB* </b><br>INT4: not supported</td>
45
+ </tr>
46
  <tr>
47
  <td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
48
  <td style="text-align: center;"><b>10GB* using diffusers</b><br></td>
 
83
  **Data Explanation**
84
 
85
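  + When testing with the diffusers library, the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()` optimization were enabled. Actual VRAM / RAM usage
86
+ has not been tested on devices other than **NVIDIA A100 / H100**. Generally, this configuration works on all devices with the **NVIDIA Ampere architecture**
87
  or newer. If the optimizations are disabled, memory usage increases severalfold, with peak memory about 3 times the table values.
88
  + For multi-GPU inference, the `enable_model_cpu_offload()` optimization must be disabled.
89
+ + Using the INT8 model significantly reduces inference speed. The trade-off lets GPUs with less memory run inference with little loss in video quality.
90
+ + Inference speed tests also used the memory optimizations above. Without them, inference is about 10% faster. Only the `diffusers` version of the model supports quantization.
91
  + The model supports only English input; prompts in other languages can be translated into English while being refined by a large language model.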
92
 
93
  **Note**
 
103
  1. Install the required dependencies
104
 
105
  ```shell
106
+ # diffusers>=0.30.1
107
+ # transformers>=4.44.0
108
+ # accelerate>=0.33.0 (installing from source is suggested)
109
+ # imageio-ffmpeg>=0.5.1
110
+ pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
111
  ```
112
 
113
+ 2. Run the code (BF16 / FP16)
114
 
115
  ```python
 
116
  import torch
117
  from diffusers import CogVideoXPipeline
118
  from diffusers.utils import export_to_video
 
125
  )
126
 
127
  pipe.enable_model_cpu_offload()
 
 
 
 
 
128
  pipe.vae.enable_tiling()
129
 
130
  video = pipe(
 
139
  export_to_video(video, "output.mp4", fps=8)
140
  ```
141
 
 
 
142
  ## Explore the Model
143
 
144
  Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
 
148
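  3. Inference and fine-tuning of SAT version models, and even pre-release.
149
  4. Project updates and changelog, plus more opportunities to interact.
150
  5. CogVideoX toolchain to help you better use the model.
151
+ 6. INT8 model inference code.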
152
 
153
  ## Model License
154