rain1011/pyramid-flow-sd3

Text-to-Video

Diffusers

Safetensors

image-to-video

sd3

feifeiobama committed on Oct 30, 2024

Commit ab0b682 (verified) · Parent: 2ebb878

Update README.md

Files changed (1)

README.md +23 -11

README.md CHANGED

@@ -7,13 +7,15 @@ base_model:
 pipeline_tag: text-to-video
 tags:
 - image-to-video
 ---
-# ⚡️Pyramid Flow⚡️
-[[Paper]](https://arxiv.org/abs/2410.05954) [[Project Page ✨]](https://pyramid-flow.github.io) [[Code 🚀]](https://github.com/jy0205/Pyramid-Flow) [[demo 🤗](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow)]
-This is the official repository for Pyramid Flow, a training-efficient **Autoregressive Video Generation** method based on **Flow Matching**. By training only on open-source datasets, it generates high-quality 10-second videos at 768p resolution and 24 FPS, and naturally supports image-to-video generation.
 <table class="center" border="0" style="width: 100%; text-align: left;">
 <tr>
@@ -28,10 +30,15 @@ This is the official repository for Pyramid Flow, a training-efficient **Autoreg
 </tr>
 </table>
 ## News
-* `COMING SOON` ⚡️⚡️⚡️ Training code and new model checkpoints trained from scratch.
 * `2024.10.11`  🤗🤗🤗 [Hugging Face demo](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow) is available. Thanks [@multimodalart](https://huggingface.co/multimodalart) for the commit!
 * `2024.10.10`  🚀🚀🚀 We release the [technical report](https://arxiv.org/abs/2410.05954), [project page](https://pyramid-flow.github.io) and [model checkpoint](https://huggingface.co/rain1011/pyramid-flow-sd3) of Pyramid Flow.
 ## Installation
@@ -48,7 +55,7 @@ conda activate pyramid
 pip install -r requirements.txt
 ```
-Then, you can directly download the model from [Huggingface](https://huggingface.co/rain1011/pyramid-flow-sd3). We provide both model checkpoints for 768p and 384p video generation. The 384p checkpoint supports 5-second video generation at 24FPS, while the 768p checkpoint supports up to 10-second video generation at 24FPS.
 ```python
 from huggingface_hub import snapshot_download
@@ -59,7 +66,9 @@ snapshot_download("rain1011/pyramid-flow-sd3", local_dir=model_path, local_dir_u
 ## Usage
-To use our model, please follow the inference code in `video_generation_demo.ipynb` at [this link](https://github.com/jy0205/Pyramid-Flow/blob/main/video_generation_demo.ipynb). We further simplify it into the following two-step procedure. First, load the downloaded model:
 ```python
 import torch
@@ -76,10 +85,13 @@ model = PyramidDiTForVideoGeneration(
     model_variant='diffusion_transformer_768p',     # 'diffusion_transformer_384p'
 )
-model.vae.to("cuda")
-model.dit.to("cuda")
-model.text_encoder.to("cuda")
 model.vae.enable_tiling()
 ```
 Then, you can try text-to-video generation on your own prompts:
@@ -124,8 +136,6 @@ with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
 export_to_video(frames, "./image_to_video_sample.mp4", fps=24)
 ```
-We also support CPU offloading to allow inference with **less than 12GB** of GPU memory by adding a `cpu_offloading=True` parameter. This feature was contributed by [@Ednaordinary](https://github.com/Ednaordinary), see [#23](https://github.com/jy0205/Pyramid-Flow/pull/23) for details.
 ## Usage tips
 * The `guidance_scale` parameter controls the visual quality. We suggest a guidance scale within [7, 9] for the 768p checkpoint during text-to-video generation, and 7 for the 384p checkpoint.
@@ -147,6 +157,7 @@ The following video examples are generated at 5s, 768p, 24fps. For more results,
 </tr>
 </table>
 ## Acknowledgement
 We are grateful for the following awesome projects when implementing Pyramid Flow:
@@ -160,6 +171,7 @@ We are grateful for the following awesome projects when implementing Pyramid Flo
 ## Citation
 Consider giving this repository a star and citing Pyramid Flow in your publications if it helps your research.
 ```
 @article{jin2024pyramidal,
   title={Pyramidal Flow Matching for Efficient Video Generative Modeling},

 pipeline_tag: text-to-video
 tags:
 - image-to-video
+- sd3
 ---
+# ⚡️Pyramid Flow SD3⚡️
+[[Paper]](https://arxiv.org/abs/2410.05954) [[Project Page ✨]](https://pyramid-flow.github.io) [[Code 🚀]](https://github.com/jy0205/Pyramid-Flow) [[miniFLUX Model ⚡️]](https://huggingface.co/rain1011/pyramid-flow-miniflux) [[demo 🤗](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow)]
+This is the model repository for Pyramid Flow, a training-efficient **Autoregressive Video Generation** method based on **Flow Matching**. By training only on open-source datasets, it generates high-quality 10-second videos at 768p resolution and 24 FPS, and naturally supports image-to-video generation.
 <table class="center" border="0" style="width: 100%; text-align: left;">
 <tr>
 </tr>
 </table>
 ## News
+* `2024.10.29` ⚡️⚡️⚡️ We release [training code](https://github.com/jy0205/Pyramid-Flow?tab=readme-ov-file#training) and [new model checkpoints](https://huggingface.co/rain1011/pyramid-flow-miniflux) with FLUX structure trained from scratch.
+  > We have switched the model structure from SD3 to a mini FLUX to fix human structure issues. Please try our 1024p image checkpoint and 384p video checkpoint. We will release the 768p video checkpoint in a few days.
 * `2024.10.11`  🤗🤗🤗 [Hugging Face demo](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow) is available. Thanks [@multimodalart](https://huggingface.co/multimodalart) for the commit!
 * `2024.10.10`  🚀🚀🚀 We release the [technical report](https://arxiv.org/abs/2410.05954), [project page](https://pyramid-flow.github.io) and [model checkpoint](https://huggingface.co/rain1011/pyramid-flow-sd3) of Pyramid Flow.
 ## Installation
 pip install -r requirements.txt
 ```
+Then, download the model from [Hugging Face](https://huggingface.co/rain1011). There are two variants: [miniFLUX](https://huggingface.co/rain1011/pyramid-flow-miniflux) and [SD3](https://huggingface.co/rain1011/pyramid-flow-sd3). The miniFLUX models support 1024p image and 384p video generation, and the SD3-based models support 768p and 384p video generation. The 384p checkpoint generates 5-second videos at 24 FPS, while the 768p checkpoint generates videos of up to 10 seconds at 24 FPS.
 ```python
 from huggingface_hub import snapshot_download
 ## Usage
+For inference, we provide a Gradio demo, inference code for single-GPU, multi-GPU, and Apple Silicon setups, and VRAM-efficient features such as CPU offloading. Please check our [code repository](https://github.com/jy0205/Pyramid-Flow?tab=readme-ov-file#inference) for usage.
+Below is a simplified two-step usage procedure. First, load the downloaded model:
 ```python
 import torch
     model_variant='diffusion_transformer_768p',     # 'diffusion_transformer_384p'
 )
 model.vae.enable_tiling()
+# model.vae.to("cuda")
+# model.dit.to("cuda")
+# model.text_encoder.to("cuda")
+# If you are not using sequential offloading below, uncomment the lines above.
+model.enable_sequential_cpu_offload()
 ```
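The loading snippet above offers two mutually exclusive placement strategies: moving the three submodules to the GPU, or calling `model.enable_sequential_cpu_offload()`. As a minimal sketch, the choice could be wrapped in one function. The helper name `place_model` and its `use_offload` flag are hypothetical and not part of the Pyramid Flow code; only the `vae`, `dit`, `text_encoder` attributes and the `enable_sequential_cpu_offload()` call come from the README. The comments about speed and memory are general expectations for CPU offloading, not measured results.

```python
def place_model(model, use_offload: bool, device: str = "cuda") -> str:
    """Hypothetical helper: apply exactly one of the README's two placement options.

    `model` is assumed to expose `vae`, `dit`, `text_encoder`, and
    `enable_sequential_cpu_offload()`, as the loaded model does in the README.
    Returns a short label describing the strategy that was applied.
    """
    if use_offload:
        # Expected to lower GPU memory use at the cost of speed.
        model.enable_sequential_cpu_offload()
        return "sequential_cpu_offload"

    # Keep all three submodules resident on the target device.
    for name in ("vae", "dit", "text_encoder"):
        getattr(model, name).to(device)
    return device
```

Calling `place_model(model, use_offload=True)` mirrors the committed default, while `place_model(model, use_offload=False)` corresponds to uncommenting the three `.to("cuda")` lines.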
 Then, you can try text-to-video generation on your own prompts:
 export_to_video(frames, "./image_to_video_sample.mp4", fps=24)
 ```
 ## Usage tips
 * The `guidance_scale` parameter controls the visual quality. We suggest a guidance scale within [7, 9] for the 768p checkpoint during text-to-video generation, and 7 for the 384p checkpoint.
 </tr>
 </table>
 ## Acknowledgement
 We are grateful for the following awesome projects when implementing Pyramid Flow:
 ## Citation
 Consider giving this repository a star and citing Pyramid Flow in your publications if it helps your research.
 ```
 @article{jin2024pyramidal,
   title={Pyramidal Flow Matching for Efficient Video Generative Modeling},