ali-vilab/modelscope-damo-text-to-video-synthesis

Text-to-Video

OpenCLIP


update_checkpoint

by wenmengzhou - opened Mar 21, 2023

base: refs/heads/main ← from: refs/pr/6

Files changed (2)

README.md +5 -20
text2video_pytorch_model.pth +1 -1

README.md CHANGED

@@ -5,11 +5,11 @@ pipeline_tag: text-to-video
 The original repo is [here](https://modelscope.cn/models/damo/text-to-video-synthesis/summary).
-**We Are Hiring!** (Based in Beijing / Hangzhou, China.)
+We Are Hiring! (Based on Beijing / Hangzhou, China.)
 If you're looking for an exciting challenge and the opportunity to work with cutting-edge technologies in AIGC and large-scale pretraining, then we are the place for you. We are looking for talented, motivated and creative individuals to join our team. If you are interested, please send your CV to us.
-EMAIL: yingya.zyy@alibaba-inc.com
+EMAIL: wangjiuniu.wjn@alibaba-inc.com
 This model is based on a multi-stage text-to-video generation diffusion model, which inputs a description text and returns a video that matches the text description. Only English input is supported.
@@ -25,17 +25,14 @@ This model has a wide range of applications and can reason and generate videos b
 ## How to use
-The model has been launched on [ModelScope Studio](https://modelscope.cn/studios/damo/text-to-video-synthesis/summary) and [huggingface](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis), you can experience it directly; you can also refer to [Colab page](https://colab.research.google.com/drive/1uW1ZqswkQ9Z9bp5Nbo5z59cAn7I0hE6R?usp=sharing#scrollTo=bSluBq99ObSk) to build it yourself.
-In order to facilitate the experience of the model, users can refer to the [Aliyun Notebook Tutorial](https://modelscope.cn/headlines/detail/26) to quickly develop this Text-to-Video model.
-This demo requires about 16GB CPU RAM and 16GB GPU RAM. Under the ModelScope framework, the current model can be used by calling a simple Pipeline, where the input must be in dictionary format, the legal key value is 'text', and the content is a short text. This model currently only supports inference on the GPU. Enter specific code examples as follows:
+Under the ModelScope framework, the current model can be used by calling a simple Pipeline, where the input must be in dictionary format, the legal key value is 'text', and the content is a short text. This model currently only supports inference on the GPU. Enter specific code examples as follows:
+For Colab usage, you can view [this webpage](https://colab.research.google.com/drive/1uW1ZqswkQ9Z9bp5Nbo5z59cAn7I0hE6R?usp=sharing).
 ### Operating environment (Python Package)
 ```
-pip install modelscope==1.4.2
+pip install git+https://github.com/modelscope/modelscope.git
 pip install open_clip_torch
 pip install pytorch-lightning
 ```
@@ -85,15 +82,3 @@ The output mp4 file can be viewed by [VLC media player](https://www.videolan.org
 ## Training data
 The training data includes [LAION5B](https://huggingface.co/datasets/laion/laion2B-en), [ImageNet](https://www.image-net.org/), [Webvid](https://m-bain.github.io/webvid-dataset/) and other public datasets. Image and video filtering is performed after pre-training such as aesthetic score, watermark score, and deduplication.
-## Citation
-```bibtex
-    @InProceedings{VideoFusion,
-        author    = {Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
-        title     = {VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
-        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
-        month     = {June},
-        year      = {2023}
-    }
-```
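
The "How to use" text in this README says the model is called through a ModelScope Pipeline whose input is a dictionary with the single key 'text'. A minimal sketch of that call follows. The helper names `build_pipeline_input` and `generate_video`, the `weights` directory, and the task string `'text-to-video-synthesis'` are assumptions not shown in this diff; the sketch also assumes the packages listed under "Operating environment" are installed, a GPU is available, and the checkpoint files are already in the model directory.

```python
from pathlib import Path


def build_pipeline_input(prompt: str) -> dict:
    """Wrap an English prompt in the dict form the README describes.

    Hypothetical helper: the only documented requirement is a dict
    whose single legal key is 'text'.
    """
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    return {"text": prompt.strip()}


def generate_video(prompt: str, model_dir: str = "weights") -> str:
    """Run the text-to-video pipeline and return the output video path.

    Assumes modelscope is installed, a GPU is available, and the
    checkpoint has already been downloaded into `model_dir`.
    """
    # Imported lazily so build_pipeline_input works without modelscope.
    from modelscope.outputs import OutputKeys
    from modelscope.pipelines import pipeline

    # The task name below is an assumption; it does not appear in this diff.
    pipe = pipeline("text-to-video-synthesis", Path(model_dir).as_posix())
    result = pipe(build_pipeline_input(prompt))
    return result[OutputKeys.OUTPUT_VIDEO]
```

Calling `generate_video("A panda eating bamboo on a rock.")` would then return the path of the generated mp4, provided the assumptions above hold.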

text2video_pytorch_model.pth CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9e139fb08e50a6c072127d9ecef4fe8e91dbdf24edad23a4e1a7c569f0ca3488
+oid sha256:d9609d02717b799137a97244844ab6df0d1a071568a1d24dcb62d9050f3a24a3
 size 5645549049
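
Only the `oid` changes in this pointer while the size stays the same, so a stale local copy of the checkpoint cannot be detected by file size alone. The sketch below shows one way to check a downloaded file against a Git LFS pointer using only the Python standard library. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local file path is whatever the reader downloaded the checkpoint to.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(pointer_text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in pointer_text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid field has the form '<algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and hash both match the pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    # Read in chunks so a multi-gigabyte checkpoint is never held in memory.
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]
```

Passing the new pointer text from this diff to `parse_lfs_pointer` and the local `text2video_pytorch_model.pth` to `matches_pointer` would indicate whether the local file is the updated checkpoint.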