---
license: openrail
---

This repository contains a pruned and isolated pipeline for Stage 2 of [StreamingT2V](https://streamingt2v.github.io/), dubbed "VidXTend." This model's primary purpose is extending 16-frame, 256x256px animations by 8 frames at a time (one second at 8 fps).

```
@article{henschel2024streamingt2v,
  title={StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text},
  author={Henschel, Roberto and Khachatryan, Levon and Hayrapetyan, Daniil and Poghosyan, Hayk and Tadevosyan, Vahram and Wang, Zhangyang and Navasardyan, Shant and Shi, Humphrey},
  journal={arXiv preprint arXiv:2403.14773},
  year={2024}
}
```

# Usage

## Installation

First, install the VidXTend package into your Python environment. If you're creating a new environment for VidXTend, be sure to also specify the version of torch you want with CUDA support; otherwise the pipeline will run only on the CPU.

```sh
pip install git+https://github.com/painebenjamin/vidxtend.git
```

## Command-Line

A command-line utility `vidxtend` is installed with the package.

```sh
Usage: vidxtend [OPTIONS] VIDEO PROMPT

  Run VidXtend on a video file, concatenating the generated frames to the end
  of the video.

Options:
  -fps, --frame-rate INTEGER      Video FPS. Will default to the input FPS.
  -s, --seconds FLOAT             The total number of seconds to add to the
                                  video. Multiply this number by frame rate
                                  to determine total number of new frames
                                  generated.  [default: 1.0]
  -np, --negative-prompt TEXT     Negative prompt for the diffusion process.
  -cfg, --guidance-scale FLOAT    Guidance scale for the diffusion process.
                                  [default: 7.5]
  -ns, --num-inference-steps INTEGER
                                  Number of diffusion steps.  [default: 50]
  -r, --seed INTEGER              Random seed.
  -m, --model TEXT                HuggingFace model name.
  -nh, --no-half                  Do not use half precision.
  -no, --no-offload               Do not offload to the CPU to preserve GPU
                                  memory.
  -ns, --no-slicing               Do not use VAE slicing.
  -g, --gpu-id INTEGER            GPU ID to use.
  -sf, --model-single-file        Download and use a single file instead of a
                                  directory.
  -cf, --config-file TEXT         Config file to use when using the model-
                                  single-file option. Accepts a path or a
                                  filename in the same directory as the
                                  single file. Will download from the
                                  repository passed in the model option if
                                  not provided.  [default: config.json]
  -mf, --model-filename TEXT      The model file to download when using the
                                  model-single-file option.  [default:
                                  vidxtend.safetensors]
  -rs, --remote-subfolder TEXT    Remote subfolder to download from when
                                  using the model-single-file option.
  -cd, --cache-dir DIRECTORY      Cache directory to download to. Default
                                  uses the huggingface cache.
  -o, --output FILE               Output file.  [default: output.mp4]
  -f, --fit [actual|cover|contain|stretch]
                                  Image fit mode.  [default: cover]
  -a, --anchor [top-left|top-center|top-right|center-left|center-center|center-right|bottom-left|bottom-center|bottom-right]
                                  Image anchor point.  [default: top-left]
  --help                          Show this message and exit.
```
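
For example, to add two seconds (16 new frames at 8 fps) to a clip and write the result to a new file, an invocation might look like the following. The input file name and prompt are placeholders:

```sh
vidxtend --seconds 2.0 --guidance-scale 7.5 --output extended.mp4 \
  input.mp4 "a timelapse of clouds rolling over a mountain range"
```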
## Python

You can create the pipeline, automatically pulling the weights from this repository, either as individual models:

```py
import torch

from vidxtend import VidXTendPipeline

pipeline = VidXTendPipeline.from_pretrained(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
```

Or, as a single file:

```py
import torch

from vidxtend import VidXTendPipeline

pipeline = VidXTendPipeline.from_single_file(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
```

Use these methods to improve performance:

```py
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
pipeline.set_use_memory_efficient_attention_xformers()
```

Usage is as follows:

```py
# Assume `images` is a list of PIL Images holding the video so far
new_frames = pipeline(
    prompt=prompt,
    negative_prompt=None,                  # Optionally use a negative prompt
    image=images[-8:],                     # Use the final 8 frames of the video
    input_frames_conditioning=images[:1],  # Use the first frame of the video
    eta=1.0,
    guidance_scale=7.5,
    output_type="pil",
).frames[8:]  # Drop the first 8 output frames; they are the guide frames passed in
```
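
Putting it together, here is a minimal sketch of extending an existing clip by several seconds. The frame I/O uses imageio (with the ffmpeg plugin) purely as an assumption; it is not a dependency of this package, and any library that yields a list of PIL Images will work. The prompt and file names are placeholders.

```py
import imageio
import numpy as np
import torch
from PIL import Image

from vidxtend import VidXTendPipeline

# Load the existing 8 fps, 256x256 clip as a list of PIL Images
# (imageio is an assumption here; any frame-reading library will do)
images = [Image.fromarray(frame) for frame in imageio.mimread("input.mp4")]

pipeline = VidXTendPipeline.from_pretrained(
    "benjamin-paine/vidxtend",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()

prompt = "a timelapse of clouds rolling over a mountain range"  # placeholder prompt

# Each call adds 8 frames (one second at 8 fps); loop to extend further
for _ in range(3):
    new_frames = pipeline(
        prompt=prompt,
        image=images[-8:],                     # condition on the final 8 frames
        input_frames_conditioning=images[:1],  # condition on the first frame
        eta=1.0,
        guidance_scale=7.5,
        output_type="pil",
    ).frames[8:]                               # keep only the newly generated frames
    images += new_frames

# Write the extended clip back out at 8 fps
imageio.mimsave("output.mp4", [np.array(frame) for frame in images], fps=8)
```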