---
license: apache-2.0
tags:
- text-to-video
- video-generation
- baai-nova
---
# NOVA (d48w1024-osp480) Model Card
## Model Details
- **Developed by:** BAAI
- **Model type:** Masked Autoregressive Text-to-Video Generation Model
- **Model size:** 645M
- **Model precision:** torch.float16 (FP16)
- **Model resolution:** 768x480
- **Model Description:** This model generates and modifies videos from text prompts. It is a [Masked Autoregressive (MAR)](https://arxiv.org/abs/2406.11838) diffusion model that uses a pretrained text encoder ([Phi-2](https://huggingface.co/microsoft/phi-2)) and a VAE video tokenizer ([OpenSoraPlanV1.2-VAE](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0)).
- **Model License:** [Apache 2.0 License](LICENSE)
- **Resources for more information:** [GitHub Repository](https://github.com/baaivision/NOVA).
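As a rough guide to hardware requirements, the FP16 weight footprint can be estimated from the parameter counts. This is a back-of-the-envelope sketch: the ~2.7B parameter count for Phi-2 is taken from its own model card, and actual memory usage also includes activations.

```python
# Rough FP16 weight footprint: 2 bytes per parameter.
# 645M is the NOVA transformer size from this card; the Phi-2 text encoder
# adds ~2.7B parameters, and the VAE is comparatively small.
nova_gib = 645e6 * 2 / 1024**3  # ~1.2 GiB
phi2_gib = 2.7e9 * 2 / 1024**3  # ~5.0 GiB
print(f"NOVA transformer: {nova_gib:.1f} GiB, Phi-2 encoder: {phi2_gib:.1f} GiB")
```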
## Examples
Use the [🤗 Diffusers library](https://github.com/huggingface/diffusers) to run NOVA in a simple and efficient manner. First, install the required packages:
```bash
pip install diffusers transformers accelerate imageio[ffmpeg]
pip install git+https://github.com/baaivision/NOVA.git
```
Then run the pipeline:
```python
import torch
from diffnext.pipelines import NOVAPipeline
from diffnext.utils import export_to_image, export_to_video

# Load the pretrained pipeline in FP16 and move it to the GPU.
model_id = "BAAI/nova-d48w1024-osp480"
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = NOVAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to("cuda")

prompt = "Many spotted jellyfish pulsating under water."

# A single latent frame produces a still image.
image = pipe(prompt, max_latent_length=1).frames[0, 0]
export_to_image(image, "jellyfish.jpg")

# More latent frames produce a video clip.
video = pipe(prompt, max_latent_length=9).frames[0]
export_to_video(video, "jellyfish.mp4", fps=12)
```
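For reproducible outputs, Diffusers-style pipelines usually accept a `generator` argument. The snippet below assumes `NOVAPipeline` follows that convention; treat it as a sketch rather than documented diffnext API.

```python
import torch

# Assumption: NOVAPipeline accepts a Diffusers-style `generator` argument
# for deterministic sampling; confirm against the diffnext documentation.
generator = torch.Generator(device="cuda").manual_seed(42)
video = pipe(prompt, max_latent_length=9, generator=generator).frames[0]
export_to_video(video, "jellyfish_seeded.mp4", fps=12)
```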
## Uses
### Direct Use
The model is intended for research purposes only. Possible research areas and tasks include:
- Research on generative models.
- Applications in educational or creative tools.
- Generation of artworks and use in design and other artistic processes.
- Probing and understanding the limitations and biases of generative models.
- Safe deployment of models which have the potential to generate harmful content.
Excluded uses are described below.
### Out-of-Scope Use
The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.
### Misuse and Malicious Use
Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
- Mis- and disinformation.
- Representations of egregious violence and gore.
- Impersonating individuals without their consent.
- Sexual content without consent of the people who might see it.
- Sharing of copyrighted or licensed material in violation of its terms of use.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
## Limitations and Bias
### Limitations
- The autoencoding part of the model is lossy.
- The model cannot render complex legible text.
- The model does not achieve perfect photorealism.
- Fine details such as fingers may not be generated properly.
- The model was trained on a subset of the web datasets [LAION-5B](https://laion.ai/blog/laion-5b/) and [COYO-700M](https://github.com/kakaobrain/coyo-dataset), which contain adult, violent, and sexual content.
### Bias
While the capabilities of image and video generation models are impressive, they can also reinforce or exacerbate social biases.