	---
	license: cc-by-nc-4.0
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- VILA
	- VLM
	---

	# VILA Model Card

	## Model details

	Model type:
	VILA is a visual language model (VLM) pretrained on interleaved image-text data at scale, which enables multi-image reasoning. VILA is deployable on edge devices, including Jetson Orin and laptops, via AWQ 4-bit quantization through the TinyChat framework. We find that: (1) image-text pairs are not enough, and interleaved image-text data is essential; (2) unfreezing the LLM during interleaved image-text pre-training enables in-context learning; (3) re-blending text-only instruction data is crucial for boosting both VLM and text-only performance. VILA exhibits appealing capabilities, including multi-image reasoning, in-context learning, visual chain-of-thought, and better world knowledge.
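
To give a rough sense of what 4-bit weight quantization means for edge deployment, the sketch below shows plain group-wise round-to-nearest quantization in pure Python. It is an illustration under stated assumptions, not AWQ itself and not the TinyChat API: AWQ additionally rescales weights based on activation statistics before quantizing, which is omitted here. The function names `quantize_group`, `dequantize_group`, and `quantize`, and the group size of 4, are hypothetical choices made for this example.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats.

    Returns integer codes in [0, 2**bits - 1] plus the scale and offset
    needed to reconstruct approximate weights.
    """
    qmax = (1 << bits) - 1          # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    # Guard against a constant group, where hi == lo.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize(weights, group_size=4, bits=4):
    """Quantize a flat weight list group by group (hypothetical helper)."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


if __name__ == "__main__":
    example = [0.12, -0.5, 0.33, 0.9, -1.2, 0.05, 0.7, -0.01]
    for codes, scale, lo in quantize(example):
        print(codes, dequantize_group(codes, scale, lo))
```

Each group stores small integer codes plus one scale and one offset, so the reconstruction error per weight is bounded by about half the group's scale. Smaller groups reduce that error at the cost of storing more scales and offsets.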

	Model date:
	VILA1.5-3b was trained in May 2024.

	Paper or resources for more information:
	https://github.com/Efficient-Large-Model/VILA

	```
	@misc{lin2023vila,
	title={VILA: On Pre-training for Visual Language Models},
	author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},
	year={2023},
	eprint={2312.07533},
	archivePrefix={arXiv},
	primaryClass={cs.CV}
	}
	```

	## License
	- The code is released under the Apache 2.0 license, as found in the [LICENSE](./LICENSE) file.
	- The pretrained weights are released under the [CC-BY-NC-SA-4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
	- The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
	    - [Model License](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) of LLaMA
	    - [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI
	    - [Dataset Licenses](https://github.com/Efficient-Large-Model/VILA/blob/main/data_prepare/LICENSE) for each dataset used during training

	Where to send questions or comments about the model:
	https://github.com/Efficient-Large-Model/VILA/issues

	## Intended use
	Primary intended uses:
	The primary use of VILA is research on large multimodal models and chatbots.

	Primary intended users:
	The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

	## Training dataset
	See [Dataset Preparation](https://github.com/Efficient-Large-Model/VILA/blob/main/data_prepare/README.md) for more details.

	## Evaluation dataset
	Evaluation uses a collection of 12 benchmarks: 5 academic VQA benchmarks and 7 recent benchmarks proposed specifically for instruction-following large multimodal models (LMMs).