Spaces:

merve
/

vision_papers

Running

App Files Files Community

vision_papers / pages /Llava-NeXT-Interleave /Llava-NeXT-Interleave.md

lbourdois

Upload 174 files

94e735e verified 4 months ago

preview code

raw

history blame

1.82 kB

	The vision language model in this video is 0.5B and can take in image, video and 3D! 🤯 Llava-NeXT-Interleave is a new vision language model trained on interleaved image, video and 3D data keep reading ⥥⥥

	![video_1](video_1.jpg)

	This model comes with 0.5B, 7B and 7B-DPO variants, all can be used with Transformers 😍
	[Collection of models](https://t.co/sZsaglSXa3) \| [Demo](https://t.co/FbpaMWJY8k)
	See how to use below 👇🏻

	![image_1](image_1.jpg)

	Authors of this paper have explored training Llava-NeXT on interleaved data where the data consists of multiple modalities, including image(s), video, 3D 📚
	They have discovered that interleaved data increases results across all benchmarks!

	![image_2](image_2.jpg)

	The model can do task transfer from single image tasks to multiple images 🤯 The authors have trained the model on single images and code yet the model can solve coding with multiple images.

	![image_3](image_3.jpg)

	Same applies to other modalities, see below for video:

	![image_4](image_4.jpg)

	The model also has document understanding capabilities and many real-world application areas

	![image_5](image_5.jpg)

	This release also comes with the dataset this model was fine-tuned on 📖 [M4-Instruct-Data](https://t.co/rutXMtNC0I)

	![image_6](image_6.jpg)

	> [!TIP]
	Ressources:
	[LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models](https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/)
	by Feng Li, Renrui Zhang*, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li (2024)
	[GitHub](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT-Interleave.md)

	> [!NOTE]
	[Original tweet](https://twitter.com/mervenoyann/status/1813560292397203630) (July 17, 2024)