LLaVA-NeXT was recently merged into 🤗 Transformers, and it outperforms many proprietary models like Gemini on various benchmarks! 🤩 For those who don't know LLaVA, it's a vision-language model that can take an image and a text prompt as input and respond with text 💬 Let's take a look at how it works, with a demo and more in this post.
LLaVA is essentially a vision-language model that consists of a ViT-based CLIP encoder, an MLP projection and Vicuna as the decoder ✨ LLaVA 1.5 was released with Vicuna, while LLaVA-NeXT (1.6) is released with four different LLMs (see the config sketch after this list for a quick look at these components):
- Nous-Hermes-Yi-34B
- Mistral-7B
- Vicuna 7B & 13B
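As a quick sanity check on the architecture, here's a minimal sketch that inspects the config of one of the released checkpoints (the Mistral-7B based one is assumed here) to see the vision encoder and the LLM decoder 👇

```python
from transformers import AutoConfig

# checkpoint name is an assumption: the Mistral-7B based LLaVA-NeXT on the Hub
config = AutoConfig.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

print(config.vision_config.model_type)  # clip_vision_model -> the ViT-based CLIP encoder
print(config.text_config.model_type)    # mistral -> the LLM used as the decoder
```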
Thanks to the Transformers integration, it is very easy to use LLaVA-NeXT, not only standalone but also with 4-bit loading and Flash Attention 2 💜 See below for standalone usage 👇
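Here is a minimal standalone sketch, assuming the Mistral-7B based checkpoint on the Hub (`llava-hf/llava-v1.6-mistral-7b-hf`), an example image URL, and a GPU with enough memory for fp16 👇

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed checkpoint

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# example image (any PIL image works)
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# prompt format used by the Mistral-based checkpoint
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```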
To fit large models and make inference even faster and more memory efficient, you can enable Flash Attention 2 and load the model in 4-bit using bitsandbytes ⚡️ transformers makes it very easy to do this! See below 👇
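A sketch of the quantized + Flash Attention 2 setup, assuming `bitsandbytes` and `flash-attn` are installed and using the same Mistral-7B checkpoint as above 👇

```python
import torch
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration

# 4-bit quantization handled by bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",   # assumed checkpoint
    quantization_config=quantization_config,
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    device_map="auto",
)
# the processor and generate() call stay the same as in the standalone example above
```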
If you want to try the code right away, here's the notebook. Lastly, you can directly play with the Mistral-7B-based LLaVA-NeXT through the demo here 🤗
Resources:
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee (2024) GitHub
- Hugging Face documentation
- Original tweet (March 21, 2024)