LLaVA-NeXT was recently merged into 🤗 Transformers, and it outperforms many proprietary models like Gemini on various benchmarks! 🤩 For those who don't know LLaVA, it's a vision-language model that can take an image and a text prompt as input and respond with text 💬 Let's take a look at how it works, with a demo and more in this post.
LLaVA is essentially a vision-language model that consists of a ViT-based CLIP encoder, an MLP projection and Vicuna as the decoder ✨ LLaVA 1.5 was released with Vicuna, while LLaVA-NeXT (1.6) is released with four different LLMs (see the config sketch after this list for a quick look at these components):
- Nous-Hermes-Yi-34B
- Mistral-7B
- Vicuna 7B & 13B
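As a quick sanity check on the architecture, here's a minimal sketch that inspects the config of one of the released checkpoints (the Mistral-7B based one is assumed here) to see the vision encoder and the LLM decoder 👇

```python
from transformers import AutoConfig

# checkpoint name is an assumption: the Mistral-7B based LLaVA-NeXT on the Hub
config = AutoConfig.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

print(config.vision_config.model_type)  # clip_vision_model -> the ViT-based CLIP encoder
print(config.text_config.model_type)    # mistral -> the LLM used as the decoder
```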
Thanks to the Transformers integration, it is very easy to use LLaVA-NeXT, not only standalone but also with 4-bit loading and Flash Attention 2 💜 See below for standalone usage 👇
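Here is a minimal standalone sketch, assuming the Mistral-7B based checkpoint on the Hub (`llava-hf/llava-v1.6-mistral-7b-hf`), an example image URL, and a GPU with enough memory for fp16 👇

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed checkpoint

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# example image (any PIL image works)
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# prompt format used by the Mistral-based checkpoint
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```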
To fit large models and make inference even faster and more memory efficient, you can enable Flash Attention 2 and load the model in 4-bit using bitsandbytes ⚡️ transformers makes it very easy to do this! See below 👇
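A sketch of the quantized + Flash Attention 2 setup, assuming `bitsandbytes` and `flash-attn` are installed and using the same Mistral-7B checkpoint as above 👇

```python
import torch
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration

# 4-bit quantization handled by bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",   # assumed checkpoint
    quantization_config=quantization_config,
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    device_map="auto",
)
# the processor and generate() call stay the same as in the standalone example above
```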
If you want to try the code right away, here's the notebook. Lastly, you can directly play with the Mistral-7B-based LLaVA-NeXT through the demo here 🤗
Resources:
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee (2024) GitHub
- Hugging Face documentation
- Original tweet (March 21, 2024)