Chameleon 🦎 by Meta is now available in @huggingface transformers 😍 A multimodal model that comes in 7B and 34B sizes 🤩 But what makes this model so special? Keep reading ⇣

![video_1](video_1.mp4)

[Demo](https://t.co/GsGE17fSdI) | [Models](https://t.co/cWUiVbsRz6)

Find below the code to load this model and use it locally ⬇️

![image_1](image_1.jpg)

Chameleon is a unique model: it attempts to scale early fusion 🤨 But what is early fusion? Modern vision-language models use a vision encoder with a projection layer that maps image embeddings into the text decoder's input space, so images can be used to prompt the text decoder.

![image_2](image_2.jpg)

Early fusion, on the other hand, fuses all features together (image patches and text) by using an image tokenizer: all tokens are projected into a shared space, which enables seamless generation 😏

![image_3](image_3.jpg)

The authors also introduced architectural improvements (QK-norm and revised placement of layer norms) for scalable and stable training. This way they were able to increase the token count (5x the tokens compared to Llama 3, which is a must with early fusion IMO).

![image_4](image_4.jpg)

Thanks to early fusion, this model is an any-to-any model: it can take image and text as input and output image and text, but image generation is disabled to prevent malicious use.

![image_5](image_5.jpg)

One can also do text-only prompting: the authors note that the model catches up with larger LLMs, and you can also see how it compares to VLMs with image-text prompting.

![image_6](image_6.jpg)

![image_7](image_7.jpg)

> [!TIP]
> Resources:
> [Chameleon: Mixed-Modal Early-Fusion Foundation Models](https://arxiv.org/abs/2405.09818) by Chameleon Team (2024)
> [GitHub](https://github.com/facebookresearch/chameleon)
> [Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/chameleon)

> [!NOTE]
> [Original tweet](https://twitter.com/mervenoyann/status/1814278511785312320) (July 19, 2024)