MC-LLaVA-3b / README.md
visheratin's picture
Update README.md
06bc212 verified
metadata
datasets:
  - liuhaotian/LLaVA-Pretrain
  - liuhaotian/LLaVA-Instruct-150K
language:
  - en
tags:
  - llava
  - phi
license: mit
library_name: transformers
widget:
  - text: What animal is it?
    src: >-
      https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  - text: Where is it?
    src: >-
      https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg

Multi-crop LLaVA-3b

Open In Colab

Model details

Usually, in LLaVA models, we generate N embeddings for the image, which we then combine with text embeddings and send to the LLM. But what if instead of creating N tokens for one image, we create K<<N tokens for M<N parts of the image (crops)? It would allow us to get visual information from small parts of the image and not inflate the number of image "tokens" too much. I called this method multi-crop LLaVA (MC-LLaVA).

You can read more about the model in the blog post.

MC-LLaVA-3b was fine-tuned from Phi-2 merge using vision tower from SigLIP 400M.

As Dolphin 2.6 Phi, LLaVA-3b uses ChatML prompt format:

<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant

How to use

from transformers import AutoModel, AutoProcessor
import torch

model = AutoModel.from_pretrained("visheratin/MC-LLaVA-3b", torch_dtype=torch.float16, trust_remote_code=True).to("cuda")

processor = AutoProcessor.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)

with torch.inference_mode():
    inputs = processor(prompt, [raw_image], model, max_crops=100, num_tokens=728)
    output = model.generate(**inputs, max_new_tokens=200, use_cache=True, do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id, pad_token_id=processor.tokenizer.eos_token_id)

result = processor.tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", "")
print(result)

Benchmarks

  • TextVQA - 50.9%
  • GQA - 59.5%
  • VQAv2 - 76.72%
  • VizWiz - 32.68%
  • V*-bench - OCR - 56.66%, GPT4V-hard - 52.94%, direct attributes - 40.86%, relative position - 56.57%

Examples

Open In Colab

License

The model is licensed under MIT license, but since the data used for model training is largely synthetic, you should also follow OpenAI and Google Gemini terms of service. Which means don't create competitor models for them.

Acknowledgments

Thanks to Lambda for providing a machine to train the model.

Thanks to ML Collective for continuous support and providing compute resources for testing the model.