visheratin committed • Commit 06bc212 • Parent(s): a1d449e

Update README.md
README.md CHANGED
@@ -28,6 +28,8 @@ Usually, in LLaVA models, we generate N embeddings for the image, which we then
 for one image, we create K<<N tokens for M<N parts of the image (crops)? It would allow us to get visual information from small parts of the image and not inflate the
 number of image "tokens" too much. I called this method multi-crop LLaVA (MC-LLaVA).
 
+You can read more about the model in the [blog post](https://huggingface.co/blog/visheratin/vlm-resolution-curse).
+
 MC-LLaVA-3b was fine-tuned from [Phi-2 merge](vince62s/phi-2-psy) using vision tower from
 [SigLIP 400M](https://huggingface.co/google/siglip-so400m-patch14-384).
 
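For context on the multi-crop idea mentioned in the README text above (K<<N visual tokens for each of M<N image crops), here is a minimal sketch. It assumes a vision encoder that returns N patch embeddings per crop and pools them down to K tokens; the function and variable names are hypothetical illustrations, not the actual MC-LLaVA implementation.

```python
# Illustrative sketch only: hypothetical helper, not taken from the MC-LLaVA code.
import torch
import torch.nn.functional as F

def multi_crop_tokens(image_crops, vision_encoder, k_tokens):
    """Encode M crops of one image and compress each to K visual tokens (K << N)."""
    crop_tokens = []
    for crop in image_crops:                        # M crops of a single image
        feats = vision_encoder(crop.unsqueeze(0))   # (1, N, D) patch embeddings
        # Pool the N patch embeddings of this crop down to K tokens.
        pooled = F.adaptive_avg_pool1d(
            feats.transpose(1, 2), k_tokens         # (1, D, K)
        ).transpose(1, 2)                           # (1, K, D)
        crop_tokens.append(pooled)
    # The LLM then sees M * K image "tokens" instead of N per crop.
    return torch.cat(crop_tokens, dim=1)
```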