visheratin committed • Commit 06bc212 • Parent(s): a1d449e

Update README.md
README.md CHANGED
@@ -28,6 +28,8 @@ Usually, in LLaVA models, we generate N embeddings for the image, which we then
 for one image, we create K<<N tokens for M<N parts of the image (crops)? It would allow us to get visual information from small parts of the image and not inflate the
 number of image "tokens" too much. I called this method multi-crop LLaVA (MC-LLaVA).
 
+You can read more about the model in the [blog post](https://huggingface.co/blog/visheratin/vlm-resolution-curse).
+
 MC-LLaVA-3b was fine-tuned from [Phi-2 merge](vince62s/phi-2-psy) using vision tower from
 [SigLIP 400M](https://huggingface.co/google/siglip-so400m-patch14-384).
 
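For context on the multi-crop idea mentioned in the README text above (K<<N visual tokens for each of M<N image crops), here is a minimal sketch. It assumes a vision encoder that returns N patch embeddings per crop and pools them down to K tokens; the function and variable names are hypothetical illustrations, not the actual MC-LLaVA implementation.

```python
# Illustrative sketch only: hypothetical helper, not taken from the MC-LLaVA code.
import torch
import torch.nn.functional as F

def multi_crop_tokens(image_crops, vision_encoder, k_tokens):
    """Encode M crops of one image and compress each to K visual tokens (K << N)."""
    crop_tokens = []
    for crop in image_crops:                        # M crops of a single image
        feats = vision_encoder(crop.unsqueeze(0))   # (1, N, D) patch embeddings
        # Pool the N patch embeddings of this crop down to K tokens.
        pooled = F.adaptive_avg_pool1d(
            feats.transpose(1, 2), k_tokens         # (1, D, K)
        ).transpose(1, 2)                           # (1, K, D)
        crop_tokens.append(pooled)
    # The LLM then sees M * K image "tokens" instead of N per crop.
    return torch.cat(crop_tokens, dim=1)
```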