visheratin committed
Commit • d0454de
Parent(s): 06b6855
Update README.md

README.md CHANGED
@@ -24,7 +24,7 @@ widget:
 
 ## Model details
 
-The core idea behind multi-crop LLaVA is that instead of N visual token embeddings per image, I generate one token embedding per N parts of the image.
+The core idea behind multi-crop LLaVA (MC-LLaVA) is that instead of N visual token embeddings per image, I generate one token embedding per N parts of the image.
 Having high-quality embeddings for smaller parts of the image helps to extract more details and understand the scene better.
 
 For every crop of the image, I generate an embedding from the full SigLIP encoder (size [1, 1152]) and then push all N embeddings through the LLaVA adapter, which
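The multi-crop pipeline the changed paragraph describes can be sketched as follows. This is a minimal illustration, not the model's actual code: `make_crops` and the stub `encode_crop` are assumptions standing in for the real cropping logic and the full SigLIP vision encoder; only the [1, 1152] per-crop embedding size comes from the README.

```python
import numpy as np

EMBED_DIM = 1152  # per-crop SigLIP embedding size stated in the README: [1, 1152]

def make_crops(image: np.ndarray, grid: int = 2) -> list:
    """Hypothetical helper: split an (H, W, C) image into grid x grid crops."""
    h, w = image.shape[0] // grid, image.shape[1] // grid
    return [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
            for i in range(grid) for j in range(grid)]

def encode_crop(crop: np.ndarray) -> np.ndarray:
    """Stand-in for the full SigLIP encoder: returns one [1, 1152] embedding.
    A real pipeline would run the SigLIP vision tower on each crop here."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((1, EMBED_DIM))

image = np.zeros((336, 336, 3))
crops = make_crops(image, grid=2)  # N = 4 crops
# Stack the N per-crop embeddings; these would then be passed through
# the LLaVA adapter to produce one visual token per crop.
embeddings = np.concatenate([encode_crop(c) for c in crops], axis=0)
print(embeddings.shape)  # (4, 1152)
```

The point of the sketch is the shape change: instead of one image producing N token embeddings, each of the N crops produces one high-quality [1, 1152] embedding, and the adapter turns the [N, 1152] stack into N visual tokens.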