visheratin committed
Commit • d0454de
Parent(s): 06b6855
Update README.md

README.md CHANGED
@@ -24,7 +24,7 @@ widget:
 
 ## Model details
 
-The core idea behind multi-crop LLaVA is that instead of N visual token embeddings per image, I generate one token embedding per N parts of the image.
+The core idea behind multi-crop LLaVA (MC-LLaVA) is that instead of N visual token embeddings per image, I generate one token embedding per N parts of the image.
 Having high-quality embeddings for smaller parts of the image helps to extract more details and understand the scene better.
 
 For every crop of the image, I generate an embedding from the full SigLIP encoder (size [1, 1152]) and then push all N embeddings through the LLaVA adapter, which
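The multi-crop pipeline the changed paragraph describes can be sketched as follows. This is a minimal illustration, not the model's actual code: `make_crops` and the stub `encode_crop` are assumptions standing in for the real cropping logic and the full SigLIP vision encoder; only the [1, 1152] per-crop embedding size comes from the README.

```python
import numpy as np

EMBED_DIM = 1152  # per-crop SigLIP embedding size stated in the README: [1, 1152]

def make_crops(image: np.ndarray, grid: int = 2) -> list:
    """Hypothetical helper: split an (H, W, C) image into grid x grid crops."""
    h, w = image.shape[0] // grid, image.shape[1] // grid
    return [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
            for i in range(grid) for j in range(grid)]

def encode_crop(crop: np.ndarray) -> np.ndarray:
    """Stand-in for the full SigLIP encoder: returns one [1, 1152] embedding.
    A real pipeline would run the SigLIP vision tower on each crop here."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((1, EMBED_DIM))

image = np.zeros((336, 336, 3))
crops = make_crops(image, grid=2)  # N = 4 crops
# Stack the N per-crop embeddings; these would then be passed through
# the LLaVA adapter to produce one visual token per crop.
embeddings = np.concatenate([encode_crop(c) for c in crops], axis=0)
print(embeddings.shape)  # (4, 1152)
```

The point of the sketch is the shape change: instead of one image producing N token embeddings, each of the N crops produces one high-quality [1, 1152] embedding, and the adapter turns the [N, 1152] stack into N visual tokens.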