visheratin committed
Commit 06b6855 • 1 Parent(s): 92d6894
Update README.md

README.md CHANGED
@@ -16,7 +16,7 @@ widget:
src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg"
---

-# LLaVA-3b
+# Multi-crop LLaVA-3b

<a target="_blank" href="https://colab.research.google.com/drive/1W7JQrFXwFunAY1XvS31mwC7mrXBgGD_M">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
@@ -24,13 +24,16 @@ widget:

## Model details

-
-
+The core idea behind multi-crop LLaVA is that instead of N visual token embeddings per image, I generate one token embedding for each of N parts of the image.
+Having high-quality embeddings for smaller parts of the image helps to extract more details and understand the scene better.

-
-
-
-
+For every crop of the image, I generate an embedding from the full SigLIP encoder (size [1, 1152]) and then push all N embeddings through the LLaVA adapter, which
+gives token embeddings of size [N, 2560]. Right now, the tokens do not contain explicit information about their position in the original image. I plan to add it later.
+
+MC-LLaVA-3b was fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) using the vision tower from
+[SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384).
+
+The context length during training was 1200 tokens, as the L4 GPUs I used didn't allow for more.

Like Dolphin 2.6 Phi, LLaVA-3b uses the ChatML prompt format:

@@ -115,15 +118,20 @@ inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
**Generate the data**

```python
-
+import torch
+
+with torch.inference_mode():
+    output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.4, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
```

## Benchmarks

-- TextVQA -
-- GQA -
-- VQAv2 -
-- VizWiz - 24.
+- TextVQA - 38.59%
+- GQA - 49.6%
+- VQAv2 - 64.24%
+- VizWiz - 24.88%
+- POPE - 80.59%
+- V*-bench - 52.25% (OCR - 46.66%, GPT4V-hard - 41.17%, direct attributes - 43.48%, relative position - 65.79%)

## License
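The Model details section added in this commit describes the multi-crop flow only in prose. The snippet below is a minimal illustrative sketch of that flow, not the repository's actual code: `make_crops`, `FakeSigLIPTower`, and `FakeAdapter` are stand-ins introduced here, and only the [1, 1152] per-crop SigLIP embedding and the [N, 2560] adapter output are taken from the description above; the real crop strategy, preprocessing, and module names are not shown in this diff.

```python
import torch
from torch import nn
from PIL import Image


def make_crops(image: Image.Image, grid: int = 2) -> list[Image.Image]:
    """Split the image into a grid x grid set of N = grid * grid crops."""
    w, h = image.size
    cw, ch = w // grid, h // grid
    return [
        image.crop((x * cw, y * ch, (x + 1) * cw, (y + 1) * ch))
        for y in range(grid)
        for x in range(grid)
    ]


class FakeSigLIPTower(nn.Module):
    """Stand-in for the SigLIP 400M vision tower: one pooled [1, 1152] embedding per crop."""

    def forward(self, crop: Image.Image) -> torch.Tensor:
        return torch.randn(1, 1152)


class FakeAdapter(nn.Module):
    """Stand-in for the LLaVA adapter: projects 1152-dim crop embeddings to the 2560-dim LLM space."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(1152, 2560)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.proj(embeddings)


vision_tower = FakeSigLIPTower()
adapter = FakeAdapter()

image = Image.new("RGB", (768, 768))   # placeholder image
crops = make_crops(image, grid=2)      # N = 4 crops

with torch.inference_mode():
    # One pooled SigLIP embedding per crop, stacked: [N, 1152]
    crop_embeddings = torch.cat([vision_tower(c) for c in crops], dim=0)
    # Project every crop embedding into the LLM embedding space: [N, 2560]
    image_tokens = adapter(crop_embeddings)

print(image_tokens.shape)  # torch.Size([4, 2560])
```

With one pooled embedding per crop, the number of visual tokens handed to the language model equals the number of crops, which is the trade the description above makes in place of per-patch tokens.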
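The generation snippet added above stops at `model.generate`. A typical follow-up, not part of this diff and assuming the `inputs`, `tokenizer`, and `output` names from the README code (including an `input_ids` entry in `inputs`, which this hunk does not show), is to drop the prompt tokens and decode only the newly generated text:

```python
# Keep only the tokens produced after the prompt and decode them to text.
# Assumes `inputs`, `tokenizer`, and `output` from the generation snippet above.
prompt_length = inputs["input_ids"].shape[1]
answer = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
print(answer)
```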