Update README.md
README.md (CHANGED)

In our experiments, we replaced the CLIP vision tower in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to demonstrate its performance in Multimodal Large Language Models (MLLMs). For the language model, we used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B). The evaluation results below show that the modified model performs strongly across multiple benchmarks, validating the effectiveness of MLCD within MLLMs.

| Vision Tower    | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:----------------|:----------------------|:----------------------|
| LLM             | Qwen2.5-7B | Qwen2.5-7B |
| AI2D            | <span style="color:red">76.98</span> | 73.15 |
| ScienceQA_img   | <span style="color:red">78.09</span> | 76.35 |
| GQA             | <span style="color:red">64.17</span> | 63.31 |
| InfoVQA_val     | <span style="color:red">43.48</span> | 38.88 |
| MMBench_cn_dev  | <span style="color:red">74.83</span> | 72.51 |
| MMBench_en_dev  | <span style="color:red">76.37</span> | 74.57 |
| MME(cognition)  | <span style="color:red">432</span> | 384 |
| MME(perception) | <span style="color:red">1598</span> | 1512 |
| SeedBench       | <span style="color:red">68.20</span> | 66.80 |
| SeedBench_img   | <span style="color:red">73.75</span> | 72.72 |
| MMStar          | <span style="color:red">50.98</span> | 48.98 |
| MMMU            | <span style="color:red">44.30</span> | 44.20 |
| OCRBench        | <span style="color:red">531.00</span> | 525.00 |
| ChartQA         | <span style="color:red">67.84</span> | 66.52 |
| DocVQA_val      | <span style="color:red">76.46</span> | 75.21 |
| POPE            | 88.69 | <span style="color:red">88.83</span> |
| TextVQA_val     | 61.69 | <span style="color:red">62.47</span> |
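
As a rough illustration of the vision-tower swap described above, the sketch below loads a ViT-L/14 (336px) vision encoder with Hugging Face `transformers` and extracts patch features the way LLaVA-style pipelines typically do. The checkpoint id, the `CLIPVisionModel`/`CLIPImageProcessor` loaders, and the penultimate-layer feature selection are assumptions for illustration, not the exact training configuration; substitute the MLCD weights and loader released with this repository.

```python
# Minimal sketch: load a ViT-L/14 (336px) vision tower and extract the patch
# features that a LLaVA-style projector consumes.
# NOTE: the checkpoint id below is an assumption for illustration; substitute
# the MLCD weights released with this repository.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

checkpoint = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # hypothetical id
processor = CLIPImageProcessor.from_pretrained(checkpoint)
vision_tower = CLIPVisionModel.from_pretrained(checkpoint).eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_tower(**inputs, output_hidden_states=True)

# LLaVA-style models typically take the penultimate layer's patch tokens
# (dropping the CLS token) as the visual features fed to the projector.
patch_features = outputs.hidden_states[-2][:, 1:, :]
print(patch_features.shape)  # torch.Size([1, 576, 1024]) for a 336x336 input
```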

This table presents the results of linear probe evaluations comparing CLIP and MLCD models (a minimal sketch of a typical linear-probe setup follows the table).

| Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:---------------|:----------------------|:----------------------|
| <span style="color:red">AVG</span> | <span style="color:red">87.15</span> | 85.35 |
| Food101 | <span style="color:red">96.21</span> | 95.90 |
| CIFAR-10 | <span style="color:red">99.36</span> | 97.90 |
| CIFAR-100 | <span style="color:red">93.69</span> | 87.40 |
| Birdsnap | <span style="color:red">88.18</span> | 79.90 |
| SUN397 | <span style="color:red">87.96</span> | 82.20 |
| Stanford Cars | <span style="color:red">95.16</span> | 91.50 |
| FGVC Aircraft | <span style="color:red">86.38</span> | 71.60 |
| Describable Textures Dataset | <span style="color:red">86.70</span> | 83.00 |
| Oxford-IIIT Pets | <span style="color:red">96.27</span> | 95.10 |
| Caltech-101 | <span style="color:red">97.92</span> | 96.00 |
| Flowers102 | <span style="color:red">99.58</span> | 99.20 |
| MNIST | 98.67 | <span style="color:red">99.20</span> |
| STL-10 | 99.28 | <span style="color:red">99.70</span> |
| EuroSAT | <span style="color:red">99.06</span> | 98.10 |
| RESISC45 | <span style="color:red">95.48</span> | 94.90 |
| GTSRB | 92.32 | <span style="color:red">92.40</span> |
| KITTI | <span style="color:red">75.39</span> | 69.20 |
| Country211 | 38.12 | <span style="color:red">46.40</span> |
| PatchCamelyon | <span style="color:red">88.00</span> | 85.60 |
| UCF101 | <span style="color:red">92.86</span> | 92.00 |
| Kinetics-700 | <span style="color:red">73.35</span> | 73.00 |
| CLEVR | <span style="color:red">64.40</span> | 60.30 |
| Hateful Memes | 72.00 | <span style="color:red">77.30</span> |
| SST-2 | 76.33 | <span style="color:red">80.50</span> |
| ImageNet | <span style="color:red">86.10</span> | 85.40 |
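
For context on the numbers above, the snippet below is a minimal sketch of a typical linear-probe protocol: the image encoder stays frozen and a logistic-regression classifier (scikit-learn's L-BFGS implementation, the common choice for CLIP-style probes) is fit on pre-extracted features. It is not the exact evaluation code behind the table; the feature arrays are assumed to come from the frozen vision tower, and the usual sweep over the regularization strength `C` is omitted.

```python
# Minimal linear-probe sketch: freeze the encoder, fit a linear classifier on
# pre-extracted image features, and report top-1 accuracy on the test split.
# The feature/label arrays are assumed to come from a frozen vision tower.
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats: np.ndarray, train_labels: np.ndarray,
                 test_feats: np.ndarray, test_labels: np.ndarray) -> float:
    # L-BFGS logistic regression is the standard choice for CLIP-style probes;
    # in practice C is tuned on a validation split rather than fixed here.
    clf = LogisticRegression(C=3.16, max_iter=1000)
    clf.fit(train_feats, train_labels)
    return float(clf.score(test_feats, test_labels))  # top-1 accuracy
```
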
### C. Limitations