Update README.md
README.md (CHANGED)

In our experiments, we replaced the CLIP vision tower in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to demonstrate its performance in Multimodal Large Language Models (MLLMs). For the language model, we used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B). The evaluation results below show that the modified model performs strongly across multiple benchmarks, validating the effectiveness of MLCD within MLLMs.

| Vision Tower    | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:----------------|:----------------------|:----------------------|
| LLM             | Qwen2.5-7B | Qwen2.5-7B |
| AI2D            | <span style="color:red">76.98</span> | 73.15 |
| ScienceQA_img   | <span style="color:red">78.09</span> | 76.35 |
| GQA             | <span style="color:red">64.17</span> | 63.31 |
| InfoVQA_val     | <span style="color:red">43.48</span> | 38.88 |
| MMBench_cn_dev  | <span style="color:red">74.83</span> | 72.51 |
| MMBench_en_dev  | <span style="color:red">76.37</span> | 74.57 |
| MME(cognition)  | <span style="color:red">432</span> | 384 |
| MME(perception) | <span style="color:red">1598</span> | 1512 |
| SeedBench       | <span style="color:red">68.20</span> | 66.80 |
| SeedBench_img   | <span style="color:red">73.75</span> | 72.72 |
| MMStar          | <span style="color:red">50.98</span> | 48.98 |
| MMMU            | <span style="color:red">44.30</span> | 44.20 |
| OCRBench        | <span style="color:red">531.00</span> | 525.00 |
| ChartQA         | <span style="color:red">67.84</span> | 66.52 |
| DocVQA_val      | <span style="color:red">76.46</span> | 75.21 |
| POPE            | 88.69 | <span style="color:red">88.83</span> |
| TextVQA_val     | 61.69 | <span style="color:red">62.47</span> |
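
As a rough illustration of the vision-tower swap described above, the sketch below loads a ViT-L/14 (336px) vision encoder with Hugging Face `transformers` and extracts patch features the way LLaVA-style pipelines typically do. The checkpoint id, the `CLIPVisionModel`/`CLIPImageProcessor` loaders, and the penultimate-layer feature selection are assumptions for illustration, not the exact training configuration; substitute the MLCD weights and loader released with this repository.

```python
# Minimal sketch: load a ViT-L/14 (336px) vision tower and extract the patch
# features that a LLaVA-style projector consumes.
# NOTE: the checkpoint id below is an assumption for illustration; substitute
# the MLCD weights released with this repository.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

checkpoint = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # hypothetical id
processor = CLIPImageProcessor.from_pretrained(checkpoint)
vision_tower = CLIPVisionModel.from_pretrained(checkpoint).eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_tower(**inputs, output_hidden_states=True)

# LLaVA-style models typically take the penultimate layer's patch tokens
# (dropping the CLS token) as the visual features fed to the projector.
patch_features = outputs.hidden_states[-2][:, 1:, :]
print(patch_features.shape)  # torch.Size([1, 576, 1024]) for a 336x336 input
```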

This table presents the results of linear probe evaluations comparing CLIP and MLCD models (a minimal sketch of a typical linear-probe setup follows the table).

| Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:---------------|:----------------------|:----------------------|
| <span style="color:red">AVG</span> | <span style="color:red">87.15</span> | 85.35 |
| Food101 | <span style="color:red">96.21</span> | 95.90 |
| CIFAR-10 | <span style="color:red">99.36</span> | 97.90 |
| CIFAR-100 | <span style="color:red">93.69</span> | 87.40 |
| Birdsnap | <span style="color:red">88.18</span> | 79.90 |
| SUN397 | <span style="color:red">87.96</span> | 82.20 |
| Stanford Cars | <span style="color:red">95.16</span> | 91.50 |
| FGVC Aircraft | <span style="color:red">86.38</span> | 71.60 |
| Describable Textures Dataset | <span style="color:red">86.70</span> | 83.00 |
| Oxford-IIIT Pets | <span style="color:red">96.27</span> | 95.10 |
| Caltech-101 | <span style="color:red">97.92</span> | 96.00 |
| Flowers102 | <span style="color:red">99.58</span> | 99.20 |
| MNIST | 98.67 | <span style="color:red">99.20</span> |
| STL-10 | 99.28 | <span style="color:red">99.70</span> |
| EuroSAT | <span style="color:red">99.06</span> | 98.10 |
| RESISC45 | <span style="color:red">95.48</span> | 94.90 |
| GTSRB | 92.32 | <span style="color:red">92.40</span> |
| KITTI | <span style="color:red">75.39</span> | 69.20 |
| Country211 | 38.12 | <span style="color:red">46.40</span> |
| PatchCamelyon | <span style="color:red">88.00</span> | 85.60 |
| UCF101 | <span style="color:red">92.86</span> | 92.00 |
| Kinetics-700 | <span style="color:red">73.35</span> | 73.00 |
| CLEVR | <span style="color:red">64.40</span> | 60.30 |
| Hateful Memes | 72.00 | <span style="color:red">77.30</span> |
| SST-2 | 76.33 | <span style="color:red">80.50</span> |
| ImageNet | <span style="color:red">86.10</span> | 85.40 |
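
For context on the numbers above, the snippet below is a minimal sketch of a typical linear-probe protocol: the image encoder stays frozen and a logistic-regression classifier (scikit-learn's L-BFGS implementation, the common choice for CLIP-style probes) is fit on pre-extracted features. It is not the exact evaluation code behind the table; the feature arrays are assumed to come from the frozen vision tower, and the usual sweep over the regularization strength `C` is omitted.

```python
# Minimal linear-probe sketch: freeze the encoder, fit a linear classifier on
# pre-extracted image features, and report top-1 accuracy on the test split.
# The feature/label arrays are assumed to come from a frozen vision tower.
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats: np.ndarray, train_labels: np.ndarray,
                 test_feats: np.ndarray, test_labels: np.ndarray) -> float:
    # L-BFGS logistic regression is the standard choice for CLIP-style probes;
    # in practice C is tuned on a validation split rather than fixed here.
    clf = LogisticRegression(C=3.16, max_iter=1000)
    clf.fit(train_feats, train_labels)
    return float(clf.score(test_feats, test_labels))  # top-1 accuracy
```
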
### C. Limitations