Commit 6c57ce8 (verified) · committed by xiangan · 1 parent: 4c07b27

Update README.md

Files changed (1): README.md (+45, −45)
README.md CHANGED
@@ -28,25 +28,25 @@ Our model was trained on publicly available image-caption data from the [LAION40
 In our experiments, we replaced the CLIP model in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to demonstrate the performance of the MLCD model in Multimodal Large Language Models (MLLMs). For the language model, we used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B). The evaluation results show that the modified model performs exceptionally well across multiple benchmarks, validating the effectiveness of the MLCD model within MLLMs.

 | Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
- |:----------------|:-------------|:-------------|
- | LLM | Qwen2.5-7B | Qwen2.5-7B |
- | AI2D | **76.98** | 73.15 |
- | ScienceQA_img | **78.09** | 76.35 |
- | GQA | **64.17** | 63.31 |
- | InfoVQA_val | **43.48** | 38.88 |
- | MMBench_cn_dev | **74.83** | 72.51 |
- | MMBench_en_dev | **76.37** | 74.57 |
- | MME(cognition) | **432** | 384 |
- | MME(perception) | **1598** | 1512 |
- | SeedBench | **68.20** | 66.80 |
- | SeedBench_img | **73.75** | 72.72 |
- | MMStar | **50.98** | 48.98 |
- | MMMU | **44.30** | 44.20 |
- | OCRBench | **531.00** | 525.00 |
- | ChartQA | **67.84** | 66.52 |
- | DocVQA_val | **76.46** | 75.21 |
- | POPE | 88.69 | **88.83** |
- | TextVQA_val | 61.69 | **62.47** |
+ |:----------------|:----------------------|:----------------------|
+ | LLM | Qwen2.5-7B | Qwen2.5-7B |
+ | AI2D | <span style="color:red">76.98</span> | 73.15 |
+ | ScienceQA_img | <span style="color:red">78.09</span> | 76.35 |
+ | GQA | <span style="color:red">64.17</span> | 63.31 |
+ | InfoVQA_val | <span style="color:red">43.48</span> | 38.88 |
+ | MMBench_cn_dev | <span style="color:red">74.83</span> | 72.51 |
+ | MMBench_en_dev | <span style="color:red">76.37</span> | 74.57 |
+ | MME(cognition) | <span style="color:red">432</span> | 384 |
+ | MME(perception) | <span style="color:red">1598</span> | 1512 |
+ | SeedBench | <span style="color:red">68.20</span> | 66.80 |
+ | SeedBench_img | <span style="color:red">73.75</span> | 72.72 |
+ | MMStar | <span style="color:red">50.98</span> | 48.98 |
+ | MMMU | <span style="color:red">44.30</span> | 44.20 |
+ | OCRBench | <span style="color:red">531.00</span> | 525.00 |
+ | ChartQA | <span style="color:red">67.84</span> | 66.52 |
+ | DocVQA_val | <span style="color:red">76.46</span> | 75.21 |
+ | POPE | 88.69 | <span style="color:red">88.83</span> |
+ | TextVQA_val | 61.69 | <span style="color:red">62.47</span> |



@@ -56,32 +56,32 @@ This table presents the results of linear probe evaluations comparing CLIP and M

 | Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
 |:---------------|:----------------------|:----------------------|
- | **AVG** | **87.15** | 85.35 |
- | Food101 | **96.21** | 95.90 |
- | CIFAR-10 | **99.36** | 97.90 |
- | CIFAR-100 | **93.69** | 87.40 |
- | Birdsnap | **88.18** | 79.90 |
- | SUN397 | **87.96** | 82.20 |
- | Stanford Cars | **95.16** | 91.50 |
- | FGVC Aircraft | **86.38** | 71.60 |
- | Describable Textures Dataset | **86.70** | 83.00 |
- | Oxford-IIIT Pets | **96.27** | 95.10 |
- | Caltech-101 | **97.92** | 96.00 |
- | Flowers102 | **99.58** | 99.20 |
- | MNIST | 98.67 | **99.20** |
- | STL-10 | 99.28 | **99.70** |
- | EuroSAT | **99.06** | 98.10 |
- | RESISC45 | **95.48** | 94.90 |
- | GTSRB | 92.32 | **92.40** |
- | KITTI | **75.39** | 69.20 |
- | Country211 | 38.12 | **46.40** |
- | PatchCamelyon | **88.00** | 85.60 |
- | UCF101 | **92.86** | 92.00 |
- | Kinetics-700 | **73.35** | 73.00 |
- | CLEVR | **64.40** | 60.30 |
- | Hateful Memes | 72.00 | **77.30** |
- | SST-2 | 76.33 | **80.50** |
- | ImageNet | **86.10** | 85.40 |
+ | <span style="color:red">AVG</span> | <span style="color:red">87.15</span> | 85.35 |
+ | Food101 | <span style="color:red">96.21</span> | 95.90 |
+ | CIFAR-10 | <span style="color:red">99.36</span> | 97.90 |
+ | CIFAR-100 | <span style="color:red">93.69</span> | 87.40 |
+ | Birdsnap | <span style="color:red">88.18</span> | 79.90 |
+ | SUN397 | <span style="color:red">87.96</span> | 82.20 |
+ | Stanford Cars | <span style="color:red">95.16</span> | 91.50 |
+ | FGVC Aircraft | <span style="color:red">86.38</span> | 71.60 |
+ | Describable Textures Dataset | <span style="color:red">86.70</span> | 83.00 |
+ | Oxford-IIIT Pets | <span style="color:red">96.27</span> | 95.10 |
+ | Caltech-101 | <span style="color:red">97.92</span> | 96.00 |
+ | Flowers102 | <span style="color:red">99.58</span> | 99.20 |
+ | MNIST | 98.67 | <span style="color:red">99.20</span> |
+ | STL-10 | 99.28 | <span style="color:red">99.70</span> |
+ | EuroSAT | <span style="color:red">99.06</span> | 98.10 |
+ | RESISC45 | <span style="color:red">95.48</span> | 94.90 |
+ | GTSRB | 92.32 | <span style="color:red">92.40</span> |
+ | KITTI | <span style="color:red">75.39</span> | 69.20 |
+ | Country211 | 38.12 | <span style="color:red">46.40</span> |
+ | PatchCamelyon | <span style="color:red">88.00</span> | 85.60 |
+ | UCF101 | <span style="color:red">92.86</span> | 92.00 |
+ | Kinetics-700 | <span style="color:red">73.35</span> | 73.00 |
+ | CLEVR | <span style="color:red">64.40</span> | 60.30 |
+ | Hateful Memes | 72.00 | <span style="color:red">77.30</span> |
+ | SST-2 | 76.33 | <span style="color:red">80.50</span> |
+ | ImageNet | <span style="color:red">86.10</span> | 85.40 |


 ### C. Limitations
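The first hunk's context describes the key experiment: LLaVA-NeXT's CLIP vision tower is swapped for MLCD while the language model stays Qwen2.5-7B. Since the model card is tagged `clip_vision_model`, the stock CLIP vision classes in `transformers` should be able to load the encoder; the sketch below shows what feature extraction with it could look like. The repository id and image path are placeholders assumed for illustration, not values taken from this commit.

```python
# Minimal sketch of feature extraction with the MLCD vision tower.
# MODEL_ID and the image path are placeholders (assumptions, not from this commit).
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

MODEL_ID = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # assumed repository id

# The card is tagged `clip_vision_model`, so the stock CLIP vision classes should
# load it; this also assumes the repo ships an image-processor config.
processor = CLIPImageProcessor.from_pretrained(MODEL_ID)
model = CLIPVisionModel.from_pretrained(MODEL_ID).eval()

image = Image.open("example.jpg").convert("RGB")  # any local test image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch tokens are what a LLaVA-style projector consumes; the pooled vector is
# what linear probing uses.
patch_features = outputs.last_hidden_state  # (1, num_patches + 1, hidden_size)
pooled = outputs.pooler_output              # (1, hidden_size)
print(patch_features.shape, pooled.shape)
```

In a LLaVA-NeXT-style setup, the swap would then amount to pointing the vision-tower entry of the training config at this checkpoint instead of the original CLIP one; the projector and the Qwen2.5-7B language model are left unchanged.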
 
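The second hunk covers linear-probe results. As a reminder of the protocol, linear probing freezes the vision encoder, embeds each dataset once, and fits a linear classifier on the cached features; only that classifier is trained. The sketch below illustrates the recipe with scikit-learn, reusing the `model` and `processor` loaded in the previous snippet. The dataset variables, batch size, and solver settings are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative linear-probe recipe (assumed setup, not the authors' exact pipeline):
# freeze the vision tower, embed each split once, then fit a linear classifier.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader


@torch.no_grad()
def embed(model, processor, dataset, batch_size=64, device="cuda"):
    """Return pooled embeddings and labels for a dataset yielding (PIL image, label) pairs."""
    model = model.to(device).eval()
    feats, labels = [], []
    # default_collate cannot batch PIL images, so keep them as plain lists
    loader = DataLoader(dataset, batch_size=batch_size,
                        collate_fn=lambda batch: list(zip(*batch)))
    for images, ys in loader:
        inputs = processor(images=list(images), return_tensors="pt").to(device)
        feats.append(model(**inputs).pooler_output.cpu().numpy())
        labels.extend(ys)
    return np.concatenate(feats), np.array(labels)


# train_set / test_set stand in for any torchvision-style classification dataset,
# e.g. CIFAR-10 or Food101 from the table above (hypothetical variable names).
# x_train, y_train = embed(model, processor, train_set)
# x_test, y_test = embed(model, processor, test_set)
# probe = LogisticRegression(max_iter=1000)
# probe.fit(x_train, y_train)
# print("linear-probe accuracy:", 100.0 * probe.score(x_test, y_test))
```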