Model
We use the same Vision Transformer architecture as CLIP: ViT-L/14 at an input resolution of 336px (ViT-L/14@336px).
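For reference, a minimal sketch of the standard ViT-L/14@336px vision-tower hyperparameters, expressed with the Hugging Face `CLIPVisionConfig`. The values below are the public CLIP ViT-L/14 settings; whether MLCD's released config uses this exact class is an assumption.

```python
from transformers import CLIPVisionConfig

# Standard CLIP ViT-L/14 vision-tower hyperparameters at 336px input:
# 24 transformer layers, 1024-dim hidden states, 16 attention heads, 14x14 patches.
config = CLIPVisionConfig(
    hidden_size=1024,
    intermediate_size=4096,
    num_hidden_layers=24,
    num_attention_heads=16,
    patch_size=14,
    image_size=336,
)

# A 336px image yields (336 / 14)^2 = 576 patch tokens plus one class token.
num_tokens = (config.image_size // config.patch_size) ** 2 + 1  # 577
```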
Data
Our model was trained on publicly available image-caption data from the LAION-400M and COYO-700M datasets.
Performance and Limitations
A. MLLMs Evaluation Results
To evaluate MLCD inside a Multimodal Large Language Model (MLLM), we replaced the CLIP vision tower in LLaVA-NeXT with MLCD, using Qwen2.5-7B as the language model. The modified model outperforms the CLIP baseline on most benchmarks, supporting the effectiveness of MLCD as a vision encoder for MLLMs; a loading sketch follows the table below.
| Vision Tower     | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
| ---------------- | --------------------- | --------------------- |
| LLM              | Qwen2.5-7B            | Qwen2.5-7B            |
| AI2D             | 76.98                 | 73.15                 |
| ScienceQA_img    | 78.09                 | 76.35                 |
| GQA              | 64.17                 | 63.31                 |
| InfoVQA_val      | 43.48                 | 38.88                 |
| MMBench_cn_dev   | 74.83                 | 72.51                 |
| MMBench_en_dev   | 76.37                 | 74.57                 |
| MME (cognition)  | 432                   | 384                   |
| MME (perception) | 1598                  | 1512                  |
| SeedBench        | 68.20                 | 66.80                 |
| SeedBench_img    | 73.75                 | 72.72                 |
| MMStar           | 50.98                 | 48.98                 |
| MMMU             | 44.30                 | 44.20                 |
| OCRBench         | 531.00                | 525.00                |
| ChartQA          | 67.84                 | 66.52                 |
| DocVQA_val       | 76.46                 | 75.21                 |
| POPE             | 88.69                 | 88.83                 |
| TextVQA_val      | 61.69                 | 62.47                 |
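A minimal sketch of using MLCD as a drop-in vision tower. It assumes the checkpoint is published in a CLIP-compatible layout loadable with the standard `CLIPVisionModel` classes; the model ID shown is illustrative, so substitute the released MLCD checkpoint ID.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Illustrative checkpoint ID; substitute the released MLCD weights.
MODEL_ID = "DeepGlint-AI/mlcd-vit-large-patch14-336"

# Assumes the MLCD weights use a CLIP-compatible layout, so the
# standard CLIP vision classes can load them unchanged.
processor = CLIPImageProcessor.from_pretrained(MODEL_ID)
vision_tower = CLIPVisionModel.from_pretrained(MODEL_ID)
vision_tower.eval()

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_tower(**inputs, output_hidden_states=True)

# LLaVA-style pipelines typically take the penultimate layer's patch tokens
# (class token dropped) as input to the multimodal projector ahead of the LLM.
patch_features = outputs.hidden_states[-2][:, 1:, :]  # shape (1, 576, 1024)
```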
B. Linear Probe Evaluation Results
This table compares linear probe results for the MLCD and CLIP models (both ViT_L_14_336px) across a range of datasets. A linear probe evaluation freezes the pre-trained model's weights and trains only a linear classifier on top of its features, measuring how well the learned representations generalize to downstream tasks; a sketch of the protocol follows the table.
| Dataset                      | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
| ---------------------------- | --------------------- | --------------------- |
| Average                      | 87.15                 | 85.35                 |
| Food101                      | 96.21                 | 95.90                 |
| CIFAR-10                     | 99.36                 | 97.90                 |
| CIFAR-100                    | 93.69                 | 87.40                 |
| Birdsnap                     | 88.18                 | 79.90                 |
| SUN397                       | 87.96                 | 82.20                 |
| Stanford Cars                | 95.16                 | 91.50                 |
| FGVC Aircraft                | 86.38                 | 71.60                 |
| Describable Textures Dataset | 86.70                 | 83.00                 |
| Oxford-IIIT Pets             | 96.27                 | 95.10                 |
| Caltech-101                  | 97.92                 | 96.00                 |
| Flowers102                   | 99.58                 | 99.20                 |
| MNIST                        | 98.67                 | 99.20                 |
| STL-10                       | 99.28                 | 99.70                 |
| EuroSAT                      | 99.06                 | 98.10                 |
| RESISC45                     | 95.48                 | 94.90                 |
| GTSRB                        | 92.32                 | 92.40                 |
| KITTI                        | 75.39                 | 69.20                 |
| Country211                   | 38.12                 | 46.40                 |
| PatchCamelyon                | 88.00                 | 85.60                 |
| UCF101                       | 92.86                 | 92.00                 |
| Kinetics-700                 | 73.35                 | 73.00                 |
| CLEVR                        | 64.40                 | 60.30                 |
| Hateful Memes                | 72.00                 | 77.30                 |
| SST-2                        | 76.33                 | 80.50                 |
| ImageNet                     | 86.10                 | 85.40                 |
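A minimal sketch of the linear probe protocol, assuming the `processor` and `vision_tower` loaded in the earlier sketch. The dataset handles (`train_images`, `train_labels`, `test_images`, `test_labels`) are placeholders for a downstream dataset such as CIFAR-10, and the regularization strength follows the CLIP-style logistic regression probe.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression


def extract_features(images, processor, vision_tower):
    """Embed images with the frozen encoder; the pooled output feeds the probe."""
    feats = []
    with torch.no_grad():
        for image in images:
            inputs = processor(images=image, return_tensors="pt")
            outputs = vision_tower(**inputs)
            feats.append(outputs.pooler_output.squeeze(0).numpy())
    return np.stack(feats)


# The backbone stays frozen: only the linear classifier below is trained.
train_feats = extract_features(train_images, processor, vision_tower)
test_feats = extract_features(test_images, processor, vision_tower)

# L2-regularized logistic regression is the standard CLIP-style linear probe;
# in practice C is swept per dataset rather than fixed.
clf = LogisticRegression(C=0.316, max_iter=1000)
clf.fit(train_feats, train_labels)
accuracy = clf.score(test_feats, test_labels)
```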
C. Limitations
Models with higher input resolution generally perform better on OCR-related tasks. We are currently training such models and will release them soon.
Acknowledgments
We would like to express our gratitude to Xie Yin and Yumeng Wang for their significant contributions to the experimental validation of MLCD in MLLMs.