## Model

We used MLCD (Multi-label Cluster Discrimination) as the vision encoder in LLaVA-NeXT.
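For quick single-image inference, a minimal sketch is given below. It assumes the released checkpoint can be loaded through the Hugging Face `transformers` LLaVA-NeXT classes and that it ships a chat template; if the weights are only available in the original LLaVA codebase format, run inference through that codebase instead (as in the evaluation command further down). The image path is a placeholder.

```python
# Minimal inference sketch (assumption: the checkpoint is compatible with the
# transformers LLaVA-NeXT classes and includes a chat template; otherwise use
# the original LLaVA codebase, as the evaluation command below does).
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "DeepGlint-AI/llava-mlcd-qwen2.5-7b"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder image path
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```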
## Data

Our model was trained on publicly available data from the LLaVA-Pretrain and LLaVA-NeXT-Data datasets.
## How to Evaluate
```bash
pip install lmms-eval==0.2.0

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m accelerate.commands.launch \
    --main_process_port=12581 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/llava-mlcd-qwen2.5-7b,conv_template=qwen_1_5 \
    --tasks mmbench,mme,mmmu,ocrbench,scienceqa,scienceqa_img,seedbench,gqa,pope,textvqa_val,ai2d,chartqa,docvqa_val,infovqa_val,mmstar \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd_llava_qwen2_7b \
    --output_path ./log
```
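The run writes per-sample logs and aggregate scores under `./log`. The exact file layout depends on the lmms-eval version, so the helper below is only a sketch: it walks the output directory and prints any JSON file containing a top-level `results` key (an assumption about the output format, not a documented interface).

```python
# Hypothetical helper: summarize lmms-eval result files written under ./log.
# The exact directory layout depends on the lmms-eval version; this simply
# walks the tree and prints JSON files that contain a "results" key.
import json
from pathlib import Path

for path in sorted(Path("./log").rglob("*.json")):
    try:
        data = json.loads(path.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        continue  # skip per-sample logs or non-JSON artifacts
    if isinstance(data, dict) and "results" in data:
        print(f"== {path} ==")
        for task, metrics in data["results"].items():
            print(f"{task}: {metrics}")
```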
## Performance and Limitations

In our experiments, we replaced the CLIP vision encoder in LLaVA-NeXT with the MLCD model to demonstrate its performance in Multimodal Large Language Models (MLLMs), using Qwen2.5-7B as the language model. The evaluation results, summarized in the table below, show that the MLCD-based model outperforms its CLIP counterpart on most benchmarks, validating the effectiveness of MLCD within MLLMs.
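Because both vision towers are ViT-L/14 models at 336 px input resolution, the swap is a drop-in replacement at the architecture level: either tower turns each 336×336 crop into 576 patch features of dimension 1024, which the same LLaVA-style MLP projector maps into the Qwen2.5-7B embedding space (hidden size 3584). The sketch below illustrates this interface; the class and dimensions come from the standard ViT-L/14 and Qwen2.5-7B configurations, not from the authors' training code.

```python
# Illustrative sketch (not the authors' training code) of why the vision-tower
# swap is drop-in: MLCD and CLIP ViT-L/14-336 both emit 576 patch tokens of
# dimension 1024 per crop, projected into the Qwen2.5-7B embedding space (3584).
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """LLaVA-style two-layer MLP connecting the vision tower to the LLM."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 3584):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, 576, 1024) from either vision tower
        return self.net(patch_features)  # (batch, 576, 3584) visual tokens for the LLM

# Example: project dummy features from a ViT-L/14-336 crop.
projector = MLPProjector()
visual_tokens = projector(torch.randn(1, 576, 1024))
print(visual_tokens.shape)  # torch.Size([1, 576, 3584])
```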
| Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
| --- | --- | --- |
| LLM | Qwen2.5-7B | Qwen2.5-7B |
| AI2D | 76.98 | 73.15 |
| ScienceQA_img | 78.09 | 76.35 |
| GQA | 64.17 | 63.31 |
| InfoVQA_val | 43.48 | 38.88 |
| MMBench_cn_dev | 74.83 | 72.51 |
| MMBench_en_dev | 76.37 | 74.57 |
| MME (cognition) | 432 | 384 |
| MME (perception) | 1598 | 1512 |
| SeedBench | 68.20 | 66.80 |
| SeedBench_img | 73.75 | 72.72 |
| MMStar | 50.98 | 48.98 |
| MMMU | 44.30 | 44.20 |
| OCRBench | 531.00 | 525.00 |
| ChartQA | 67.84 | 66.52 |
| DocVQA_val | 76.46 | 75.21 |
| POPE | 88.69 | 88.83 |
| TextVQA_val | 61.69 | 62.47 |
### Limitations

Models trained on larger datasets are expected to perform better on a broader range of tasks. We are currently training such models and will release them soon.
## Acknowledgments
We would like to express our gratitude to Yumeng Wang for his significant contributions to the experimental validation in MLLMs.