
# MAmmoTH-VL-8B

🏠 Homepage | 🤖 MAmmoTH-VL-8B | 💻 Code | 📄 Arxiv | 📕 PDF | 🖥️ Demo

## Abstract

Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit chain-of-thought (CoT) reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs covering diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
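
## Usage

The snippet below is a minimal quick-start sketch, not an official recipe. It assumes the checkpoint loads through the standard Hugging Face LLaVA-OneVision classes (`AutoProcessor`, `LlavaOnevisionForConditionalGeneration`) under the repo id `MAmmoTH-VL/MAmmoTH-VL-8B`; see the Code link above for the officially supported loading path.

```python
# Minimal inference sketch. Assumptions (not confirmed by this card):
# the checkpoint loads with the standard LLaVA-OneVision classes and
# the repo id is "MAmmoTH-VL/MAmmoTH-VL-8B".
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "MAmmoTH-VL/MAmmoTH-VL-8B"  # assumed repo id

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Build a single-image chat prompt with the processor's chat template.
conversation = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Solve the problem in the image step by step."},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("problem.png")  # placeholder image path
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

generated = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(generated[0], skip_special_tokens=True))
```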

## Performance

### Multi-Discipline Knowledge and Mathematical Reasoning

| Model | MMStar | MMMU | MMMU-Pro | SeedBench | MMBench | MMVet | MathVerse | MathVista |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 64.7 | 69.1 | 49.7 | 76.2 | 82.1 | 76.2 | 50.2 | 63.8 |
| Gemini-1.5-Pro | 59.1 | 65.8 | 44.4 | 76.0 | 73.9 | 64.0 | - | 63.9 |
| Claude-3.5-Sonnet | 62.2 | 68.3 | 48.0 | 72.2 | 79.7 | 75.4 | - | 67.7 |
| InternVL2-LLaMa3-76B | 67.1 | 58.2 | 38.0 | 77.6 | 86.5 | 64.4 | - | 65.5 |
| Qwen2-VL-72B-Ins | 68.6 | 64.5 | 37.1 | 77.9 | 86.9 | 73.9 | 37.3 | 70.5 |
| LLaVA-OV-72B (SI) | 65.2 | 57.4 | 26.0 | 77.6 | 86.6 | 60.0 | 37.7 | 66.5 |
| LLaVA-OV-72B | 66.1 | 56.8 | 24.0 | 78.0 | 85.9 | 63.7 | 39.1 | 67.5 |
| MiniCPM-V-2.6-8B | 57.5 | 49.8 | 21.7 | 74.0 | 81.5 | 60.0 | - | 60.6 |
| InternLM-XComp-2.5-8B | 59.9 | 42.9 | - | 75.4 | 74.4 | 51.7 | 20.0 | 59.6 |
| Llama-3.2-11B-Vision-Ins | 49.8 | 50.7 | 23.7 | 72.7 | 73.2 | 57.6 | 23.6 | 51.5 |
| InternVL-2-8B | 59.4 | 49.3 | 25.4 | 76.0 | 81.7 | 60.0 | 27.5 | 58.3 |
| Qwen2-VL-7B-Ins | 60.7 | 52.1 | 26.9 | 74.3 | 83.0 | 62.0 | 28.2 | 58.2 |
| Cambrian-1-8B | - | 42.7 | 14.7 | 73.3 | 74.6 | 48.0 | - | 49.0 |
| Llava-CoT-11B | 57.6 | 48.9 | 18.5 | 75.2 | 75.0 | 60.3 | 24.2 | 54.8 |
| Molmo-8B-D | 50.5 | 45.3 | 18.9 | 74.1 | 73.6 | 58.0 | 21.5 | 51.6 |
| LLaVA-OV-7B (SI) | 60.9 | 47.3 | 16.8 | 74.8 | 80.5 | 58.8 | 26.9 | 56.1 |
| LLaVA-OV-7B | 61.7 | 48.8 | 18.7 | 75.4 | 80.8 | 58.6 | 26.2 | 63.2 |
| MAmmoTH-VL-8B (SI) | 55.4 | 49.4 | 26.0 | 73.3 | 83.0 | 60.6 | 35.0 | 67.6 |
| MAmmoTH-VL-8B | 63.0 | 50.8 | 25.3 | 76.0 | 83.4 | 62.3 | 34.2 | 67.6 |
| Over Best Open-Source (~10B Scale) | +1.3 | +1.9 | +7.1 | +0.6 | +2.6 | +2.0 | +8.1 | +4.4 |
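
The "Over Best Open-Source (~10B Scale)" rows report the column-wise gain of the better MAmmoTH-VL-8B variant over the strongest open-source baseline at roughly the 10B scale. The sketch below is a hypothetical reconstruction of that arithmetic; exactly which baselines the paper pools for each column is an assumption here, so recomputed deltas may not match the published row everywhere.

```python
# Hypothetical reconstruction of the "Over Best Open-Source (~10B Scale)" row:
# best MAmmoTH-VL-8B variant minus best ~10B open-source baseline, per column.
def improvement_row(ours: dict[str, list[float]],
                    baselines: dict[str, list[float]]) -> dict[str, float]:
    """Column-wise delta: max over our variants minus max over baselines."""
    deltas = {}
    for bench, our_scores in ours.items():
        base_scores = baselines.get(bench, [])
        if not base_scores:
            continue  # baseline reported as "-" for this benchmark
        deltas[bench] = round(max(our_scores) - max(base_scores), 1)
    return deltas

# Worked example for MathVista, with scores taken from the table above.
ours = {"MathVista": [67.6, 67.6]}                   # (SI) and default variants
baselines = {"MathVista": [58.3, 58.2, 63.2, 56.1]}  # a subset of the ~10B baselines
print(improvement_row(ours, baselines))              # {'MathVista': 4.4}
```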

### Chart & Doc Understanding and Multimodal Interactions & Preferences

| Model | AI2D (test) | ChartQA (test) | InfoVQA (test) | DocVQA (test) | RealWorldQA (test) | WildVision (0617) | LLaVA-Wilder (small) |
|---|---|---|---|---|---|---|---|
| GPT-4o | 94.2 | 85.7 | 79.2 | 92.8 | 76.5 | 89.4 | 85.9 |
| Gemini-1.5-Pro | 94.4 | 87.2 | 81.0 | 93.1 | 70.4 | - | - |
| Claude-3.5-Sonnet | 94.7 | 90.8 | 49.7 | 95.2 | 59.9 | 50.0 | 83.1 |
| InternVL2-LLaMa3-76B | 88.4 | 88.4 | 82.0 | 94.1 | 72.7 | - | - |
| Qwen2-VL-72B-Ins | 88.1 | 88.3 | 84.5 | 96.5 | 77.8 | - | - |
| LLaVA-OV-72B (SI) | 85.1 | 84.9 | 74.6 | 91.8 | 73.8 | 49.5 | 72.9 |
| LLaVA-OV-72B | 85.6 | 83.7 | 74.9 | 91.3 | 71.9 | 52.3 | 72.0 |
| MiniCPM-V-2.6-7B | 82.1 | 82.4 | - | 90.8 | 65.0 | 11.7 | - |
| InternLM-XComp-2.5-7B | 81.5 | 82.2 | 70.0 | 90.9 | 67.8 | - | 61.4 |
| Llama-3.2-11B-Vision-Ins | 77.3 | 83.4 | 65.0 | 88.4 | 63.3 | 49.7 | 62.0 |
| InternVL-2-8B | 83.8 | 83.3 | 74.8 | 91.6 | 64.4 | 51.5 | 62.5 |
| Qwen2-VL-7B-Ins | 83.0 | 83.0 | 76.5 | 94.5 | 70.1 | 44.0 | 66.3 |
| Cambrian-1-8B | 73.3 | 73.3 | 41.6 | 77.8 | 64.2 | - | - |
| Llava-CoT-11B | - | 67.0 | 44.8 | - | - | - | 65.3 |
| Molmo-7B-D | 81.0 | 84.1 | 72.6 | 92.2 | 70.7 | 40.0 | - |
| LLaVA-OV-7B (SI) | 81.6 | 78.8 | 65.3 | 86.9 | 65.5 | 39.2 | 69.1 |
| LLaVA-OV-7B | 81.4 | 80.0 | 68.8 | 87.5 | 66.3 | 53.8 | 67.8 |
| MAmmoTH-VL-8B (SI) | 83.4 | 85.9 | 74.8 | 93.8 | 71.3 | 51.9 | 71.3 |
| MAmmoTH-VL-8B | 84.0 | 86.2 | 73.1 | 93.7 | 69.9 | 51.1 | 70.8 |
| Over Best Open-Source (~10B Scale) | +2.4 | +2.1 | +2.2 | +1.6 | +0.6 | -1.9 | +2.2 |

### Multi-Image and Video

| Model | MuirBench (test) | MEGABench (test) | EgoSchema (test) | PerceptionTest (test) | SeedBench (video) | MLVU (dev) | MVBench (test) | VideoMME (w/o subs) |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 68.0 | 54.2 | - | - | - | 64.6 | - | 71.9 |
| GPT-4V | 62.3 | - | - | - | 60.5 | 49.2 | 43.5 | 59.9 |
| LLaVA-OV-72B (SI) | 33.2 | - | 58.6 | 62.3 | 60.9 | 60.9 | 57.1 | 64.8 |
| LLaVA-OV-72B | 54.8 | 33.8 | 62.0 | 66.9 | 62.1 | 66.4 | 59.4 | 66.2 |
| InternVL-2-8B | 59.4 | 27.7 | 54.2 | 57.4 | 54.9 | 30.2 | 66.4 | 54.0 |
| Qwen2-VL-7B-Ins | 41.6 | 36.0 | 66.7 | 62.3 | 55.3 | 58.6 | 67.0 | 63.3 |
| LLaVA-OV-7B (SI) | 32.7 | 22.1 | 52.9 | 54.9 | 51.1 | 60.2 | 51.2 | 55.0 |
| LLaVA-OV-7B | 41.8 | 23.9 | 60.1 | 57.1 | 56.9 | 64.7 | 56.7 | 58.2 |
| MAmmoTH-VL-8B | 55.1 | 28.2 | 58.5 | 59.3 | 57.1 | 64.7 | 59.1 | 58.8 |
| Over Best Open-Source (~10B Scale) | +13.3 | +4.3 | -1.6 | +2.2 | +0.2 | +0.0 | +2.4 | +0.6 |
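
For the multi-image settings above, here is a hedged inference sketch under the same assumptions as the quick-start example (LLaVA-OneVision-style classes, assumed repo id); the image paths and the question are placeholders.

```python
# Multi-image inference sketch, under the same assumptions as the
# quick-start example above (LLaVA-OneVision classes, assumed repo id).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "MAmmoTH-VL/MAmmoTH-VL-8B"  # assumed repo id

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# One {"type": "image"} entry per input image in the chat template.
conversation = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "What changed between these two charts? Reason step by step."},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

images = [Image.open("before.png"), Image.open("after.png")]  # placeholder paths
inputs = processor(images=images, text=prompt, return_tensors="pt").to(model.device, torch.float16)

generated = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(generated[0], skip_special_tokens=True))
```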

## Citing the Model

BibTeX Citation:

xxx