
# MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

🏠 Homepage | 🤖 MAmmoTH-VL-8B | 💻 Code | 📄 arXiv | 📕 PDF | 🖥️ Demo

## Abstract

Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit chain-of-thought (CoT) reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs covering diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
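
Below is a minimal inference sketch, not taken from the original card. It assumes the checkpoint loads with transformers' LLaVA-OneVision classes (transformers >= 4.45), since MAmmoTH-VL-8B builds on that architecture, and that the Hub repo id is `MAmmoTH-VL/MAmmoTH-VL-8B`; if the weights instead ship in the original llava codebase format, follow the Code repository linked above.

```python
# Sketch only: assumes a transformers-format LLaVA-OneVision checkpoint
# and the repo id "MAmmoTH-VL/MAmmoTH-VL-8B" (unverified assumptions).
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "MAmmoTH-VL/MAmmoTH-VL-8B"  # assumed repo id
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Any RGB image works; this COCO image URL is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt with one image and a reasoning-oriented question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the image, then reason step by step: how many animals are there?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```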

## Performance

### Multi-Discipline Knowledge and Mathematical Reasoning Model Comparison

| Model | MMStar | MMMU | MMMU-Pro | SeedBench | MMBench | MMVet | MathVerse | MathVista |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 64.7 | 69.1 | 49.7 | 76.2 | 82.1 | 76.2 | 50.2 | 63.8 |
| Gemini-1.5-Pro | 59.1 | 65.8 | 44.4 | 76.0 | 73.9 | 64.0 | - | 63.9 |
| Claude-3.5-Sonnet | 62.2 | 68.3 | 48.0 | 72.2 | 79.7 | 75.4 | - | 67.7 |
| InternVL2-LLaMa3-76B | 67.1 | 58.2 | 38.0 | 77.6 | 86.5 | 64.4 | - | 65.5 |
| Qwen2-VL-72B | 68.6 | 64.5 | 37.1 | 77.9 | 86.9 | 73.9 | 37.3 | 70.5 |
| LLaVA-OV-72B (SI) | 65.2 | 57.4 | 26.0 | 77.6 | 86.6 | 60.0 | 37.7 | 66.5 |
| LLaVA-OV-72B | 66.1 | 56.8 | 24.0 | 78.0 | 85.9 | 63.7 | 39.1 | 67.5 |
| MiniCPM-V-2.6-8B | 57.5 | 49.8 | 21.7 | 74.0 | 81.5 | 60.0 | - | 60.6 |
| InternLM-XComp-2.5-8B | 59.9 | 42.9 | - | 75.4 | 74.4 | 51.7 | 20.0 | 59.6 |
| InternVL-2-8B | 59.4 | 49.3 | 25.4 | 76.0 | 81.7 | 60.0 | 27.5 | 58.3 |
| Qwen2-VL-7B | 60.7 | 52.1 | 26.9 | 74.3 | 83.0 | 62.0 | 28.2 | 58.2 |
| Cambrian-1-8B | - | 42.7 | 14.7 | 73.3 | 74.6 | 48.0 | - | 49.0 |
| Molmo-8B-D | 50.5 | 45.3 | 18.9 | 74.1 | 73.6 | 58.0 | 21.5 | 51.6 |
| LLaVA-OV-7B (SI) | 60.9 | 47.3 | 16.8 | 74.8 | 80.5 | 58.8 | 26.9 | 56.1 |
| LLaVA-OV-7B | 61.7 | 48.8 | 18.7 | 75.4 | 80.8 | 58.6 | 26.2 | 63.2 |
| MAmmoTH-VL-8B (SI) | 55.4 | 49.4 | 26.0 | 73.3 | 83.0 | 60.6 | 35.0 | 67.6 |
| MAmmoTH-VL-8B | 63.0 | 50.8 | 25.3 | 76.0 | 83.4 | 62.3 | 34.2 | 67.6 |
| Over Best Open-Source (7-8B Scale) | +1.3 | +2.0 | +7.1 | +0.6 | +2.6 | +3.7 | +8.1 | +4.4 |

## Citing the Model

BibTeX citation:

xxx