MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Abstract

Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit chain-of-thought (CoT) reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs that cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.

Performance

Multi-Discipline Knowledge and Mathematical Reasoning Model Comparison

Model	MMStar	MMMU	MMMU-Pro	SeedBench	MMBench	MMVet	MathVerse	MathVista
GPT-4o	64.7	69.1	49.7	76.2	82.1	76.2	50.2	63.8
Gemini-1.5-Pro	59.1	65.8	44.4	76.0	73.9	64.0	-	63.9
Claude-3.5-Sonnet	62.2	68.3	48.0	72.2	79.7	75.4	-	67.7
InternVL2-LLaMa3-76B	67.1	58.2	38.0	77.6	86.5	64.4	-	65.5
Qwen2-VL-72B	68.6	64.5	37.1	77.9	86.9	73.9	37.3	70.5
LLaVA-OV-72B (SI)	65.2	57.4	26.0	77.6	86.6	60.0	37.7	66.5
LLaVA-OV-72B	66.1	56.8	24.0	78.0	85.9	63.7	39.1	67.5
MiniCPM-V-2.6-8B	57.5	49.8	21.7	74.0	81.5	60.0	-	60.6
InternLM-XComp-2.5-8B	59.9	42.9	-	75.4	74.4	51.7	20.0	59.6
InternVL-2-8B	59.4	49.3	25.4	76.0	81.7	60.0	27.5	58.3
Qwen2-VL-7B	60.7	52.1	26.9	74.3	83.0	62.0	28.2	58.2
Cambrian-1-8B	-	42.7	14.7	73.3	74.6	48.0	-	49.0
Molmo-8B-D	50.5	45.3	18.9	74.1	73.6	58.0	21.5	51.6
LLaVA-OV-7B (SI)	60.9	47.3	16.8	74.8	80.5	58.8	26.9	56.1
LLaVA-OV-7B	61.7	48.8	18.7	75.4	80.8	58.6	26.2	63.2
MAmmoTH-VL-8B (SI)	55.4	49.4	26.0	73.3	83.0	60.6	35.0	67.6
MAmmoTH-VL-8B	63.0	50.8	25.3	76.0	83.4	62.3	34.2	67.6
Over Best Open-Source (7-8B Scale)	+1.3	+2.0	+7.1	+0.6	+2.6	+3.7	+8.1	+4.4
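
The margins in the final row can be read as a simple difference of scores. The following is a minimal sketch of one such reading, not a method stated in the source: it assumes the baseline set is Cambrian-1-8B, Molmo-8B-D, and the two LLaVA-OV-7B rows, and that the better of this work's two 8B rows is used for each benchmark. Under those assumptions it reproduces the reported margins for MMMU-Pro and MathVerse; the names `baselines`, `ours`, and `margin` are illustrative only.

```python
# Scores copied from the table above. Which rows count as baselines is an
# assumption made for illustration; the source does not spell it out.
baselines = {
    "Cambrian-1-8B": {"MMMU-Pro": 14.7, "MathVerse": None},  # "-" in the table
    "Molmo-8B-D": {"MMMU-Pro": 18.9, "MathVerse": 21.5},
    "LLaVA-OV-7B (SI)": {"MMMU-Pro": 16.8, "MathVerse": 26.9},
    "LLaVA-OV-7B": {"MMMU-Pro": 18.7, "MathVerse": 26.2},
}

# The two 8B rows for this work's model: (SI) first, then the final model.
ours = {"MMMU-Pro": [26.0, 25.3], "MathVerse": [35.0, 34.2]}


def margin(benchmark):
    """Best score of this work minus the best available baseline score."""
    best_baseline = max(
        scores[benchmark]
        for scores in baselines.values()
        if scores[benchmark] is not None  # skip missing entries
    )
    # Round to one decimal to match the table's precision.
    return round(max(ours[benchmark]) - best_baseline, 1)


margins = {benchmark: margin(benchmark) for benchmark in ours}
print(margins)
```

These values are absolute differences in benchmark score (points), computed only for the two columns shown; other columns may follow a different baseline choice.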

Citing the Model

BibTeX Citation:

xxx