GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning
Abstract
A two-stage reinforcement learning framework improves geometric reasoning and problem-solving in multimodal language models by first enhancing visual perception.
Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet their impact on multimodal LLMs (MLLMs) remains limited. In vision-intensive tasks such as geometric reasoning in particular, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to a perceptual bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify it, we design the Geo-Perception Question-Answering (GeoPQA) benchmark, which targets basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain the RL reward signals needed for effective training. To address this bottleneck, we propose a two-stage RL training framework that first enhances visual perception of geometric structures and then fosters reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1% compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains such as figure understanding, highlighting the importance of perceptual grounding for effective MLLM reasoning.
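The abstract covers both the benchmark and the training recipe. As a purely illustrative reading of the benchmark side, the Python sketch below scores a GeoPQA-style perception question as multiple-choice exact match; the `PerceptionItem` fields and the `query_mllm(image_path, prompt) -> str` wrapper are assumptions for illustration, not the authors' released evaluation code.

```python
# Hypothetical sketch (not from the paper): scoring a GeoPQA-style perception
# item as multiple-choice exact match. `PerceptionItem`, `score_item`, and the
# `query_mllm(image_path, prompt) -> str` wrapper are all assumed names.
from dataclasses import dataclass

@dataclass
class PerceptionItem:
    image_path: str     # geometric diagram the question refers to
    question: str       # e.g. "How many triangles share vertex A?"
    choices: list[str]  # answer options shown to the model
    answer: str         # gold choice label, e.g. "B"

def score_item(item: PerceptionItem, query_mllm) -> float:
    """Return 1.0 if the model picks the gold option, else 0.0."""
    prompt = (
        item.question + "\n"
        + "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(item.choices))
        + "\nAnswer with the letter of the correct option only."
    )
    prediction = query_mllm(item.image_path, prompt).strip().upper()
    return 1.0 if prediction[:1] == item.answer.strip().upper() else 0.0
```

Perception items of this kind have exactly verifiable answers, which is what makes them usable both as a diagnostic benchmark and as a source of reward signals in the first training stage.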
Community
MLLMs often fail at basic geometry. Why? They struggle to accurately see the diagrams in the first place.
The Problem:
Reasoning training for MLLMs is capped by a perceptual bottleneck. Models can't solve geometry problems if they can't correctly perceive basic shapes, angles, and spatial relationships in an image.
Our Solution:
We introduce GeoPQA, a benchmark that measures this gap, and propose a two-stage training framework (a rough code sketch follows the two steps below):
1. Perception First: We train the MLLM to accurately identify geometric structures.
2. Reasoning Second: With a solid visual foundation, we then train it on complex, multi-step reasoning.
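To make the two stages concrete, here is a minimal, hedged Python sketch of the training schedule: stage 1 rewards correct answers to perception questions about geometric structure, and stage 2 rewards only the final answer of a multi-step solution. The `rl_update` callback, `model.generate` interface, and the naive answer extraction are placeholders for whichever RL algorithm and model stack are actually used (the paper builds on Qwen2.5-VL-3B-Instruct); none of these names come from the paper.

```python
# Minimal sketch of the two-stage idea, not the authors' released code.
# `rl_update` stands in for whatever policy-optimization step is used
# (e.g. a GRPO/PPO-style update); `model.generate` is an assumed interface.

def extract_final_answer(response: str) -> str:
    # Naive placeholder: treat the last non-empty line as the final answer.
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def perception_reward(response: str, gold: str) -> float:
    # Stage 1: reward accurate answers about geometric structure (shapes,
    # angles, spatial relations), which are cheap to verify exactly.
    return 1.0 if response.strip() == gold.strip() else 0.0

def reasoning_reward(response: str, gold_answer: str) -> float:
    # Stage 2: reward only the final answer of a multi-step solution.
    return 1.0 if extract_final_answer(response) == gold_answer.strip() else 0.0

def two_stage_training(model, perception_data, reasoning_data, rl_update,
                       stage1_steps=1000, stage2_steps=1000):
    # Stage 1: ground visual perception before any reasoning training.
    for _, (prompt, gold) in zip(range(stage1_steps), perception_data):
        response = model.generate(prompt)
        rl_update(model, prompt, response, perception_reward(response, gold))
    # Stage 2: train multi-step reasoning on top of the perception-tuned model.
    for _, (prompt, gold) in zip(range(stage2_steps), reasoning_data):
        response = model.generate(prompt)
        rl_update(model, prompt, response, reasoning_reward(response, gold))
    return model
```

The point of the sketch is the ordering rather than the specific reward functions: the same RL machinery is reused in both stages, but the perception stage supplies reliably verifiable reward signals before the harder reasoning rewards are introduced.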
The results are exciting: our two-stage approach boosts geometric problem-solving accuracy by 9.1% over direct reasoning training.
Our work highlights a key principle: for MLLMs to truly reason, they must first learn to see.
This is an automated message from Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- Self-Rewarding Vision-Language Model via Reasoning Decomposition (2025)
- Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models (2025)
- BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning (2025)
- Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions (2025)
- Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback (2025)
- SIFThinker: Spatially-Aware Image Focus for Visual Reasoning (2025)
- MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models (2025)