What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis
Abstract
Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear which capabilities RL actually improves compared with the supervised fine-tuning (SFT) used as cold-start initialization. End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge this gap, we propose a Frankenstein-style analysis framework comprising: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) a transferability test via model merging. We find that RL does not uniformly enhance visual perception. Instead, RL induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL's reliable contribution to visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.
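To make "update characterization via parameter comparison" concrete, here is a minimal sketch that measures the relative weight drift between an SFT cold-start checkpoint and its RL-tuned counterpart, aggregated per decoder block; under the paper's findings, drift should concentrate in mid-to-late blocks. The checkpoint ids (`org/vlm-sft`, `org/vlm-rl`) and the `.layers.N.` block-naming convention are illustrative assumptions, not the authors' exact setup.

```python
import re
from collections import defaultdict
from transformers import AutoModelForVision2Seq

# Hypothetical checkpoint ids -- substitute the actual SFT cold-start and RL-tuned models.
sft = AutoModelForVision2Seq.from_pretrained("org/vlm-sft").state_dict()
rl = AutoModelForVision2Seq.from_pretrained("org/vlm-rl").state_dict()

per_block = defaultdict(list)
for name, w_sft in sft.items():
    m = re.search(r"\.layers\.(\d+)\.", name)  # block naming is model-specific (assumed here)
    if m is None:
        continue  # skip parameters outside the indexed block stack
    w_sft = w_sft.float()
    w_rl = rl[name].float()
    # Relative update magnitude: ||W_rl - W_sft|| / ||W_sft||
    drift = ((w_rl - w_sft).norm() / (w_sft.norm() + 1e-12)).item()
    per_block[int(m.group(1))].append(drift)

for idx in sorted(per_block):
    vals = per_block[idx]
    print(f"block {idx:2d}: mean relative drift = {sum(vals) / len(vals):.4f}")
```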
Community
Reinforcement learning (RL) has become a common post-training stage for improving visual reasoning in multimodal models, but what exactly does RL improve internally?
This paper introduces a Frankenstein-style causal analysis framework to dissect the role of RL in vision-language models. Instead of relying solely on end-to-end benchmark gains, the authors perform structured model merging across early, middle, and late transformer blocks to localize where RL induces functional changes.
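As a rough illustration of the block-level merging idea, the sketch below grafts a contiguous range of decoder blocks from the RL checkpoint into the SFT model, producing a "Frankenstein" model that can then be evaluated to see which region carries the RL gains. The checkpoint ids, block ranges, and `.layers.N.` naming are assumptions for illustration; see the linked repository for the authors' actual implementation.

```python
import re
from transformers import AutoModelForVision2Seq

# Hypothetical checkpoint ids -- substitute SFT and RL checkpoints of the same architecture.
base = AutoModelForVision2Seq.from_pretrained("org/vlm-sft")
donor_sd = AutoModelForVision2Seq.from_pretrained("org/vlm-rl").state_dict()

def graft_blocks(model, donor_sd, lo, hi):
    """Copy decoder blocks with index in [lo, hi) from the donor into `model` in place."""
    merged = model.state_dict()
    for name in merged:
        m = re.search(r"\.layers\.(\d+)\.", name)  # block naming is model-specific (assumed)
        if m is not None and lo <= int(m.group(1)) < hi:
            merged[name] = donor_sd[name].clone()
    model.load_state_dict(merged)
    return model

# Example split for a 28-block decoder: early [0, 9), middle [9, 19), late [19, 28).
frankenstein = graft_blocks(base, donor_sd, 19, 28)  # graft only the late RL blocks
frankenstein.save_pretrained("vlm-sft-with-rl-late-blocks")  # then evaluate on the benchmarks
```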
Check our code: https://github.com/tianyi-lab/Frankenstein
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning (2026)
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs (2026)
- Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy (2026)
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model (2026)
- CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation (2026)
- Video-KTR: Reinforcing Video Reasoning via Key Token Attribution (2026)
- SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass (2026)
