On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
Abstract
Reinforcement learning with verifiable rewards improves language model reasoning through the direction of its updates, captured as token-level log-probability differences rather than update magnitudes, enabling test-time extrapolation and training-time reweighting methods.
Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference Δlog p between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that Δlog p more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned Δlog p direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability (corresponding to higher Δlog p) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
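A minimal sketch of the two core quantities described above, assuming a logit-space form of the extrapolation: the per-token signed difference Δlog p between an RLVR model and its base, and a decoding step that amplifies the policy along that direction. The model names, the coefficient `alpha`, and the exact combination rule are illustrative assumptions, not the paper's verified recipe.

```python
# Hedged sketch of (1) per-token Delta log p and (2) test-time
# extrapolation along the RLVR update direction. Model names are
# placeholders; the paper's precise formulation may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "base-model-name"   # hypothetical checkpoint identifiers
RLVR = "rlvr-model-name"

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
rlvr = AutoModelForCausalLM.from_pretrained(RLVR).eval()

@torch.no_grad()
def delta_log_p(text: str) -> torch.Tensor:
    """Per-token Delta log p = log p_rlvr(x_t | x_<t) - log p_base(x_t | x_<t)."""
    ids = tok(text, return_tensors="pt").input_ids
    per_model = []
    for model in (rlvr, base):
        logits = model(ids).logits[:, :-1]        # positions predicting tokens 1..T
        logprobs = logits.log_softmax(-1)
        targets = ids[:, 1:].unsqueeze(-1)
        per_model.append(logprobs.gather(-1, targets).squeeze(-1))
    # Positive values mark tokens the RLVR model up-weighted relative to the base.
    return (per_model[0] - per_model[1]).squeeze(0)

@torch.no_grad()
def extrapolated_next_token(ids: torch.Tensor, alpha: float = 0.5) -> int:
    """Greedy step from logits extrapolated along the update direction:
    z = z_rlvr + alpha * (z_rlvr - z_base). alpha is an assumed knob;
    alpha = 0 recovers the plain RLVR model."""
    z_rlvr = rlvr(ids).logits[:, -1]
    z_base = base(ids).logits[:, -1]
    z = z_rlvr + alpha * (z_rlvr - z_base)
    return int(z.argmax(-1))
```

Under this sign convention, tokens with large positive Δlog p are the ones RLVR promoted; the training-time reweighting idea would correspondingly up-weight low-probability tokens during learning.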
Community
Kimi k2 on whether this will reduce hallucinations:
This work is about boosting structured reasoning performance. While it may indirectly help in math-heavy contexts where answers are verifiable, it does not improve factual reliability in general. In fact, without proper safeguards, it could exacerbate hallucinations in open-ended or knowledge-intensive tasks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards (2026)
- Rewards as Labels: Revisiting RLVR from a Classification Perspective (2026)
- STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens (2026)
- Reinforcement-aware Knowledge Distillation for LLM Reasoning (2026)
- Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification (2026)
- InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning (2026)
- Reinforcement Learning via Self-Distillation (2026)