Abstract
rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning, achieves state-of-the-art performance, handling complex problem-solving with advanced cognitive behaviors while using minimal computational resources.
We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled by three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noise from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multiple RL stages, yielding advanced cognitive abilities at minimal compute cost. As a result, rStar2-Agent boosts a pre-trained 14B model to the state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.
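Below is a minimal sketch of the Resample-on-Correct (RoC) idea described in the abstract, under stated assumptions: it assumes an oversampled batch of rollouts is down-selected so that incorrect rollouts are kept as negative signal while correct rollouts with the least coding-tool noise are preferred. The `Rollout` fields, the `tool_errors` signal, and the `resample_on_correct` helper are illustrative placeholders, not the released GRPO-RoC implementation.

```python
# Hedged sketch of a Resample-on-Correct (RoC) style rollout filter for GRPO.
# The abstract only states that RoC resamples rollouts to reduce environment
# noise from coding tools; the concrete criteria below are assumptions.
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    tokens: list[int]    # generated token ids (placeholder)
    is_correct: bool     # final answer matched the reference
    tool_errors: int     # count of failed/noisy code-tool calls (assumed signal)


def resample_on_correct(oversampled: list[Rollout], group_size: int) -> list[Rollout]:
    """Select a GRPO group from an oversampled batch of rollouts."""
    correct = [r for r in oversampled if r.is_correct]
    incorrect = [r for r in oversampled if not r.is_correct]

    # Incorrect rollouts are kept as negative signal; the remaining slots are
    # filled with the "cleanest" correct rollouts (fewest tool errors first).
    slots_for_correct = max(0, group_size - len(incorrect))
    cleanest_correct = sorted(correct, key=lambda r: r.tool_errors)[:slots_for_correct]

    group = incorrect[:group_size] + cleanest_correct
    random.shuffle(group)
    return group[:group_size]


# Example: pick a group of 8 from 16 oversampled rollouts.
batch = [Rollout([], random.random() < 0.5, random.randint(0, 3)) for _ in range(16)]
group = resample_on_correct(batch, group_size=8)
```

The intent, as stated in the abstract, is to keep noisy tool-execution feedback from dominating the positive rollouts that drive the policy update; the exact selection criteria are specified in the paper.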
Community
Nice paper, congrats 🎉 Feel free to claim it by clicking your name on the author list. @lynazhang
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models (2025)
- Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR (2025)
- Agentic Reinforced Policy Optimization (2025)
- Posterior-GRPO: Rewarding Reasoning Processes in Code Generation (2025)
- VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning (2025)
- EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning (2025)
- Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner (2025)
The idea is very interesting! I think the teaser of the paper could be clearer if it mentioned two points more explicitly:
- The base model is Qwen3-14B-Base, which has already been mid-trained with a large amount of reasoning data (so the RL stage isn’t exactly Zero).
- The method does not outperform Qwen3-14B itself, and the Qwen3-14B results were not included in the first table.
I don’t believe this was done in bad faith, but the way the first page of the report is written comes across as a bit overstated.
Hi @smahdavi4, thank you very much for the feedback!
On the first point, it is not entirely clear whether Qwen3-14B-Base received additional reasoning mid-training. In our experiments, after applying non-reasoning SFT, its performance on AIME24 was close to zero and the average rollout length with coding tools was around 1k tokens. In contrast, when we recently performed mid-training with long-CoT reasoning data and then applied non-reasoning SFT, the starting response length reached about 10k tokens. Based on this comparison, we are inclined to believe that Qwen3-14B-Base did not undergo extra long-CoT mid-training beyond the second reasoning stage described in its technical report.
We would also like to note that the definition of "zero-RL" is not entirely clear. In our work, we adopt the same convention as most RL studies, where "zero-RL" refers to applying RL directly to a given pre-trained model. Since the exact reasoning-related training and data incorporated during pre-training are often unknown to those who did not participate in the pre-training, this remains an open question worth further investigation.

On the second point, we did not include Qwen3-14B-Official in the teaser because, as stated in the Qwen3 technical report, its post-training mainly involved large-scale SFT distilled from Qwen3-235B rather than RL. That said, we do compare against Qwen3-14B in the main results (Table 3). As shown, our model outperforms Qwen3-14B on MATH-500 (97.8 vs. 96.8), AIME24 (80.6 vs. 79.3), and HMMT 25 (52.7 vs. 48.9). On AIME25, we slightly underperform (69.8 vs. 70.4).
Thank you again for raising these questions. We hope this clarifies our reasoning, and we warmly welcome further discussion on these points.