
RStar2-Agent: State-of-the-Art Mathematical Reasoning Reached By Efficient Agent-Based Reinforcement Learning With GRPO-RoC
3 main points
✔️ rStar2-Agent achieves mathematical reasoning performance beyond 671B models despite its 14B size
✔️ GRPO-RoC and a highly efficient infrastructure enable reinforcement learning that is robust to environmental noise
✔️ It reaches state-of-the-art levels in just 510 steps and generalizes its reasoning capabilities beyond mathematics
rStar2-Agent: Agentic Reasoning Technical Report
written by Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, Mao Yang
(Submitted on 28 Aug 2025)
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
This paper reports on the development and results of rStar2-Agent, a large language model specialized for mathematical reasoning.
Although the model has only 14 billion parameters, it achieves performance comparable to the state-of-the-art level previously reached by 671-billion-parameter models.
Behind this result lies a limitation of conventional approaches that rely on long Chain-of-Thought (CoT).
Namely, simply thinking for a longer time makes it difficult to detect intermediate errors or to flexibly change course.
To overcome this challenge, the authors introduce agentic reinforcement learning, which aims to make the model "think smarter" rather than merely longer.
Specifically, reinforcement learning teaches the model to generate and execute Python code as appropriate and to refine its reasoning based on the execution results.
To support this, the authors designed a highly efficient code-execution environment capable of handling 45,000 concurrent tool calls, a new algorithm called GRPO-RoC that suppresses environmental noise, and an efficient multi-stage training recipe.
The results show that the state-of-the-art can be reached in only 510 steps and one week of training, and that the reasoning capability generalizes to non-mathematical areas.
Proposed Methodology
The proposed method consists of three components for deploying agentic reinforcement learning efficiently at scale.
First, the construction of an infrastructure to support large-scale code execution.
The authors designed a dedicated execution environment that can handle up to 45,000 parallel tool calls in an average of 0.3 seconds.
In addition, a scheduler that dynamically allocates GPU compute resources was implemented to eliminate load imbalance.
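The paper's execution environment is a dedicated distributed system; purely as an illustration of the fan-out pattern such an environment relies on, the following asyncio sketch caps concurrency with a semaphore while dispatching many tool calls at once (all function names and the stubbed executor are hypothetical, not the authors' implementation):

```python
import asyncio

async def run_tool_call(call_id: int, code: str) -> str:
    """Stand-in for executing one sandboxed code snippet.

    In the real infrastructure, calls go to isolated worker processes;
    this stub only sleeps to mimic execution latency.
    """
    await asyncio.sleep(0.01)
    return f"result-{call_id}"

async def execute_batch(codes: list[str], max_concurrency: int = 1000) -> list[str]:
    """Fan out many tool calls while capping how many run concurrently."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(i: int, code: str) -> str:
        async with sem:
            return await run_tool_call(i, code)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(i, c) for i, c in enumerate(codes)))

results = asyncio.run(execute_batch(["print(1)"] * 50))
```

The semaphore is what keeps a burst of tens of thousands of requests from overwhelming the workers; the reported 0.3-second average latency comes from the paper's purpose-built environment, not from anything this sketch models.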
Second, a new algorithm called GRPO-RoC (Group Relative Policy Optimization with Resampling on Correct).
Among trajectories that reach the correct answer, this method preferentially reinforces those with the fewest tool errors and format violations, while failed trajectories are kept diverse and still used for training.
This makes learning robust to environmental noise while preventing reward hacking.
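The core selection rule of GRPO-RoC can be sketched as follows. This is a minimal illustration under assumed trajectory fields (`correct`, `tool_errors`, `format_violations`) and an assumed half-and-half split between successes and failures; the paper's actual resampling procedure may differ in these details:

```python
import random

def roc_select(trajectories: list[dict], group_size: int) -> list[dict]:
    """Resample-on-Correct sketch: from an oversampled rollout group,
    keep the cleanest correct trajectories and a uniform sample of failures.

    Each trajectory is a dict with 'correct' (bool), 'tool_errors' (int),
    and 'format_violations' (int); these field names are illustrative.
    """
    correct = [t for t in trajectories if t["correct"]]
    failed = [t for t in trajectories if not t["correct"]]

    # Prefer correct trajectories with the fewest tool errors and
    # format violations, so noisy successes are filtered out.
    correct.sort(key=lambda t: (t["tool_errors"], t["format_violations"]))
    n_correct = min(len(correct), group_size // 2)
    kept = correct[:n_correct]

    # Failures are sampled uniformly, preserving their diversity as
    # negative signal rather than filtering them by quality.
    n_failed = min(len(failed), group_size - n_correct)
    kept += random.sample(failed, n_failed)
    return kept
```

Filtering only the positive side is the key design choice: it keeps the reward signal clean without teaching the model to avoid hard problems.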
Third, an efficient learning recipe.
Instead of the conventional reasoning-heavy SFT (supervised fine-tuning), the model is first taught only basic instruction following and tool use, and its reasoning capability is then strengthened step by step with multi-stage RL.
Together, these three components make it possible to build a practical and powerful reasoning agent with smaller computational resources than previously required.
Experiments
In our experiments, we evaluated the performance of rStar2-Agent-14B on challenging benchmarks such as AIME24, AIME25, and HMMT25.
The results showed that rStar2-Agent-14B achieved an accuracy of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) and Claude-Opus-4.0.
The average response length was also short, indicating lean and efficient inference.
During the training process, the performance improvement at each stage was clearly confirmed.
In the first stage, basic reasoning capability was acquired under an 8K-token response-length limit, and extending the limit to 12K in the second stage further improved accuracy.
In the final stage, the training concentrated on more difficult problems to reach the state-of-the-art level.
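The staged schedule described above can be written down as a small configuration table. The 8K/12K limits, the hard-problem final stage, and the 510-step total come from the report; the even three-way split of steps and all identifiers are assumptions for illustration only:

```python
# Illustrative multi-stage RL schedule; not the authors' training code.
STAGES = [
    {"name": "stage1", "max_response_tokens": 8_000,  "data": "full mix"},
    {"name": "stage2", "max_response_tokens": 12_000, "data": "full mix"},
    {"name": "stage3", "max_response_tokens": 12_000, "data": "hard subset"},
]

def stage_for_step(step: int, steps_per_stage: int = 170) -> dict:
    """Map a global training step to its stage.

    The paper reports 510 steps overall; splitting them evenly
    (3 x 170) is an assumption made here for illustration.
    """
    idx = min(step // steps_per_stage, len(STAGES) - 1)
    return STAGES[idx]
```

Capping response length early and raising it later keeps early rollouts cheap, while reserving the hardest problems for the final stage focuses compute where accuracy gains remain.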
Generalization beyond mathematics was also confirmed, with strong results on the scientific reasoning benchmark GPQA-Diamond and the agentic tool-use benchmark BFCL v3.
In addition, analysis of error trajectories and self-reflective behavior revealed that the model learns "reflection token" behavior, actively using feedback from the environment to improve its reasoning through trial and error.
This confirmed that the method not only improves performance, but also mimics a more human-like thought process.