
Accelerating Reinforcement Learning with "Truncated Proximal Policy Optimization": Revolutionizing the Efficiency of Long Response Generation
3 main points
✔️ T-PPO is a method that significantly improves the computational efficiency of PPO by truncating long responses mid-generation and learning from them
✔️ EGAE is used to estimate the advantage even from partial responses and to perform policy updates sequentially
✔️ Outperforms conventional methods on the AIME mathematical reasoning benchmark while achieving up to 2.5x higher training efficiency
Truncated Proximal Policy Optimization
written by Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Bole Ma, Mofan Zhang, Gaohong Liu, Ru Zhang, Haotian Zhou, Cong Xie, Ruidong Zhu, Zhi Zhang, Xin Liu, Mingxuan Wang, Lin Yan, Yonghui Wu
(Submitted on 18 Jun 2025)
Comments: Published on arxiv.
Subjects: Artificial Intelligence (cs.AI)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
This paper proposes Truncated Proximal Policy Optimization (T-PPO), a new method that significantly improves the efficiency of Proximal Policy Optimization (PPO), a reinforcement learning algorithm used to enhance the reasoning capability of LLMs.
Conventional PPO tends to waste computational resources when long outputs are required, such as in Chain-of-Thought reasoning, because training efficiency drops as the responses to be generated grow longer.
T-PPO, on the other hand, updates the policy sequentially by utilizing partially generated outputs without waiting for responses to complete. The method introduces an estimation technique called Extended Generalized Advantage Estimation (EGAE), which allows the advantage to be calculated even from responses that are still in progress. In addition, the policy model and the value model are optimized simultaneously and independently to reduce computational redundancy.
Experiments show that the proposed method outperforms conventional methods on the AIME mathematical reasoning task while improving training efficiency by up to 2.5 times.
Proposed Method
The heart of T-PPO lies in Extended Generalized Advantage Estimation (EGAE).
While traditional GAE can compute the advantage only after the full response is obtained, EGAE extends it so that estimation is possible even for partial outputs. Specifically, the advantage is estimated as a weighted sum of temporal-difference (TD) errors computed sequentially over the states and actions obtained partway through the generation process.
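To make this concrete, the sketch below shows one way a GAE-style advantage could be bootstrapped from a truncated response, in the spirit of EGAE as described above. This is an illustrative assumption, not the authors' code: the function name, tensor layout, and default coefficients are this article's own.

```python
import torch

def egae_advantages(values, rewards, finished, gamma=1.0, lam=0.95):
    """Illustrative EGAE-style advantage estimate for one (possibly truncated) response.

    values   : (T+1,) critic estimates V(s_0..s_T); values[T] is the bootstrap
               value at the truncation point (unused if the response finished).
    rewards  : (T,) per-token rewards (often zero until the response terminates).
    finished : True if the response terminated inside this generation window.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        if finished and t == T - 1:
            next_value = 0.0            # terminal token: no bootstrapping needed
        else:
            next_value = values[t + 1]  # bootstrap, including past the cut-off point
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        gae = delta + gamma * lam * gae                       # lambda-weighted sum
        advantages[t] = gae
    return advantages
```

If the response did not terminate inside the window, the estimate simply bootstraps from the critic's value at the cut-off token instead of waiting for a final reward, which is what allows policy updates on partial outputs.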
In addition, a token filtering strategy is introduced: the latest tokens of uncompleted responses are excluded from policy updates because their advantage estimates have high variance, while all completed responses are used to train the value model. This mechanism dramatically improves the efficiency of GPU-based batch processing. T-PPO also employs a sequential-rollout batching strategy, in which the generation batch is refreshed at each step: partially completed sequences continue generating, while finished ones are replaced with new prompts (see the sketches below).
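A minimal sketch of how such token-filtering masks could be built is shown below. The function name, the exact filtering rule, and the assumption that responses are left-aligned in the tensor are all this article's own simplifications, not the paper's implementation.

```python
import torch

def build_loss_masks(token_mask, finished, n_tail_exclude=0):
    """Split tokens into those used for the policy loss and the value loss.

    token_mask     : (B, T) bool, True where a token was actually generated
                     (responses assumed left-aligned in the tensor).
    finished       : (B,) bool, True if the response terminated in this window.
    n_tail_exclude : number of the latest tokens of an *unfinished* response to
                     drop from the policy loss (their estimates vary the most).
    """
    policy_mask = token_mask.clone()
    lengths = token_mask.sum(dim=1)
    for i in range(token_mask.size(0)):
        if not finished[i] and n_tail_exclude > 0:
            end = int(lengths[i])
            policy_mask[i, max(0, end - n_tail_exclude):end] = False

    # Per the description above, only completed responses feed the value loss.
    value_mask = token_mask & finished.unsqueeze(1)
    return policy_mask, value_mask
```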
This strategy reduces the computation wait time caused by the diversity of response lengths and maximizes resource utilization. Finally, policy and value optimization proceed on a token-by-token basis, achieving both stable convergence and high sample efficiency.
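The loop below is a conceptual sketch of this sequential-rollout batching. `generate_window` and `prompt_stream` are hypothetical stand-ins for a real generation engine and a prompt loader; the actual system operates on GPU batches rather than Python lists.

```python
from collections import deque

def sequential_rollout(generate_window, prompt_stream, batch_size, window_len):
    """Keep the generation batch full by swapping finished sequences for new prompts."""
    queue = deque(prompt_stream)
    # Start with a full batch of fresh prompts.
    active = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    while active:
        # Generate at most window_len new tokens for every active sequence;
        # the (assumed) engine reports which sequences terminated.
        finished, unfinished = generate_window(active, window_len)

        # Both groups feed the current training step (see the masks above).
        yield finished, unfinished

        # Unfinished sequences continue; finished ones are replaced with new
        # prompts, so the GPU never idles waiting for the longest response.
        refill = [queue.popleft() for _ in range(min(len(finished), len(queue)))]
        active = unfinished + refill
```

The key design point is that truncation decouples the training cadence from the longest response in the batch, which is where the wall-clock savings come from.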
Experiments
Experiments were conducted on the AIME mathematical reasoning dataset to verify the efficiency and stability of the proposed method. Qwen-2.5-Base-32B was used as the base model; the policy was trained with a learning rate of 1e-6 and the value function with 2e-6.
The batch size was set to 512 prompts, 16 responses were sampled for each prompt, the maximum response length was 24k tokens, and the window length was 8k tokens. The evaluation compared T-PPO with conventional methods (PPO, PPO-EWMA, GePPO, VAPO, etc.) and found that T-PPO achieved a Pass@1 score of 62 on the AIME benchmark, the best result among them.
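For reference, the reported setup can be summarized as a single configuration sketch. The key names and layout are this article's own, not the authors' configuration format, and interpreting "24k"/"8k" as 24x1024 and 8x1024 tokens is an assumption.

```python
# Reported T-PPO training setup, gathered into one illustrative config dict.
t_ppo_config = {
    "base_model": "Qwen-2.5-Base-32B",
    "policy_lr": 1e-6,
    "value_lr": 2e-6,
    "prompts_per_batch": 512,
    "samples_per_prompt": 16,
    "max_response_length": 24 * 1024,  # "24k tokens" as reported (exact count assumed)
    "window_length": 8 * 1024,         # tokens generated per truncated rollout
}
```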
Wall-clock time was reduced by approximately 60% compared to PPO, corresponding to a 2.5x improvement in efficiency for the same number of steps. A roofline analysis also showed that T-PPO significantly improves computational intensity, indicating higher GPU utilization efficiency.
The evolution of response length during training was also analyzed, confirming that although the length varies non-monotonically, T-PPO ultimately maintains and improves its ability to generate long responses in a stable manner.