
Accelerating Reinforcement Learning with "Truncated Proximal Policy Optimization": Revolutionizing the Efficiency of Long Response Generation
3 main points
✔️ T-PPO is a method that significantly improves the computational efficiency of PPO by truncating long responses mid-generation and learning from them
✔️ EGAE is used to estimate the advantage even from partial responses and to perform policy updates sequentially
✔️ Outperforms conventional methods on the AIME mathematical reasoning benchmark while achieving up to 2.5x higher training efficiency
Truncated Proximal Policy Optimization
written by Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Bole Ma, Mofan Zhang, Gaohong Liu, Ru Zhang, Haotian Zhou, Cong Xie, Ruidong Zhu, Zhi Zhang, Xin Liu, Mingxuan Wang, Lin Yan, Yonghui Wu
(Submitted on 18 Jun 2025)
Comments: Published on arxiv.
Subjects: Artificial Intelligence (cs.AI)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
This paper proposes Truncated Proximal Policy Optimization (T-PPO), a new method that significantly improves the efficiency of Proximal Policy Optimization (PPO), a reinforcement learning algorithm used to enhance the reasoning capability of LLMs.
Conventional PPO tends to waste computational resources when long outputs are required, such as in Chain-of-Thought reasoning, because training efficiency drops as the responses to be generated grow longer.
T-PPO, on the other hand, updates the policy sequentially by utilizing partially generated outputs without waiting for responses to complete. The method introduces an estimation technique called Extended Generalized Advantage Estimation (EGAE), which allows the advantage to be calculated even from responses that are still in progress. In addition, the policy model and the value model are optimized simultaneously and independently to reduce computational redundancy.
Experiments show that the proposed method outperforms conventional methods on the AIME mathematical reasoning task while improving training efficiency by up to 2.5 times.
Proposed Method
The heart of T-PPO lies in Extended Generalized Advantage Estimation (EGAE).
While traditional GAE can compute the advantage only after the full response is obtained, EGAE extends it so that estimation is possible even for partial outputs. Specifically, the advantage is estimated as a weighted sum of temporal-difference (TD) errors computed sequentially over the states and actions obtained partway through the generation process.
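To make this concrete, the sketch below shows one way a GAE-style advantage could be bootstrapped from a truncated response, in the spirit of EGAE as described above. This is an illustrative assumption, not the authors' code: the function name, tensor layout, and default coefficients are this article's own.

```python
import torch

def egae_advantages(values, rewards, finished, gamma=1.0, lam=0.95):
    """Illustrative EGAE-style advantage estimate for one (possibly truncated) response.

    values   : (T+1,) critic estimates V(s_0..s_T); values[T] is the bootstrap
               value at the truncation point (unused if the response finished).
    rewards  : (T,) per-token rewards (often zero until the response terminates).
    finished : True if the response terminated inside this generation window.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        if finished and t == T - 1:
            next_value = 0.0            # terminal token: no bootstrapping needed
        else:
            next_value = values[t + 1]  # bootstrap, including past the cut-off point
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        gae = delta + gamma * lam * gae                       # lambda-weighted sum
        advantages[t] = gae
    return advantages
```

If the response did not terminate inside the window, the estimate simply bootstraps from the critic's value at the cut-off token instead of waiting for a final reward, which is what allows policy updates on partial outputs.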
In addition, a token filtering strategy is introduced: the latest tokens of uncompleted responses are excluded from policy updates because their advantage estimates have high variance, while all completed responses are used to train the value model. This mechanism dramatically improves the efficiency of GPU-based batch processing. T-PPO also employs a sequential-rollout batching strategy, in which the generation batch is refreshed at each step: partially completed sequences continue generating, while finished ones are replaced with new prompts (see the sketches below).
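A minimal sketch of how such token-filtering masks could be built is shown below. The function name, the exact filtering rule, and the assumption that responses are left-aligned in the tensor are all this article's own simplifications, not the paper's implementation.

```python
import torch

def build_loss_masks(token_mask, finished, n_tail_exclude=0):
    """Split tokens into those used for the policy loss and the value loss.

    token_mask     : (B, T) bool, True where a token was actually generated
                     (responses assumed left-aligned in the tensor).
    finished       : (B,) bool, True if the response terminated in this window.
    n_tail_exclude : number of the latest tokens of an *unfinished* response to
                     drop from the policy loss (their estimates vary the most).
    """
    policy_mask = token_mask.clone()
    lengths = token_mask.sum(dim=1)
    for i in range(token_mask.size(0)):
        if not finished[i] and n_tail_exclude > 0:
            end = int(lengths[i])
            policy_mask[i, max(0, end - n_tail_exclude):end] = False

    # Per the description above, only completed responses feed the value loss.
    value_mask = token_mask & finished.unsqueeze(1)
    return policy_mask, value_mask
```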
This strategy reduces the computation wait time caused by the diversity of response lengths and maximizes resource utilization. Finally, policy and value optimization proceed on a token-by-token basis, achieving both stable convergence and high sample efficiency.
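The loop below is a conceptual sketch of this sequential-rollout batching. `generate_window` and `prompt_stream` are hypothetical stand-ins for a real generation engine and a prompt loader; the actual system operates on GPU batches rather than Python lists.

```python
from collections import deque

def sequential_rollout(generate_window, prompt_stream, batch_size, window_len):
    """Keep the generation batch full by swapping finished sequences for new prompts."""
    queue = deque(prompt_stream)
    # Start with a full batch of fresh prompts.
    active = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    while active:
        # Generate at most window_len new tokens for every active sequence;
        # the (assumed) engine reports which sequences terminated.
        finished, unfinished = generate_window(active, window_len)

        # Both groups feed the current training step (see the masks above).
        yield finished, unfinished

        # Unfinished sequences continue; finished ones are replaced with new
        # prompts, so the GPU never idles waiting for the longest response.
        refill = [queue.popleft() for _ in range(min(len(finished), len(queue)))]
        active = unfinished + refill
```

The key design point is that truncation decouples the training cadence from the longest response in the batch, which is where the wall-clock savings come from.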
Experiments
Experiments were conducted on the AIME mathematical reasoning dataset to verify the efficiency and stability of the proposed method. Qwen-2.5-Base-32B was used as the base model; the policy was trained with a learning rate of 1e-6 and the value function with 2e-6.
The batch size was set to 512 prompts, 16 responses were sampled for each prompt, the maximum response length was 24k tokens, and the window length was 8k tokens. The evaluation compared T-PPO with conventional methods (PPO, PPO-EWMA, GePPO, VAPO, etc.) and found that T-PPO achieved a Pass@1 score of 62 on the AIME benchmark, the best result among them.
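For reference, the reported setup can be summarized as a single configuration sketch. The key names and layout are this article's own, not the authors' configuration format, and interpreting "24k"/"8k" as 24x1024 and 8x1024 tokens is an assumption.

```python
# Reported T-PPO training setup, gathered into one illustrative config dict.
t_ppo_config = {
    "base_model": "Qwen-2.5-Base-32B",
    "policy_lr": 1e-6,
    "value_lr": 2e-6,
    "prompts_per_batch": 512,
    "samples_per_prompt": 16,
    "max_response_length": 24 * 1024,  # "24k tokens" as reported (exact count assumed)
    "window_length": 8 * 1024,         # tokens generated per truncated rollout
}
```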
Wall-clock time was reduced by approximately 60% compared to PPO, corresponding to a 2.5x improvement in efficiency for the same number of steps. A roofline analysis also showed that T-PPO significantly improves computational intensity, indicating higher GPU utilization efficiency.
The evolution of response length during training was also analyzed, confirming that although the length varies non-monotonically, T-PPO ultimately maintains and improves its ability to generate long responses in a stable manner.