Evolution Of Llama To Support Reinforcement Learning: OctoThinker Shows The Power Of Mid-Training

3 main points
✔️ Proposed a two-stage mid-training strategy "Stable-then-Decay" to improve RL suitability of Llama models
✔️ Showed that the use of high-quality mathematical corpora and long CoT data is effective in improving RL performance
✔️ The resulting OctoThinker models reach performance comparable to Qwen2.5, showing the potential to overcome Llama's limitations

OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
written by Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu
(Submitted on 25 Jun 2025)
Comments: 26 pages; The first three authors contribute to this work equally

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper investigates what kind of mid-training is effective for enabling general base models such as Llama to acquire advanced reasoning capabilities through reinforcement learning (RL). In particular, we focus on the difference in RL scaling behavior between Qwen-based and Llama-based models, explore its causes, and propose training strategies to improve the RL performance of Llama models.

The research centers on a two-stage mid-training strategy called "Stable-then-Decay". The first stage fosters robust reasoning skills through stable training, while the second stage uses different types of data (short chain-of-thought, long chain-of-thought, and mixtures thereof) to produce multiple branch models.

As a result, this new family of models, named OctoThinker, achieves performance comparable to RL-friendly models such as Qwen2.5, showing that RL scaling is possible even for the Llama series. We have also built and released a large mathematical reasoning corpus, MegaMath-Web-Pro-Max, which lays the groundwork for future research.

Proposed Methodology

We propose a two-stage mid-training strategy, "Stable-then-Decay," to transform a model such as Llama, which is considered unsuitable for RL, into an RL-scalable foundation model.

In the first stage, "Stable," we train on 200B tokens of high-quality mathematical data (e.g., MegaMath-Web-Pro-Max) at a constant learning rate. This phase forms the foundation for the model's basic reasoning skills and mathematical knowledge.
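
To make this concrete, here is a minimal sketch of what the Stable stage setup might look like. The config names, warmup length, and learning-rate value are assumptions for illustration; only the 200B-token budget, the constant learning rate, and the MegaMath-Web-Pro-Max corpus come from the paper's description.

```python
# Minimal sketch of the "Stable" stage (hypothetical names and values; only the
# 200B-token budget, constant learning rate, and MegaMath-Web-Pro-Max corpus
# are taken from the paper).
from dataclasses import dataclass

@dataclass
class StableStageConfig:
    corpus: str = "MegaMath-Web-Pro-Max"  # high-quality math corpus used for mid-training
    token_budget: int = 200_000_000_000   # ~200B tokens in the Stable stage
    learning_rate: float = 3e-5           # assumed value; held constant throughout
    warmup_tokens: int = 1_000_000_000    # assumed short warmup before the constant phase

def stable_lr(tokens_seen: int, cfg: StableStageConfig) -> float:
    """Constant learning rate after a brief linear warmup (illustrative only)."""
    if tokens_seen < cfg.warmup_tokens:
        return cfg.learning_rate * tokens_seen / cfg.warmup_tokens
    return cfg.learning_rate

print(stable_lr(50_000_000_000, StableStageConfig()))  # 3e-05 in the constant region
```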

Then, in the second stage, "Decay," data with different properties (short chain-of-thought, long chain-of-thought, or a mixture thereof) are introduced while the learning rate is gradually decayed, branching the model into variants with different reasoning styles. Because the model unfolds in multiple directions, like the arms of an octopus, the resulting family is named "OctoThinker."
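
A rough sketch of this branching Decay stage is shown below. The three branch keys (short, long, mixed) follow the article's description; the cosine shape of the decay, the learning-rate floor, and the mixture ratios are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of the "Decay" stage: the learning rate is annealed while each branch
# trains on a different data mixture. The cosine schedule, floor value, and
# mixture ratios are illustrative assumptions, not the paper's exact recipe.
import math

BRANCH_MIXTURES = {
    "short": {"short_cot": 1.0, "long_cot": 0.0},
    "long":  {"short_cot": 0.0, "long_cot": 1.0},
    "mixed": {"short_cot": 0.5, "long_cot": 0.5},
}

def decay_lr(tokens_seen: int, total_tokens: int,
             peak_lr: float = 3e-5, min_lr: float = 3e-6) -> float:
    """Cosine decay from the Stable-stage learning rate down to a floor."""
    progress = min(tokens_seen / total_tokens, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for name, mix in BRANCH_MIXTURES.items():
    print(name, mix, f"lr at 50% of decay: {decay_lr(10 * 10**9, 20 * 10**9):.2e}")
```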

During this process, the proportion and combination of QA-format data and instruction-following data are finely controlled, and the impact of each is evaluated in detail. In addition, a response-length control scheduler and prompt templates are devised to stabilize RL training.
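
The paper's exact schedule is not spelled out here, but a response-length control scheduler of the kind mentioned above could look roughly like the following sketch, where the completion-length cap is raised step-wise over RL training; the stage count and length values are assumptions.

```python
# Illustrative sketch of a response-length control scheduler for RL training:
# the maximum allowed completion length is raised step-wise as training
# progresses. Stage boundaries and length caps are assumptions; the article
# only states that such a scheduler is used to stabilize RL.
def max_response_length(step: int, total_steps: int,
                        start_len: int = 2048, final_len: int = 4096,
                        stages: int = 4) -> int:
    """Return the completion-length cap for the current RL training step."""
    stage = min(int(stages * step / max(total_steps, 1)), stages - 1)
    increment = (final_len - start_len) // max(stages - 1, 1)
    return start_len + stage * increment

print([max_response_length(s, 1000) for s in (0, 300, 600, 999)])
# -> [2048, 2730, 3412, 4094]
```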

Experiments

Experiments were conducted using Llama and Qwen as comparators to examine differences in their learning behavior and performance under RL. Initial observations showed that while the Qwen model improved steadily, gradually increasing its response length, the Llama model exhibited anomalous training behavior, such as responses repeatedly hitting the maximum length (4,096 tokens) partway through training.

To address this, the authors applied the two-stage mid-training described above to Llama. In the first stage, stable training was performed for 200B tokens on high-quality data, mainly MegaMath-Web-Pro-Max, followed by branch training on three data configurations: short CoT, long CoT, and mixed.

We then performed RL training on each model under identical conditions and evaluated performance on 14 mathematical reasoning benchmarks, including MATH500, GSM8K, OlympiadBench, and AMC23. Each of OctoThinker's branch models outperformed the original Llama by 10-20%, with the "Long" branch in particular reaching performance levels comparable to Qwen2.5.

Thus, we quantitatively demonstrate the impact of the mid-training strategy on RL performance and show that high-performing RL adaptation is achievable with the Llama series.
