
VCRL: A New Approach To LLM Reinforcement Learning That Controls Learning Difficulty With Reward Variance
3 main points
✔️ VCRL is a reinforcement learning method that dynamically adjusts sample difficulty using reward variance
✔️ Prioritizes high-variance samples for training and reuses high-value samples via a memory bank for efficiency and stability
✔️ Consistently outperforms existing methods on mathematical reasoning benchmarks and improves adaptation to difficult problems
VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models
written by Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, Hao Wang
(Submitted on 24 Sep 2025)
Comments: Published on arXiv.
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
This paper proposes a new reinforcement learning method, Variance-based Curriculum Reinforcement Learning (VCRL), to improve the mathematical reasoning ability of LLMs.
Existing rollout-based reinforcement learning methods such as GRPO, DAPO, and GSPO are good at learning from a wide variety of samples, but they lack a mechanism for adjusting sample difficulty to the model's current stage of learning.
In other words, they do not account for the curriculum-learning principle that, as in human learning, training should progress from easy problems to difficult ones.
VCRL addresses this from the perspective of reward variance: based on the idea that the variance of rewards across a sample's rollouts reflects its difficulty, samples with high variance are preferentially selected for training.
In addition, a Replay Learning mechanism accumulates high-value samples in a memory bank, making training more efficient and stable.
Experiments on five mathematical reasoning benchmarks, including AIME, MATH500, and OlympiadBench, show that VCRL consistently outperforms existing methods.
Proposed Methodology
The proposed method, VCRL, consists of two components.
The first is Variance-based Dynamic Sampling.
Multiple rollouts are generated for each sample and the variance of their rewards is calculated.
For samples that are too easy, almost all rewards are close to 1; conversely, for samples that are too difficult, almost all rewards are close to 0. In both cases the variance is small.
The variance is largest for samples of medium difficulty, where correct and incorrect rollouts split roughly 50-50 (for binary rewards with success rate p, the variance is p(1-p), which peaks at 0.25 when p = 0.5).
The authors argue that these high-variance samples are the most effective for learning, and VCRL preferentially includes them in training, as sketched below.
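As a concrete illustration, here is a minimal sketch of variance-based sample selection, assuming binary 0/1 rewards and a hypothetical variance threshold `tau` (the paper's exact selection rule may differ):

```python
import numpy as np

def select_high_variance_samples(reward_groups, tau=0.05):
    """Keep only prompts whose rollout-reward variance exceeds a threshold.

    reward_groups: list of 1-D arrays, one per prompt, holding the (e.g. 0/1)
    rewards of that prompt's rollouts. `tau` is an illustrative cutoff.
    """
    selected = []
    for idx, rewards in enumerate(reward_groups):
        var = np.var(np.asarray(rewards, dtype=float))
        # Variance is ~0 when all rollouts succeed (too easy) or all fail
        # (too hard); it peaks at 0.25 when successes and failures split 50-50.
        if var > tau:
            selected.append(idx)
    return selected

# Example: an easy prompt, a hard prompt, and a medium-difficulty prompt
groups = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0]]
print(select_high_variance_samples(groups))  # -> [2]
```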
The second component is Replay Learning.
It improves training stability by keeping high-value samples in a memory bank and reusing them as needed.
Specifically, samples whose reward variance falls below a threshold are removed from the batch and replaced with high-value samples drawn from the memory bank.
The memory bank itself is updated with momentum so that it stays current and diverse; a rough sketch of such a bank follows.
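The sketch below shows one plausible way to maintain such a memory bank. The scoring rule, momentum update, and eviction policy are assumptions for illustration, not the paper's exact formulation:

```python
import random

class ReplayBank:
    """Minimal sketch of a memory bank for high-value samples."""

    def __init__(self, capacity=512, momentum=0.9):
        self.capacity = capacity
        self.momentum = momentum
        self.bank = {}  # sample_id -> (sample, score)

    def update(self, sample_id, sample, variance):
        # Momentum-style update keeps scores current while smoothing noise.
        if sample_id in self.bank:
            _, old_score = self.bank[sample_id]
            score = self.momentum * old_score + (1 - self.momentum) * variance
        else:
            score = variance
        self.bank[sample_id] = (sample, score)
        if len(self.bank) > self.capacity:
            # Evict the lowest-scoring entry so the bank keeps high-value samples.
            worst = min(self.bank, key=lambda k: self.bank[k][1])
            del self.bank[worst]

    def replay(self, n):
        # Draw up to n stored samples to replace low-variance ones in the batch.
        entries = list(self.bank.values())
        return [sample for sample, _ in random.sample(entries, min(n, len(entries)))]
```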
Through these two mechanisms, VCRL dynamically matches sample difficulty to the model's current ability, achieving efficient and stable reinforcement learning.
Experiments
Experiments focused on mathematical reasoning tasks and were conducted using five benchmarks (AIME-2024, AIME-2025, MATH500, OlympiadBench, and AMC23).
The models Qwen3-4B-Base and Qwen3-8B-Base were used, and VCRL was compared against existing reinforcement learning methods such as GRPO, DAPO, and GSPO.
Training used DAPO-Math-17K, a dataset of roughly 17,000 math problems, with 16 rollouts per sample, a batch size of 128, and 500 training steps; this setup is summarized below.
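For reference, the reported setup can be captured in a configuration sketch. The field names here are hypothetical; only the values come from the paper:

```python
# Hypothetical config keys; values as reported in the paper.
vcrl_training_config = {
    "train_dataset": "DAPO-Math-17K",   # ~17,000 math problems
    "rollouts_per_sample": 16,
    "batch_size": 128,
    "train_steps": 500,
    "models": ["Qwen3-4B-Base", "Qwen3-8B-Base"],
    "baselines": ["GRPO", "DAPO", "GSPO"],
}
```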
As a result, VCRL showed the best performance on all benchmarks and on both models.
In particular, large gains were seen on the more challenging AIME-2024 and AIME-2025, with the average score for Qwen3-8B-Base rising from 32.96 for the base model to 57.76.
Analysis of the learning curves also confirms that VCRL improves rapidly from the early stages of training and remains stably ahead of the other methods through the final stage.
Furthermore, ablation experiments revealed that both "Variance-based Dynamic Sampling" and "Replay Learning" contributed to the performance improvement, confirming the effectiveness and robustness of VCRL.