
MMR1: A Multimodal Reasoning Model That Stabilizes Reinforcement Learning with Sampling Based on Reward Variance
3 main points
✔️ MMR1 uses Variance-Aware Sampling, which selects training data by reward variance, to stabilize reinforcement learning
✔️ Approximately 1.6 million long CoT samples and about 15,000 RL QA pairs were released for reproducibility and further development
✔️ Outperformed existing models on mathematical and logical reasoning benchmarks, demonstrating both efficiency and generality
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
written by Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu
(Submitted on 25 Sep 2025)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper proposes a new training strategy, Variance-Aware Sampling (VAS), to improve the performance of large-scale multimodal reasoning models.
In recent years, large language models and multimodal models have made steady progress on complex tasks involving mathematics and logic.
However, Group Relative Policy Optimization (GRPO), a representative reinforcement learning method, suffers from vanishing gradients when the reward variance within a sampled group becomes small, which weakens the optimization signal and destabilizes training.
In addition, the lack of high-quality, large-scale "chain-of-thought" data available to the public was another factor hindering reproducibility and research progress.
To address this, the study presents three contributions: (1) VAS, a data selection method that stabilizes training by promoting reward variance; (2) large-scale datasets containing about 1.6 million long CoT samples and about 15,000 QA pairs for RL; and (3) the release of multimodal reasoning models at multiple scales.
Theoretical analysis shows that reward variance lower-bounds the magnitude of gradient updates, establishing VAS as a principled and practical remedy.
In addition, the released code and model suite provide a standard baseline resource for the research community.
Proposed Methodology
The proposed method, VAS, is designed to overcome the vanishing-gradient problem that arises during GRPO training.
The basic idea is that samples with higher reward variance are more informative for learning and produce stronger gradient signals.
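To see concretely why reward variance matters, consider the group-relative advantage used in GRPO-style training; the following is a minimal schematic in our own notation, not a formula quoted from the paper. For each prompt, G responses are sampled and each receives a reward r_i (e.g., 1 if correct, 0 otherwise):

```latex
% Group-relative advantage in GRPO (schematic; notation ours)
\[
\hat{A}_i \;=\; \frac{r_i - \bar{r}}{\sigma_r + \epsilon},
\qquad
\bar{r} = \frac{1}{G}\sum_{j=1}^{G} r_j,
\qquad
\sigma_r^2 = \frac{1}{G}\sum_{j=1}^{G} \bigl(r_j - \bar{r}\bigr)^2 .
\]
% With binary correctness rewards and pass rate p, the group reward
% variance is p(1 - p): it vanishes when all responses are correct
% (p = 1) or all are wrong (p = 0), so every advantage -- and with it
% the prompt's contribution to the policy gradient -- becomes zero.
\[
\sigma_r^2 = p\,(1 - p) = 0 \ \text{ for } p \in \{0, 1\}
\quad\Longrightarrow\quad
\hat{A}_i = 0 \ \text{ for all } i .
\]
```

Prompts whose rollouts are all correct or all wrong therefore contribute nothing to the update; this is exactly the degenerate case that VAS tries to avoid.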
To this end, VAS calculates a Variance Promotion Score (VPS) for each sample and selects training data based on this value.
The VPS consists of two components.
One is the Outcome Variance Score (OVS), which gives high scores to tasks with a well-balanced mix of correct and incorrect responses.
The other is the Trajectory Diversity Score (TDS), which prioritizes tasks that generate a variety of reasoning pathways.
This allows training to incorporate samples that are more informative for the model, rather than monotonous and predictable samples.
In addition, VAS is combined with random sampling so that reward variance is promoted while data coverage is preserved.
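As a rough illustration of how such scoring could drive batch construction, here is a minimal Python sketch. The concrete OVS and TDS formulas, the weights, and the VAS-to-random mixing ratio are illustrative assumptions, not the paper's implementation.

```python
import random

def outcome_variance_score(rewards):
    """OVS sketch: for binary correctness rewards, the within-group
    variance is p * (1 - p), maximal when correct and incorrect
    responses are balanced."""
    p = sum(rewards) / len(rewards)
    return p * (1.0 - p)

def trajectory_diversity_score(trajectories):
    """TDS sketch: reward prompts whose sampled reasoning paths differ.
    Diversity is approximated here by average pairwise Jaccard distance
    over token sets (an assumption; the paper may measure it differently)."""
    sets = [set(t.split()) for t in trajectories]
    pairs = [(a, b) for i, a in enumerate(sets) for b in sets[i + 1:]]
    if not pairs:
        return 0.0
    dist = lambda a, b: 1.0 - len(a & b) / max(len(a | b), 1)
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def variance_promotion_score(rewards, trajectories, w_ovs=1.0, w_tds=0.5):
    """VPS sketch: weighted combination of OVS and TDS (weights assumed)."""
    return (w_ovs * outcome_variance_score(rewards)
            + w_tds * trajectory_diversity_score(trajectories))

def select_batch(pool, batch_size, vas_fraction=0.5):
    """Mix VPS-weighted sampling with uniform random sampling
    so that high-variance prompts are favored without losing coverage."""
    n_vas = int(batch_size * vas_fraction)
    weights = [max(item["vps"], 1e-6) for item in pool]
    vas_part = random.choices(pool, weights=weights, k=n_vas)
    random_part = random.choices(pool, k=batch_size - n_vas)
    return vas_part + random_part

# Example: score each prompt from rollouts of the current policy,
# then draw a training batch.
pool = [
    {"prompt": "Q1", "vps": variance_promotion_score(
        [1, 0, 1, 0], ["path a ...", "path b ...", "path c ...", "path d ..."])},
    {"prompt": "Q2", "vps": variance_promotion_score(
        [1, 1, 1, 1], ["same path"] * 4)},  # zero OVS: no GRPO signal
]
batch = select_batch(pool, batch_size=2)
```

In practice, such scores would be refreshed periodically from rollouts of the current policy, since pass rates and trajectory diversity shift as the model improves.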
Theoretically, VAS is grounded in the Variance-Progress Theorem, which guarantees that the reward variance lower-bounds the magnitude of the gradient update, a mechanism that underpins the stability and efficiency of training.
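Read schematically (again in our notation, not the paper's exact statement), the guarantee has the following form, where c > 0 is a constant depending on the policy and the reward scale:

```latex
% Schematic variance-progress bound (notation ours, constant unspecified)
\[
\mathbb{E}\bigl[\,\bigl\|\nabla_\theta \mathcal{J}(q)\bigr\|\,\bigr]
\;\ge\; c \cdot \operatorname{Var}_{o \sim \pi_\theta(\cdot \mid q)}\!\bigl[\,r(q, o)\,\bigr],
\qquad c > 0 .
\]
% Prompts with zero reward variance contribute no progress, while
% selecting high-variance prompts keeps the optimization signal alive.
```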
Experiments
Several benchmarks (MathVerse, MathVista, MathVision, LogicVista, and ChartQA) focused on mathematical and logical reasoning were used in the experiments.
Models of 3B and 7B parameters built on the Qwen2.5-VL family were employed, with general-purpose models (e.g., InternVL, LLaVA-OV) and reasoning-specialized models (e.g., VL-Cogito, R1-VL, MM-Eureka) selected for comparison.
As a result, MMR1-7B achieved an average score of 58.4, outperforming reasoning-oriented models of similar size.
In particular, significant improvements were seen on complex reasoning tasks such as MathVerse and LogicVista, indicating that VAS contributes to both training stability and performance.
In addition, the 3B model also produced comparable results to several 7B models, demonstrating high efficiency even under resource constraints.
Furthermore, ablation experiments revealed that cold-start initialization, reinforcement learning with GRPO, and stabilization via VAS complement one another and together support the final performance.
This strongly supports the effectiveness and versatility of the proposed method.