
Pref-GRPO: A New Method for Stable Reinforcement Learning of Text-to-Image Generation Using Pairwise Comparison
3 main points
✔️ Conventional GRPO with score-based rewards is prone to "reward hacking" and compromises the quality of the generated images
✔️ Proposed Pref-GRPO utilizes relative preferences based on pairwise comparisons to achieve stable optimization
✔️ UniGenBench, a new benchmark, enables fine-grained evaluation of logical reasoning, grammar, text understanding, and more
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
written by Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
(Submitted on 28 Aug 2025)
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper proposes a new approach to reinforcement learning methods for text-to-image (T2I) models.
Conventional GRPO (Group Relative Policy Optimization) methods evaluate the quality of generated images with a score-based reward model and update the policy using scores normalized within each group.
However, this approach is prone to a problem known as "reward hacking," in which scores increase while image quality decreases.
The authors attribute this to what they call "illusory advantage": when the score differences between generated images are very small, within-group normalization amplifies those tiny, often noise-driven differences into large advantages.
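The effect is easy to see numerically. The following is a minimal sketch (not the authors' code) of the standard GRPO group normalization applied to four nearly identical reward scores; the function name is my own.

```python
import numpy as np

def group_normalized_advantages(rewards):
    """Standard GRPO-style advantage: z-normalize rewards within a group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Nearly identical scores from a score-based reward model:
scores = [0.912, 0.914, 0.913, 0.915]
adv = group_normalized_advantages(scores)
print(adv)  # raw gaps of ~0.001 are stretched to advantage magnitudes above 1
```

Because the standard deviation in the denominator is tiny, differences that may be pure reward-model noise are inflated into strong learning signals, which is exactly the "illusory advantage" the paper describes.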
To solve this problem, the study proposed a new method called Pref-GRPO.
This is a mechanism that updates the policy based on relative preferences between image pairs (pairwise preferences) rather than absolute scores.
In addition, the authors designed a new benchmark called "UniGenBench" for model evaluation, which enables evaluation of T2I model performance in fine-grained dimensions.
The significance of this work lies in the fact that it overcomes the limitations of conventional methods and enables learning of image generation that is more stable and more in line with human preferences.
Proposed Method
The central idea of Pref-GRPO is to shift the learning objective from conventional reward-score maximization to "relative preference matching."
Specifically, multiple images are generated for a given prompt, and they are compared pairwise.
A pairwise preference reward model (PPRM) determines which image in each pair is preferred, and each image's win rate over the group is used as the reward signal.
This win rate is then normalized within the group and used to update the policy.
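The procedure above can be sketched as follows. This is a toy illustration, not the authors' implementation: `prefer(a, b)` stands in for the PPRM (here replaced by a simple numeric comparison), and the function name is my own.

```python
import itertools
import numpy as np

def win_rate_rewards(images, prefer):
    """Compare every pair in the group with a pairwise preference model
    `prefer(a, b)` (True if `a` beats `b`); each image's reward is its
    win rate, which is then z-normalized within the group (GRPO-style)."""
    n = len(images)
    wins = np.zeros(n)
    for i, j in itertools.combinations(range(n), 2):
        if prefer(images[i], images[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    win_rate = wins / (n - 1)  # fraction of pairwise comparisons won
    adv = (win_rate - win_rate.mean()) / (win_rate.std() + 1e-8)
    return win_rate, adv

# Toy stand-in for the PPRM: prefer the image with the larger "quality" value.
images = [0.2, 0.9, 0.5, 0.7]
rates, adv = win_rate_rewards(images, lambda a, b: a > b)
print(rates)  # win rates spread over [0, 1], e.g. best image wins all pairs
```

Because win rates are spread across [0, 1] by construction, the group variance stays large even when the underlying images are of similar quality, avoiding the inflated advantages that arise from normalizing nearly identical scores.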
This design has three advantages.
First, using the win rate widens the variance of the reward signal, making high- and low-quality images easier to distinguish.
Second, because it is based on relative rankings rather than absolute score differences, it is robust against reward noise and reduces the incidence of reward hacking.
Third, because it reflects the fact that human judgments are inherently based on relative comparisons, it can provide a more natural and faithful reward signal.
Furthermore, in terms of evaluation, the "UniGenBench" proposed by the authors enables evaluation of even detailed dimensions such as text comprehension and logical inference, allowing a precise analysis of the strengths and weaknesses of the model.
Experiments
In the experiments, the authors first compared Pref-GRPO with existing reward-maximization methods (HPS, CLIP, UnifiedReward, etc.).
Flux.1-dev was used as the base model, and UniGenBench was employed for evaluation.
The results showed that Pref-GRPO improved the overall score by about 6 points, with particularly large gains in the logical reasoning and text rendering dimensions.
In addition, while "reward hacking," in which the quality of images deteriorates while reward scores increase during training, was observed with the conventional method, this phenomenon was effectively suppressed with Pref-GRPO.
Furthermore, a qualitative comparison of the generated images showed that while existing methods showed unnatural tendencies such as excessive saturation, Pref-GRPO produced a more natural and stable representation.
In addition, stable performance improvements were observed in external benchmarks (GenEval and T2I-CompBench).
Extensive model comparisons using UniGenBench showed that closed-source models such as GPT-4o and Imagen-4.0-Ultra performed well, while open-source models such as Qwen-Image and HiDream also showed rapid progress.
In general, it can be concluded that this method is an effective approach to significantly improve the stability and utility of T2I reinforcement learning.