
Semantics-Oriented Reward Design With "PrefBERT," a New Evaluation Method to Improve Long-Form Generation
3 main points
✔️ Traditional evaluation metrics do not properly measure the quality of long-form generation and diverge from human judgments
✔️ The authors developed PrefBERT, a lightweight evaluation model that provides semantically consistent rewards
✔️ Using PrefBERT as the reward in experiments improved generation quality and earned high ratings in human evaluation
Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
written by Zongxia Li, Yapei Chang, Yuhang Zhou, Xiyang Wu, Zichao Liang, Yoo Yeon Sung, Jordan Lee Boyd-Graber
(Submitted on 18 Jun 2025)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper addresses evaluation and reinforcement learning for free-form, long-form text generation by large language models.
Conventional evaluation metrics such as ROUGE and BERTScore measure only word overlap or embedding similarity and do not adequately capture the qualities humans care about, such as coherence, information coverage, and appropriate style. As a result, they cannot provide the reward signal needed during model training, and improvements in generation quality plateau.
The authors therefore propose a lightweight evaluation model called PrefBERT. It is trained on diverse long-form responses paired with human 5-point (Likert) ratings and produces more precise, semantically consistent scores. Experiments show that using this model as the reward in the reinforcement learning method GRPO yields outputs that align better with human preferences than training with conventional metrics. The authors position this result as an important step toward higher-quality long-form generation.
Proposed Methodology
The core of the proposed method is to utilize a small-scale BERT-based model called PrefBERT as a reward function.
First, training data is constructed by pairing grammatically and semantically diverse responses with Likert-scale scores from human quality judgments. For the input format, the reference and generated responses are concatenated into a single sequence that starts with a [CLS] token and uses a [SEP] token as the separator, yielding a unified input representation.
A linear layer followed by a sigmoid is then applied to the pooled whole-sequence embedding to output a normalized quality score between 0 and 1. This score is used as the reward signal for GRPO to optimize the policy of the generative model.
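The following is a minimal sketch of this scoring pipeline, assuming a Hugging Face transformers BERT-style encoder; the checkpoint name, the use of the [CLS] embedding as the pooled representation, and the helper names are illustrative assumptions, not the authors' released implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical backbone checkpoint; the paper's exact encoder may differ.
MODEL_NAME = "bert-base-uncased"


class PrefScorer(torch.nn.Module):
    """Small BERT encoder + linear head + sigmoid -> quality score in [0, 1]."""

    def __init__(self, model_name: str = MODEL_NAME):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # [CLS] token embedding as the pooled representation
        return torch.sigmoid(self.head(pooled)).squeeze(-1)


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
scorer = PrefScorer().eval()


def prefbert_reward(reference: str, generated: str) -> float:
    # Passing a text pair makes the tokenizer build "[CLS] reference [SEP] generated [SEP]".
    batch = tokenizer(reference, generated, return_tensors="pt",
                      truncation=True, max_length=512)
    with torch.no_grad():
        return scorer(batch["input_ids"], batch["attention_mask"]).item()
```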
Unlike conventional rule-based rewards, this approach captures multiple layers of linguistic quality, allowing it to properly assess the coherence and fluency that matter in long-form generation. It is also computationally efficient thanks to the small size of the model.
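For context, GRPO typically turns such per-response scores into a training signal by sampling several responses per prompt and standardizing their rewards within the group. The sketch below illustrates only that step; the group size and the stabilizing epsilon are illustrative choices, not values from the paper.

```python
import torch


def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Standardize rewards within the group of responses sampled for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)


# Example: PrefBERT scores for four sampled completions of the same prompt.
scores = torch.tensor([0.82, 0.55, 0.91, 0.40])
advantages = grpo_advantages(scores)  # positive for above-average completions
```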
Experiments
Experiments were conducted on three datasets: ELI5, Alpaca, and LongForm. All of them contain long responses averaging about 185 words and include a variety of styles: expository, directive, and creative.
For model training, the 1.5B and 3B variants of the base model Qwen2.5 were used, and PrefBERT, GRM-LLaMA-3B, ROUGE-L, and BERTScore were compared as reward functions.
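For reference, the two metric-based baseline rewards can be computed with the rouge_score and bert_score packages; the configuration below (stemming enabled, default BERTScore backbone) is an assumption for illustration and not necessarily the paper's exact setup.

```python
from rouge_score import rouge_scorer
from bert_score import score as bertscore

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)


def rouge_l_reward(reference: str, generated: str) -> float:
    # F-measure of the longest-common-subsequence overlap with the reference.
    return _rouge.score(reference, generated)["rougeL"].fmeasure


def bertscore_reward(reference: str, generated: str) -> float:
    # F1 of greedy token-embedding matching between candidate and reference.
    _, _, f1 = bertscore([generated], [reference], lang="en", verbose=False)
    return f1.item()
```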
Evaluation combined Likert scores assigned by GPT-4 with relative ranking by human raters. The results show that models trained with PrefBERT consistently outperformed same-size models trained with the other reward functions. They excelled in particular on structural clarity and information richness, confirming that PrefBERT helps rein in overly redundant generation.
These results support the effectiveness of semantically-aware reward design.