
CompassVerifier: A New Benchmark And Robust Model To Revolutionize LLM Solution Verification

3 main points
✔️ VerifierBench and CompassVerifier are proposed to overcome the limitations of conventional verification methods
✔️ CompassVerifier judges various answers including mathematical expressions, multi-stage reasoning, and invalid answers with high accuracy
✔️ Experiments show that CompassVerifier outperforms existing LLMs and dedicated verifiers, and its effectiveness as a reward model is also validated

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
written by Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F. Wong, Songyang Zhang, Kai Chen
(Submitted on 5 Aug 2025)
Comments: Technical Report; 31 Pages

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper focuses on "answer verification," which is essential for evaluating the performance of LLMs and designing rewards in reinforcement learning.

Conventional verification methods rely either on simple string matching with regular expressions or on generic LLMs used as judges.
However, the former requires hand-crafted rules and is inflexible, while the latter requires task-specific prompt tuning and carries a high risk of hallucinations and false positives.
Another major limitation was the lack of a comprehensive benchmark that could evaluate complex problems and diverse solution formats across the board.
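As a concrete illustration of the first limitation, the sketch below shows a naive regex-based check in Python (the answer template and function names are hypothetical, not taken from the paper): any response that strays from the expected phrasing is missed, and mathematically equivalent answers written in a different notation are judged incorrect.

```python
import re

def extract_answer(response: str):
    """Pull the final answer out of a response that follows a fixed template.
    Responses that deviate from the template yield no answer at all."""
    match = re.search(r"[Tt]he answer is\s*(.+)", response)
    return match.group(1).strip().rstrip(".") if match else None

def verify_by_string_match(response: str, gold: str) -> bool:
    """Exact string comparison: equivalent answers in other notations fail."""
    pred = extract_answer(response)
    return pred is not None and pred == gold

print(verify_by_string_match("The answer is 0.5.", "1/2"))   # False, although 0.5 equals 1/2
print(verify_by_string_match("So we get x = 0.5.", "0.5"))   # False, template not matched
print(verify_by_string_match("The answer is 0.5.", "0.5"))   # True
```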

To address these issues, the authors built a new evaluation platform called VerifierBench and developed a lightweight and highly accurate verification model called CompassVerifier.
Together, they enable answer verification across multiple domains, including mathematics, knowledge, and reasoning, and provide a robust framework that accurately identifies not only incorrect answers but also invalid responses.

Proposed Methodology

The method proposed by the authors consists of two pillars.

The first is VerifierBench.
This is a benchmark of over 1.3 million responses collected from more than 50 models and 15 datasets, curated through multi-step automated validation and expert annotation. In addition to correct and incorrect answers, invalid responses (incomplete, repetitive, or refused answers, etc.) are explicitly labeled, allowing for a more precise performance evaluation than was previously possible.
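The role of the third label class can be made concrete with a small sketch of such a labeling scheme (the field and label names below are assumptions for illustration, not the paper's actual schema):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Verdict(str, Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    INVALID = "invalid"   # incomplete, repetitive, or refused responses

@dataclass
class VerifierSample:
    question: str
    gold_answer: str
    model_response: str
    verdict: Verdict
    invalid_reason: Optional[str] = None   # only set when verdict is INVALID

# A refusal is neither correct nor incorrect; it is labeled as invalid.
sample = VerifierSample(
    question="What is 2 + 2?",
    gold_answer="4",
    model_response="I'm sorry, I cannot help with that.",
    verdict=Verdict.INVALID,
    invalid_reason="refusal",
)
```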

The second is CompassVerifier.
This model is built on VerifierBench and is enhanced by three extensions.
These are (1) error-pattern-driven adversarial augmentation to improve robustness against misjudgment, (2) complex-formula augmentation to improve equivalence checking across diverse mathematical notations, and (3) generalization augmentation to improve adaptability to different tasks and prompt formats.
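Extension (2) targets the fact that the same answer can be written in many notations. CompassVerifier learns this equivalence judgment end-to-end from data; the sketch below only illustrates the underlying problem with a symbolic check (assuming the sympy library, which is not part of the paper's method):

```python
import sympy as sp

def formulas_equivalent(expr_a: str, expr_b: str) -> bool:
    """Return True if two answer strings denote the same mathematical value,
    even when written in different notations (fraction vs. decimal, etc.)."""
    try:
        diff = sp.simplify(sp.sympify(expr_a) - sp.sympify(expr_b))
        return diff == 0
    except (sp.SympifyError, TypeError):
        return False   # unparsable input cannot be confirmed as equivalent

print(formulas_equivalent("1/2", "0.5"))            # True
print(formulas_equivalent("sqrt(8)", "2*sqrt(2)"))  # True
print(formulas_equivalent("x + 1", "x + 2"))        # False
```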

These innovations make CompassVerifier more accurate and robust than conventional regular-expression-based and LLM-based verifiers.

Experiments

In the experiments, the authors trained CompassVerifier at parameter scales from 3B to 32B and evaluated it on VerifierBench.
Comparisons were made with generic LLMs such as GPT-4o and DeepSeek-V3, as well as with the existing dedicated verifiers xVerify and Tencent-RLVR.
As a result, CompassVerifier achieved a new SOTA across all domains. In particular, the 32B model reached an accuracy of over 90% and an F1 score of over 87%, significantly outperforming similarly sized LLMs and existing verifiers.

In the evaluation by answer format, high scores were obtained for multiple-choice questions, but sequence-style answers and answers containing multiple sub-questions proved harder: conventional models were limited to F1 scores of 40 or below, whereas CompassVerifier consistently maintained high accuracy.
Furthermore, the effectiveness of CompassVerifier as a reward model in reinforcement learning was verified: training with CompassVerifier showed faster convergence and larger performance gains than training with a rule-based verifier.

This confirms that the model is promising not only as an evaluation platform but also as a reward signal to guide learning.
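As a rough sketch of how a verifier verdict can be turned into a reward signal (the verdict labels, reward values, and verifier interface below are assumptions for illustration, not the paper's exact setup):

```python
def compute_reward(question: str, gold_answer: str, rollout: str, verifier) -> float:
    """Map a verifier verdict onto a scalar reward for policy optimization.
    `verifier` is any callable returning "correct", "incorrect", or "invalid"."""
    verdict = verifier(question=question, gold=gold_answer, response=rollout)
    if verdict == "correct":
        return 1.0
    # Invalid responses (truncated, repetitive, or refused) also receive zero
    # reward, so the policy gains nothing from gaming the answer extraction.
    return 0.0

# Example with a trivial stand-in verifier:
dummy_verifier = lambda question, gold, response: "correct" if gold in response else "incorrect"
print(compute_reward("What is 2 + 2?", "4", "The answer is 4.", dummy_verifier))  # 1.0
```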

