
CompassVerifier: A New Benchmark And Robust Model To Revolutionize LLM Solution Verification

3 main points
✔️ VerifierBench and CompassVerifier are proposed to overcome the limitations of conventional verification methods
✔️ CompassVerifier judges various answers including mathematical expressions, multi-stage reasoning, and invalid answers with high accuracy
✔️ Experiments show that CompassVerifier outperforms existing LLMs and dedicated verifiers, and its effectiveness as a reward model is also validated

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
written by Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F. Wong, Songyang Zhang, Kai Chen
(Submitted on 5 Aug 2025)
Comments: Technical Report; 31 Pages

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper focuses on "answer verification," which is essential for evaluating the performance of LLMs and designing rewards in reinforcement learning.

Conventional verification methods rely either on simple string matching with regular expressions or on generic LLMs used as judges.
However, the former requires hand-crafted rules and is inflexible, while the latter requires task-specific prompt tuning and carries a high risk of hallucinations and false positives.
Another major limitation was the lack of a comprehensive benchmark that could evaluate complex problems and diverse solution formats across the board.
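As a concrete illustration of the first limitation, the sketch below shows a naive regex-based check in Python (the answer template and function names are hypothetical, not taken from the paper): any response that strays from the expected phrasing is missed, and mathematically equivalent answers written in a different notation are judged incorrect.

```python
import re

def extract_answer(response: str):
    """Pull the final answer out of a response that follows a fixed template.
    Responses that deviate from the template yield no answer at all."""
    match = re.search(r"[Tt]he answer is\s*(.+)", response)
    return match.group(1).strip().rstrip(".") if match else None

def verify_by_string_match(response: str, gold: str) -> bool:
    """Exact string comparison: equivalent answers in other notations fail."""
    pred = extract_answer(response)
    return pred is not None and pred == gold

print(verify_by_string_match("The answer is 0.5.", "1/2"))   # False, although 0.5 equals 1/2
print(verify_by_string_match("So we get x = 0.5.", "0.5"))   # False, template not matched
print(verify_by_string_match("The answer is 0.5.", "0.5"))   # True
```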

To address these issues, the authors built a new evaluation platform called VerifierBench and developed a lightweight and highly accurate verification model called CompassVerifier.
Together, they enable answer verification across multiple domains, including mathematics, knowledge, and reasoning, and provide a robust framework that accurately identifies not only incorrect answers but also invalid responses.

Proposed Methodology

The method proposed by the authors consists of two pillars.

The first is VerifierBench.
This is a benchmark of over 1.3 million responses collected from more than 50 models and 15 datasets, curated through multi-step automated validation and expert annotation. In addition to correct and incorrect answers, invalid responses (incomplete, repetitive, or refused answers, etc.) are explicitly labeled, allowing for a more precise performance evaluation than was previously possible.
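The role of the third label class can be made concrete with a small sketch of such a labeling scheme (the field and label names below are assumptions for illustration, not the paper's actual schema):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Verdict(str, Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    INVALID = "invalid"   # incomplete, repetitive, or refused responses

@dataclass
class VerifierSample:
    question: str
    gold_answer: str
    model_response: str
    verdict: Verdict
    invalid_reason: Optional[str] = None   # only set when verdict is INVALID

# A refusal is neither correct nor incorrect; it is labeled as invalid.
sample = VerifierSample(
    question="What is 2 + 2?",
    gold_answer="4",
    model_response="I'm sorry, I cannot help with that.",
    verdict=Verdict.INVALID,
    invalid_reason="refusal",
)
```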

The second is CompassVerifier.
This model is built on VerifierBench and is enhanced by three extensions.
These are (1) error-pattern-driven adversarial augmentation to improve robustness against misjudgment, (2) complex-formula augmentation to improve equivalence checking across diverse mathematical notations, and (3) generalization augmentation to improve adaptability to different tasks and prompt formats.
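Extension (2) targets the fact that the same answer can be written in many notations. CompassVerifier learns this equivalence judgment end-to-end from data; the sketch below only illustrates the underlying problem with a symbolic check (assuming the sympy library, which is not part of the paper's method):

```python
import sympy as sp

def formulas_equivalent(expr_a: str, expr_b: str) -> bool:
    """Return True if two answer strings denote the same mathematical value,
    even when written in different notations (fraction vs. decimal, etc.)."""
    try:
        diff = sp.simplify(sp.sympify(expr_a) - sp.sympify(expr_b))
        return diff == 0
    except (sp.SympifyError, TypeError):
        return False   # unparsable input cannot be confirmed as equivalent

print(formulas_equivalent("1/2", "0.5"))            # True
print(formulas_equivalent("sqrt(8)", "2*sqrt(2)"))  # True
print(formulas_equivalent("x + 1", "x + 2"))        # False
```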

These innovations make CompassVerifier more accurate and robust than conventional regular-expression-based and LLM-based verifiers.

Experiments

In the experiments, the authors trained CompassVerifier at parameter scales from 3B to 32B and evaluated it on VerifierBench.
Comparisons were made with generic LLMs such as GPT-4o and DeepSeek-V3, as well as with the existing dedicated verifiers xVerify and Tencent-RLVR.
As a result, CompassVerifier achieved a new SOTA across all domains. In particular, the 32B model reached an accuracy of over 90% and an F1 score of over 87%, significantly outperforming similarly sized LLMs and existing verifiers.

In the evaluation by answer format, high scores were obtained for multiple-choice questions, but sequence-style answers and answers containing multiple sub-questions proved harder: conventional models were limited to F1 scores of 40 or below, whereas CompassVerifier consistently maintained high accuracy.
Furthermore, the effectiveness of CompassVerifier as a reward model in reinforcement learning was verified: training with CompassVerifier showed faster convergence and larger performance gains than training with a rule-based verifier.

This confirms that the model is promising not only as an evaluation platform but also as a reward signal to guide learning.
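As a rough sketch of how a verifier verdict can be turned into a reward signal (the verdict labels, reward values, and verifier interface below are assumptions for illustration, not the paper's exact setup):

```python
def compute_reward(question: str, gold_answer: str, rollout: str, verifier) -> float:
    """Map a verifier verdict onto a scalar reward for policy optimization.
    `verifier` is any callable returning "correct", "incorrect", or "invalid"."""
    verdict = verifier(question=question, gold=gold_answer, response=rollout)
    if verdict == "correct":
        return 1.0
    # Invalid responses (truncated, repetitive, or refused) also receive zero
    # reward, so the policy gains nothing from gaming the answer extraction.
    return 0.0

# Example with a trivial stand-in verifier:
dummy_verifier = lambda question, gold, response: "correct" if gold in response else "incorrect"
print(compute_reward("What is 2 + 2?", "4", "The answer is 4.", dummy_verifier))  # 1.0
```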

