
SCIVER's Future: The Frontiers Of Multimodal Scientific Claim Verification
3 main points
✔️ SCIVER is a new benchmark for scientific claim verification that integrates text, tables, and figures
✔️ A comparison between human experts and advanced models reveals a substantial gap in the models' reasoning accuracy
✔️ Multi-step reasoning and misinterpretation of visual information emerge as the models' major challenges
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
written by Chengye Wang, Yifei Shen, Zexi Kuang, Arman Cohan, Yilun Zhao
(Submitted on 18 Jun 2025)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper proposes SCIVER, a new benchmark for verifying the correctness of claims made in scientific papers against a wide variety of information.
The benchmark places each claim in a context combining multiple modalities, such as text, tables, and figures, and evaluates how accurately a model can verify it.
SCIVER contains 3,000 examples drawn from 1,113 computer science papers, each accompanied by an expert-annotated rationale. A total of 21 advanced foundation models, including GPT-4.1 and Gemini, are evaluated on it.
Human experts achieved an average accuracy of 93.8%, while state-of-the-art models typically scored only around 70%. This gap illustrates how difficult advanced reasoning in a multimodal context remains for current models.
Proposed Methodology
SCIVER's design is built around a task structure with four reasoning types, used to assess the model's multimodal reasoning capabilities.
First, "Direct Reasoning" measures the ability to extract a single piece of evidence to verify a claim. "Parallel Reasoning" tests the ability to integrate multiple sources of information at once, while "Sequential Reasoning" requires chaining pieces of evidence together step by step. Finally, "Analytical Reasoning" tests the ability to combine domain knowledge with more complex logic to reach a verdict.
Annotation was carried out by 18 domain experts, who rigorously verified the consistency between each claim and its evidence. A distinctive feature of the design is that models must read the content of tables and figures as visual inputs rather than rely on text processing alone. Error analysis also revealed that models stumble mainly on evidence retrieval and multi-step reasoning.
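To make the task structure concrete, the sketch below shows what a SciVer-style example might look like, with the claim, the multimodal context, the reasoning type, and the expert-annotated rationale gathered in one record. The field names and label values here are illustrative assumptions, not the benchmark's actual schema.

```python
# Hedged sketch of a SciVer-style example record.
# Field names and label strings are assumptions, not the dataset's real schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SciVerExample:
    claim: str                       # natural-language claim to verify
    paper_id: str                    # source computer-science paper
    text_context: str                # relevant paragraphs from the paper
    table_images: List[str] = field(default_factory=list)   # paths to table screenshots
    figure_images: List[str] = field(default_factory=list)  # paths to figures/charts
    reasoning_type: str = "direct"   # one of: direct, parallel, sequential, analytical
    label: str = "entailed"          # gold verdict, e.g. entailed or refuted
    rationale: str = ""              # expert-annotated supporting evidence

example = SciVerExample(
    claim="Model A outperforms Model B by more than 2 points on Dataset X.",
    paper_id="hypothetical-2501.00000",
    text_context="Section 5 reports the main results ...",
    table_images=["tables/table_3.png"],
    reasoning_type="parallel",
    label="refuted",
    rationale="Table 3 shows only a 1.4-point gap, below the claimed 2 points.",
)
```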
Experiments
The evaluation covers advanced proprietary models such as GPT-4.1, Gemini-2.5-Flash, and o4-mini, as well as open-source models such as Qwen2.5-VL and Mistral.
In each trial, the model is given a multimodal context containing text, tables, and figures together with a claim, and must judge whether the claim is correct. Prompted with Chain-of-Thought instructions, the model writes out its reasoning step by step, and the final verdict is then extracted automatically.
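This evaluation loop can be sketched roughly as follows. The prompt wording, the `model.generate` call, and the `parse_verdict` helper are assumptions standing in for whatever API each model exposes; they are not the paper's exact protocol.

```python
# Hedged sketch of the evaluation loop: Chain-of-Thought prompting over the
# multimodal context, then automatic extraction of the final verdict.
import re

COT_PROMPT = (
    "You are given text, tables, and figures from a scientific paper, plus a claim.\n"
    "Think step by step about whether the context supports the claim, then end\n"
    "your answer with 'Final answer: entailed' or 'Final answer: refuted'.\n\n"
    "Context:\n{context}\n\nClaim: {claim}\n"
)

def parse_verdict(model_output: str) -> str:
    """Pull the final entailed/refuted decision out of a chain-of-thought response."""
    match = re.search(r"final answer:\s*(entailed|refuted)", model_output, re.IGNORECASE)
    return match.group(1).lower() if match else "unparsed"

def evaluate(model, examples):
    """Accuracy of a multimodal model on a list of SciVer-style examples."""
    correct = 0
    for ex in examples:
        prompt = COT_PROMPT.format(context=ex.text_context, claim=ex.claim)
        # `model.generate` is a placeholder for the API under test; table and
        # figure images are attached alongside the text prompt.
        output = model.generate(prompt, images=ex.table_images + ex.figure_images)
        correct += parse_verdict(output) == ex.label
    return correct / len(examples)
```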
Even the best model reached only about 77% accuracy, against 93.8% for human experts, and model accuracy tended to drop as the amount of required evidence increased. An additional Retrieval-Augmented Generation setting brought some improvement, but multi-step reasoning and misinterpretation of visual elements remained the major challenges.
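The retrieval-augmented setting mentioned above can be approximated by keeping only the context chunks most similar to the claim before building the prompt. The `embed` function below is a placeholder for any text-embedding model and is not part of the paper's tooling.

```python
# Minimal sketch of claim-conditioned retrieval over the multimodal context,
# assuming an external embed() function that maps text to a vector.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve_context(claim: str, chunks: list[str], embed, top_k: int = 5) -> list[str]:
    """Rank context chunks (paragraphs, table captions, figure captions) by
    similarity to the claim and keep the top_k for the verification prompt."""
    claim_vec = embed(claim)
    scored = [(cosine(claim_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```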