
ChartCap: Suppressing Chart-Captioning Hallucinations with a Large Dataset and a New Evaluation Metric

3 main points
✔️ ChartCap is a large dataset of over 560,000 real-world charts with high-quality captions
✔️ Introduces a mechanism that suppresses hallucinations by eliminating extraneous information while fully covering structural elements and key insights
✔️ Model fidelity is assessed with the proposed Visual Consistency Score (VCS), on which models fine-tuned on ChartCap outperform conventional methods and even human-written captions

ChartCap: Mitigating Hallucination of Dense Chart Captioning
written by Junyoung Lim, Jaewoo Ahn, Gunhee Kim
(Submitted on 5 Aug 2025)
Comments: ICCV 2025 (Highlight)

Subjects:  Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This study aims to build vision-language models that generate accurate, information-rich descriptions (captions) of graphs and charts.
Existing chart-caption datasets present two major challenges.

First, captions extracted from papers and reports contain extraneous information that cannot be read from the chart image alone.
Second, they fail to adequately cover structural elements such as axes and legends, and key insights such as maxima and trends.

These problems cause hallucinations in the model, leading to inaccurate descriptions.
To address this, the authors constructed ChartCap, a new dataset of more than 560,000 real-world charts with high-quality captions that contain no extraneous information and that cover structural elements and key insights without over- or under-representing them.

The authors also propose a new metric, the Visual Consistency Score (VCS), which evaluates a caption by regenerating the chart from it and comparing the result with the original image.
This enables an objective measure of how faithfully and accurately a model describes the actual chart.
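The paper's exact VCS implementation is not reproduced here; as a minimal sketch, one can assume the score compares some vision-encoder embedding of the original chart image with that of the chart regenerated from the caption. The embedding step and the caption-to-chart rendering step are placeholders:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def visual_consistency_score(original_emb, reconstructed_emb):
    """Hypothetical VCS sketch: similarity between the original chart
    image and a chart regenerated from the model's caption.
    In the paper the reconstruction goes through generated plotting
    code; here both images are assumed to be already embedded by
    some vision encoder (not specified in this sketch)."""
    return cosine_similarity(original_emb, reconstructed_emb)
```

With this formulation, a caption whose reconstruction matches the original chart exactly would score 1.0, and unrelated reconstructions would score near 0.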

Proposed Methodology

The authors designed a four-step automatic generation pipeline to build the ChartCap dataset.

First, only data-driven charts were extracted from millions of images, excluding non-chart figures (e.g., conceptual and schematic diagrams).
Next, chart types and titles were recognized using GPT-4o and other tools.
Then, structural elements and insights such as legends, axes, extrema, and trends were extracted according to a schema defined for each chart type.

In this process, the roles were divided: GPT-4o handled coarse trend identification, while Claude 3.5 Sonnet handled processing that requires numerical accuracy.
The extraction results were compiled into a semi-structured format and finally converted into natural-language captions.
For further quality assurance, instead of having humans check everything directly, the authors introduced a cycle-consistency-based verification process that "generates Python code from the captions and compares the reconstructed chart with the original image."
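As an illustration of the semi-structured intermediate step, one could imagine a per-chart-type record rendered into a caption by a template. The field names below are hypothetical, not ChartCap's actual schema:

```python
# Hypothetical semi-structured record for a line chart
# (field names are illustrative, not ChartCap's actual schema).
record = {
    "chart_type": "line",
    "title": "Quarterly revenue",
    "x_axis": "Quarter",
    "y_axis": "Revenue (USD millions)",
    "series": ["Product A"],
    "maximum": ("Q4", 42.0),
    "trend": "increasing",
}

def render_caption(rec):
    """Turn the extracted structural fields and insights into a
    natural-language caption (simplified template)."""
    peak_x, peak_y = rec["maximum"]
    return (
        f"A {rec['chart_type']} chart titled '{rec['title']}' plots "
        f"{rec['y_axis']} against {rec['x_axis']} for "
        f"{', '.join(rec['series'])}. The values show an "
        f"{rec['trend']} trend, peaking at {peak_y} in {peak_x}."
    )
```

Because every statement in such a caption is traceable to an extracted field, nothing outside the chart image can leak into the text, which is the property ChartCap is after.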

This streamlines human visual checking and allows a large dataset that is both accurate and comprehensive to be built at low cost.
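The verification step can be sketched as a filter with pluggable components; the rendering and comparison functions below are stubs standing in for the paper's actual code generation and image comparison:

```python
def cycle_consistency_filter(samples, render_fn, similarity_fn, threshold=0.9):
    """Keep only (image, caption) pairs whose caption can be turned
    back into a chart that closely matches the original image.

    render_fn: caption -> reconstructed chart (in the paper, via
        generated Python plotting code); a placeholder here.
    similarity_fn: (original, reconstructed) -> score in [0, 1];
        also a placeholder for the paper's comparison step.
    """
    kept = []
    for image, caption in samples:
        reconstructed = render_fn(caption)
        if similarity_fn(image, reconstructed) >= threshold:
            kept.append((image, caption))
    return kept
```

A toy run with string stand-ins for images shows the filtering behavior: a caption that reconstructs the "image" passes, a mismatched one is dropped.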

Experiments

In our experiments, we compared models trained on ChartCap with existing open source and commercial models.

In addition to the traditional BLEU and ROUGE, we used the proposed Visual Consistency Score (VCS) and OCRScore as evaluation metrics.
As a result, models fine-tuned on ChartCap produced captions that were more accurate, more informative, and contained fewer hallucinations than those of conventional models.

In particular, open source models such as Phi3.5-Vision-4B and InternVL2.5-8B outperformed even the commercial Claude 3.5 Sonnet when tuned with ChartCap.

They also achieved high zero-shot accuracy on other manually validated datasets such as VisText and Chart-to-Text, confirming their generalization capability.
In addition, in human evaluations, the authors report that the output of models trained on ChartCap was often preferred over existing human-written captions.

This demonstrates that ChartCap is more effective than traditional datasets and can contribute significantly to understanding and explaining real-world charts.


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
