
ChartCap: Suppressing Chart-Captioning Hallucinations with a Large Dataset and a New Evaluation Metric

3 main points
✔️ ChartCap is a large dataset of over 560,000 real-world charts with high-quality captions
✔️ Introduces a mechanism that suppresses hallucinations by eliminating extraneous information while fully covering structural elements and key insights
✔️ Model fidelity is assessed with the proposed Visual Consistency Score (VCS), on which models fine-tuned on ChartCap outperform conventional methods and even human-written captions

ChartCap: Mitigating Hallucination of Dense Chart Captioning
written by Junyoung Lim, Jaewoo Ahn, Gunhee Kim
(Submitted on 5 Aug 2025)
Comments: ICCV 2025 (Highlight)

Subjects:  Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This study aims to build vision-language models that generate accurate, information-rich descriptions (captions) of graphs and charts.
Existing chart-caption datasets present two major challenges.

First, captions extracted from papers and reports contain extraneous information that cannot be read from the chart image alone.
Second, they fail to adequately cover structural elements such as axes and legends, and key insights such as maxima and trends.

These problems cause hallucinations in the model, leading to inaccurate descriptions.
To address this, the authors constructed ChartCap, a new dataset of more than 560,000 real-world charts with high-quality captions that contain no extraneous information and that cover structural elements and key insights without over- or under-representing them.

The authors also propose a new metric, the Visual Consistency Score (VCS), which evaluates a caption by regenerating the chart from it and comparing the result with the original image.
This enables an objective measure of how faithfully and accurately a model describes the actual chart.
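The paper's exact VCS implementation is not reproduced here; as a minimal sketch, one can assume the score compares some vision-encoder embedding of the original chart image with that of the chart regenerated from the caption. The embedding step and the caption-to-chart rendering step are placeholders:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def visual_consistency_score(original_emb, reconstructed_emb):
    """Hypothetical VCS sketch: similarity between the original chart
    image and a chart regenerated from the model's caption.
    In the paper the reconstruction goes through generated plotting
    code; here both images are assumed to be already embedded by
    some vision encoder (not specified in this sketch)."""
    return cosine_similarity(original_emb, reconstructed_emb)
```

With this formulation, a caption whose reconstruction matches the original chart exactly would score 1.0, and unrelated reconstructions would score near 0.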

Proposed Methodology

The authors designed a four-step automatic generation pipeline to build the ChartCap dataset.

First, only data-driven charts were extracted from millions of images, excluding non-chart figures (e.g., conceptual and schematic diagrams).
Next, chart types and titles were recognized using GPT-4o and other tools.
Then, structural elements and insights such as legends, axes, extrema, and trends were extracted according to a schema defined for each chart type.

In this process, the roles were divided: GPT-4o handled coarse trend identification, while Claude 3.5 Sonnet handled processing that requires numerical accuracy.
The extraction results were compiled into a semi-structured format and finally converted into natural-language captions.
For further quality assurance, instead of having humans check everything directly, the authors introduced a cycle-consistency-based verification process that "generates Python code from the captions and compares the reconstructed chart with the original image."
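As an illustration of the semi-structured intermediate step, one could imagine a per-chart-type record rendered into a caption by a template. The field names below are hypothetical, not ChartCap's actual schema:

```python
# Hypothetical semi-structured record for a line chart
# (field names are illustrative, not ChartCap's actual schema).
record = {
    "chart_type": "line",
    "title": "Quarterly revenue",
    "x_axis": "Quarter",
    "y_axis": "Revenue (USD millions)",
    "series": ["Product A"],
    "maximum": ("Q4", 42.0),
    "trend": "increasing",
}

def render_caption(rec):
    """Turn the extracted structural fields and insights into a
    natural-language caption (simplified template)."""
    peak_x, peak_y = rec["maximum"]
    return (
        f"A {rec['chart_type']} chart titled '{rec['title']}' plots "
        f"{rec['y_axis']} against {rec['x_axis']} for "
        f"{', '.join(rec['series'])}. The values show an "
        f"{rec['trend']} trend, peaking at {peak_y} in {peak_x}."
    )
```

Because every statement in such a caption is traceable to an extracted field, nothing outside the chart image can leak into the text, which is the property ChartCap is after.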

This streamlines human visual checking and allows a large dataset that is both accurate and comprehensive to be built at low cost.
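The verification step can be sketched as a filter with pluggable components; the rendering and comparison functions below are stubs standing in for the paper's actual code generation and image comparison:

```python
def cycle_consistency_filter(samples, render_fn, similarity_fn, threshold=0.9):
    """Keep only (image, caption) pairs whose caption can be turned
    back into a chart that closely matches the original image.

    render_fn: caption -> reconstructed chart (in the paper, via
        generated Python plotting code); a placeholder here.
    similarity_fn: (original, reconstructed) -> score in [0, 1];
        also a placeholder for the paper's comparison step.
    """
    kept = []
    for image, caption in samples:
        reconstructed = render_fn(caption)
        if similarity_fn(image, reconstructed) >= threshold:
            kept.append((image, caption))
    return kept
```

A toy run with string stand-ins for images shows the filtering behavior: a caption that reconstructs the "image" passes, a mismatched one is dropped.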

Experiments

In our experiments, we compared models trained on ChartCap with existing open source and commercial models.

In addition to the traditional BLEU and ROUGE, we used the proposed Visual Consistency Score (VCS) and OCRScore as evaluation metrics.
As a result, models fine-tuned on ChartCap produced captions that were more accurate, more informative, and contained fewer hallucinations than those of conventional models.

In particular, open source models such as Phi3.5-Vision-4B and InternVL2.5-8B outperformed even the commercial Claude 3.5 Sonnet when tuned with ChartCap.

They also achieved high zero-shot accuracy on other manually validated datasets such as VisText and Chart-to-Text, confirming their generalization capability.
In addition, in human evaluations, the authors report that the output of models trained on ChartCap was often preferred over existing human-written captions.

This demonstrates that ChartCap is more effective than traditional datasets and can contribute significantly to understanding and explaining real-world charts.


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
