Plot2Code: A Benchmark For Testing Multimodal LLM Code Generation
3 main points
✔️ Proposes Plot2Code, a new benchmark for evaluating the code generation capabilities of multimodal language models
✔️ Introduces code pass rate, text agreement rate, and a GPT-4V judgment score as evaluation metrics
✔️ Plot2Code reveals challenges and shows room for improvement in multimodal language models
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
written by Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo
(Submitted on 13 May 2024)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Rapid advances in big data and computational power have led to the emergence of large-scale language models such as ChatGPT and GPT-4 in both industry and academia. Alongside them, multimodal large-scale language models are evolving rapidly, including GPT-4V, Gemini, Claude-3, and the open-source models LLaVA and Mini-GPT. Various evaluation benchmarks have been created to assess the ability of these models to interpret visual information, but there remains a lack of research on "charts in text-dense images".
This paper evaluates the ability of multimodal large-scale language models to generate code that faithfully renders a given image, a task that exercises their multimodal understanding and reasoning. Whether these models can accurately interpret visual information, associate it with text, and produce executable code is considered an important topic for future research.
Therefore, this paper proposes a new evaluation benchmark, Plot2Code. The benchmark is designed to evaluate the multimodal comprehension, reasoning, and coding capabilities of multimodal large-scale language models, using a dataset of 132 matplotlib plots. Each plot is accompanied by the corresponding code and a detailed description, and evaluation settings are provided for a variety of input and output formats. Using this benchmark, 14 publicly available multimodal large-scale language models are evaluated, and the results show that there is room for improvement on this visual-information coding task.
The figures below provide an overview of Plot2Code: (a) representative samples of the reference plots in the Plot2Code dataset, (b) a sample plot generated by a multimodal large-scale language model from a reference image, and (c) the comprehensive pipeline used to evaluate the code generation capability of multimodal large-scale language models.
Plot2Code is expected to be used effectively by the research community to promote further research and development of multimodal large-scale language models.
Building the Plot2Code Benchmark
Here is how the benchmark data are collected and processed. First, we crawl all website links listed in the matplotlib gallery and extract code blocks from each HTML file. This yields 841 code blocks, which are then subjected to several processing steps, including filtering. The figure below shows a sample from the Plot2Code benchmark.
First, we obtain structured plot/code pairs that allow us to effectively evaluate the code generation capabilities of multimodal large-scale language models. The Python code initially crawled is not necessarily suitable for generating high-quality plots, so automatic processing is combined with manual filtering.
The collected data may contain multiple code segments in a single HTML file, and individual segments may lack the import lines or initialization functions needed to generate a plot on their own. To address this, we extract code only from HTML files that contain a single block of code, which ensures that all important components are included and that plots can be generated without additional dependencies. We then filter out all code that cannot generate a plot, leaving 529 plot-code pairs.
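As a rough illustration of this automatic step, the sketch below keeps only pages with a single code block and checks that the extracted code actually renders a figure. The `<pre>` selector, function names, and execution strategy are assumptions for illustration, not the authors' actual pipeline.

```python
from bs4 import BeautifulSoup

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt


def extract_single_code_block(html: str):
    """Return the page's code block only if exactly one block is present."""
    blocks = BeautifulSoup(html, "html.parser").find_all("pre")
    return blocks[0].get_text() if len(blocks) == 1 else None


def renders_a_plot(code: str) -> bool:
    """Execute the code and check that at least one figure was produced."""
    plt.close("all")
    try:
        exec(code, {"__name__": "__main__"})  # code is assumed to be self-contained
    except Exception:
        return False
    produced = len(plt.get_fignums()) > 0
    plt.close("all")
    return produced
```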
For this benchmark, the plots are assumed to be simple static figures with no animation or interaction, so each plot is treated as an image file rendered by the matplotlib engine. For this reason, plots whose corresponding URLs contain certain tags (e.g., animation, widgets, or event handling) are filtered out. A detailed breakdown of the plot-code pairs in the dataset is shown in the figure below.
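The URL-based tag filter just described can be pictured as follows; the tag strings mirror the examples named above, and the gallery URLs are illustrative only.

```python
# Drop samples whose gallery URL indicates animation or interactivity.
# The tag list follows the examples mentioned above and is not exhaustive.
EXCLUDED_TAGS = ("animation", "widgets", "event_handling")


def is_static_plot(url: str) -> bool:
    """Keep only plots that render as static images."""
    return not any(tag in url for tag in EXCLUDED_TAGS)


# Example with illustrative gallery URLs: only the first one is kept.
urls = [
    "https://matplotlib.org/stable/gallery/lines_bars_and_markers/simple_plot.html",
    "https://matplotlib.org/stable/gallery/animation/simple_anim.html",
]
static_urls = [u for u in urls if is_static_plot(u)]
```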
In addition, after the above processing, manual filtering was performed to obtain the final set based on the following criteria:
- Plots do not depend on external files and can be rendered directly using the corresponding code
- Plots are diverse in terms of size, text, color, and type and serve as a broad benchmark for commonly used charts and plots
- Plots are evenly distributed across a range of difficulty levels, from beginner to professional
Manual filtering applies a more rigorous standard to retain only high-quality plots, ultimately yielding 132 test samples that serve as the benchmark.
The constructed test set is evaluated in two settings: Direct Asking and Conditional Asking. Direct Asking requests that the code be surrounded by specific markers so that it can be easily extracted, using regular expressions, from the responses generated by the multimodal large-scale language model.
In "Direct Asking," an image is given as input to a large multimodal language model and asked to generate executable code that produces a graph similar to it.
In Conditional Asking, the multimodal large-scale language model is given an image and a condition (a textual instruction) as input and is asked to generate executable code that satisfies the specified condition. Text-only large-scale language models receive only the instruction as input; the other requirements match those for the multimodal models. These instructions are extracted from the reference code using GPT-4 and are written to avoid code implementation details while retaining all information necessary to reproduce the plot.
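Schematically, the two input variants could be assembled as in the sketch below; the prompt wording and field names are placeholders, not the benchmark's actual prompt.

```python
def build_request(instruction: str, reference_image=None) -> dict:
    """Assemble a Conditional Asking request.

    `reference_image` is the path to the reference plot for multimodal models,
    or None for text-only models, which receive only the instruction.
    The wording below is a placeholder, not the benchmark's actual prompt.
    """
    prompt = (
        "Write executable matplotlib code that reproduces the plot described by "
        f"the following instruction:\n{instruction}\n"
        "Wrap the code in a fenced Python code block."
    )
    return {"prompt": prompt, "image": reference_image}
```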
Statistics for the test samples are shown in the table below. The type of each sample is determined based on the tags present in the URLs retrieved from the matplotlib gallery; the categories themselves are defined by the matplotlib gallery. The most common types include lines, bars, and markers; other categories include contour lines, fields, pie charts, polar plots, subplot axes, statistical expressions, and text labels and annotations.
For this benchmark, the code pass rate, text agreement rate, and GPT-4V judgment score are proposed as metrics that measure different aspects of performance.
The multimodal large-scale language model is expected to generate code that can be rendered into images using matplotlib. Therefore, a "code pass rate" is calculated to determine if the multimodal large-scale language model can generate executable code based on the input reference image and instructions.
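A minimal sketch of how such a pass rate could be computed, by executing each generated snippet and checking that a figure is produced (names are illustrative, not the paper's evaluation harness):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt


def code_pass_rate(generated_snippets: list) -> float:
    """Fraction of generated snippets that run and render at least one figure."""
    if not generated_snippets:
        return 0.0
    passed = 0
    for code in generated_snippets:
        plt.close("all")
        try:
            exec(code, {"__name__": "__main__"})
            if plt.get_fignums():
                passed += 1
        except Exception:
            pass  # any execution error counts as a failed sample
    plt.close("all")
    return passed / len(generated_snippets)
```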
We have also designed an evaluation pipeline that uses the GPT-4V model to assess the similarity between the generated plots and the reference plots. This pipeline assigns each test sample a rating on a scale of 1 to 10, taking into account overall appearance, color, shape, position, and other visual factors. The prompts used for the evaluation are shown below.
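As a hedged sketch, such a judge could be called through the OpenAI Python SDK roughly as follows; the short prompt, the model name, and the score handling are stand-ins rather than the benchmark's exact setup (the real judge prompt is the one referenced above).

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()


def gpt4v_similarity_rating(reference_png: str, generated_png: str) -> str:
    # Placeholder prompt: the benchmark's actual judge prompt is more detailed.
    prompt = ("On a scale of 1 to 10, rate how similar the second plot is to the "
              "first, considering overall appearance, colors, shapes, and positions. "
              "Reply with the number only.")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # model name is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(reference_png)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(generated_png)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```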
While the similarity assessment produced by GPT-4V is important, it does not take into account detailed plot components such as text, which are essential for interpreting a plot. Therefore, to evaluate the similarity between the generated and reference plots at the level of detail, we also introduce the text agreement rate. This metric measures how accurately the text elements present in the reference sample are reproduced in the evaluated plot, and checks that no extra text appears in the generated image.
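One way to picture this metric, under the assumption that text elements can be read directly from the rendered matplotlib figures (the paper's actual extraction and scoring may differ):

```python
from matplotlib.figure import Figure
from matplotlib.text import Text


def figure_texts(fig: Figure) -> set:
    """Collect all non-empty text strings drawn in a matplotlib figure."""
    return {t.get_text() for t in fig.findobj(Text) if t.get_text().strip()}


def text_agreement_rate(reference_fig: Figure, generated_fig: Figure) -> float:
    """Fraction of reference text elements reproduced in the generated plot.

    Extra text in the generated plot could be penalized with an analogous
    precision-style term; that refinement is omitted in this sketch.
    """
    ref, gen = figure_texts(reference_fig), figure_texts(generated_fig)
    if not ref:
        return 1.0
    return len(ref & gen) / len(ref)
```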
Furthermore, as shown in the table below, the dataset constructed in this paper covers the widest range of evaluation settings and metrics compared with other unimodal and multimodal code benchmarks.
Experimental Results
Here we evaluate and compare the performance of various multimodal large-scale language models using the Plot2Code benchmark. Both closed-source commercial API models and open-source models are used.
Fourteen leading closed-source and open-source multimodal and text-only large-scale language models are evaluated, including GPT, DeepSeek, Mistral, Mixtral, and Yi. Various prompting strategies, such as Chain-of-Thought and Plan-and-Solve, are also considered.
Evaluation is performed in two settings, Direct Asking and Conditional Asking; large-scale language models that cannot interpret visual information are evaluated using Conditional Asking only. In addition, pairwise evaluation of two multimodal large-scale language models is conducted using GPT-4V judgments, and the correlation between GPT-4V judgments and human evaluations is analyzed. The quantitative results on the Plot2Code benchmark are shown in the table below.
Plot2Code proves to be a difficult benchmark even for advanced models. For example, Claude-3-Opus, Gemini-Pro, and GPT-4V received ratings of only 7.68, 7.10, and 7.68, respectively, in the Conditional Asking setting, indicating significant room for improvement. Adding instructions also lowered the pass rates of multimodal large-scale language models; for example, Gemini-Pro's pass rate dropped from 68.2% to 55.3%. Models that score well on other benchmarks, such as MT-Bench and HumanEval, still struggle on Plot2Code, which is a more demanding evaluation because it tests the ability to understand and reason about visual information.
The differences between closed-source and open-source models are also examined, and the performance of open-source models is found to lag behind that of closed-source models. For example, among the latest open-source multimodal large-scale language models evaluated, DeepSeek-VL, Mini-Gemini, and LLaVA-Next, the best performing model was Mini-Gemini-8x7B-HD, which scored 6.08 on the GPT-4V judgment with a code pass rate of 58.4%. This performance, however, is not comparable to that of commercial closed-source multimodal large-scale language models. It is clear that the open-source community needs to develop models that can compete with, and even exceed, commercial models.
Thus, the Plot2Code benchmark provides a detailed evaluation of the performance of various multimodal large-scale language models and identifies challenges and areas for improvement.
We also analyze various other aspects of the models, including prompting strategy, backbone LLMs, and resolution settings. As shown in the table below, the performance of the models correlates strongly with the large language models on which they are built. This is confirmed for both Mini-Gemini and LLaVA, suggesting that a strong backbone is required for the Plot2Code task: it aids the reasoning process and improves the ability to generate executable code.
We also compare performance across the two evaluation settings. The table above shows that, for the multimodal large-scale language models, the overall pass rate is higher for Direct Asking than for Conditional Asking. This may be because the additional instructions in Conditional Asking impose stricter constraints on the generated code, making it harder to produce executable code. However, the additional instructions do improve the similarity of the generated images. We also compare the impact of prompting strategies such as Chain-of-Thought and Plan-and-Solve, but these prompts show no clear advantage over the default prompts.
In addition, we extend the GPT-4V decision setting to pairwise evaluation of two multimodal large-scale language models, following the traditional pairwise model evaluation. For each reference sample, we let GPT-4V decide which generated image is more similar. To reduce the influence of different positions, the two generated images are swapped for additional evaluation, and the model is considered a winner only if it wins in both rounds. The results are shown in the figure below.
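A sketch of that position-swap rule, assuming a `judge` callable that reports which of the two candidate images it prefers (all names here are illustrative):

```python
def pairwise_winner(reference, image_a, image_b, judge) -> str:
    """Decide a pairwise comparison with position swapping.

    `judge(reference, first, second)` is assumed to return "first" or "second".
    A model wins only if it is preferred in both orderings; otherwise the
    comparison is treated as a tie, consistent with the rule described above.
    """
    round_1 = judge(reference, image_a, image_b)  # model A's image shown first
    round_2 = judge(reference, image_b, image_a)  # positions swapped
    if round_1 == "first" and round_2 == "second":
        return "A"
    if round_1 == "second" and round_2 == "first":
        return "B"
    return "tie"
```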
From these results, we infer that adding image input to GPT-4V is beneficial for generating high-quality plots compared with GPT-4, and that the commonly used Chain-of-Thought prompting strategy provides no additional advantage on this benchmark. By analyzing the impact of different settings, we identify specific directions for improving the performance of multimodal large-scale language models.
Summary
This paper proposes Plot2Code, a comprehensive benchmark for evaluating the code generation capabilities of multimodal language models. The benchmark covers plots of varied types and complexity and has been shown to be a powerful tool for evaluating the performance of different models. The paper also proposes metrics such as the code pass rate, text agreement rate, and GPT-4V judgment score, and shows that these are useful for evaluating the overall performance of a model.
Evaluation using the Plot2Code benchmark reveals significant performance differences among the models, suggesting the challenges posed by this task and the room for improvement in current models. While some models are able to generate executable code and produce plots that resemble reference images, they still find it difficult to accurately reproduce all textual elements and fine details.
In the future, it is hoped that the Plot2Code benchmark will advance research and development in multimodal reasoning, interpretation of text-dense images, and complex code generation. It is also hoped that continued research in this area will further close the gap between multimodal language models from the open-source community and closed-source commercial APIs.