
[SKETCHPAD] Enhanced Inference of Multimodal Language Models with Intermediate Sketches


Large Language Models

3 main points
✔️ Proposes a new framework, SKETCHPAD, that allows language models to generate intermediate sketches to improve inference performance
✔️ Consistently improves base model performance on all mathematical tasks, including geometry, functions, graph algorithms, and game strategies
✔️ Combines vision-specific models (object detection, segmentation, depth estimation, etc.) and consistently improves base model performance on computer vision tasks

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
written by Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, Ranjay Krishna
(Submitted on 13 Jun 2024)
Project and code URL: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Sketching is an effective tool for a wide variety of purposes, including idea generation and problem solving. Because sketches directly convey visual and spatial information that cannot be expressed in words, they have been used around the world for a long time, from ancient rock drawings to modern architectural drawings. Children use them to solve geometry problems, engineers to explain prototypes, architects to create blueprints, and scientists to communicate complex concepts and experimental results.

Recent advances in multimodal language models have drawn attention to tasks where reasoning can be simplified by drawing so-called "intermediate sketches." In major benchmarks, such as those involving geometry or complex math problems, models are fed images of diagrams and asked questions that require symbolic or spatial understanding. For geometry problems, for example, inference performance can be improved by drawing intermediate sketches, such as auxiliary lines, just as a human would when solving the problem.

Computer vision benchmarks have similar characteristics: in object detection, the model's detection performance improves when bounding boxes are drawn around objects, and in depth estimation, when a color map is drawn according to depth. The recently proposed BLINK and VBench also focus on intermediate sketches. At the same time, however, frameworks for sketch-based inference in current language models have not been adequately studied.

This paper proposes SKETCHPAD, a tool that generates intermediate sketches for inference. It is inspired by chain-of-thought (CoT) reasoning in text and prompts the underlying visual language model to generate visual artifacts as part of a mixed chain of text, program, and visual reasoning. For example, when proving that the sum of the angles of a triangle is 180 degrees, as in Figure (a) below, SKETCHPAD allows the agent to edit the diagram by introducing a new auxiliary line. It then provides supplementary information about this new line and the angles that appear with it, and uses them to solve the geometry task.

SKETCHPAD also improves the model's spatial inference performance in computer vision. When determining whether a cookie is stacked on top of another cookie, as in Figure (d) above, the model first performs an intermediate depth estimation. Analyzing this estimate reveals that the cookies are stacked on top of each other, allowing the model to derive an accurate answer.

This paper demonstrates the effectiveness of SKETCHPAD on a wide range of mathematical and computer vision tasks. In mathematics, the paper addresses questions in geometry, mathematical functions, graph algorithms, and strategy games. For geometry questions, SKETCHPAD prompts the model to generate Matplotlib code using auxiliary lines and variables based on the diagram input and questions. Even for pure language input, SKETCHPAD allows models to plot functions and reason about their properties, demonstrating its ability to support inference for language-based input as well. Across all categories of math tasks, SKETCHPAD performs about 10% better than the baseline GPT-4 performance.

In computer vision, the paper tackles tasks as diverse as depth estimation, spatial inference, jigsaw puzzles, visual correspondence, semantic correspondence, and questions from MMVP and VBench. In this area, SKETCHPAD allows models to generate segmentation masks, crop images, draw bounding boxes, zoom into image regions, and overlay images. As with mathematics, SKETCHPAD consistently shows excellent performance on all seven types of computer vision tasks. For example, with SKETCHPAD, GPT-4 achieved a 14.3% improvement on VBench and 12.1% and 9.7% improvements on the BLINK depth and semantic correspondence tasks, respectively.

Furthermore, the authors analyzed the effects of SKETCHPAD by comparing model-generated and human-generated plans, finding that they were well aligned and showed similar inference patterns. SKETCHPAD is expected to serve as a catalyst for new research toward more advanced and interpretable multimodal AI.

New Framework "SKETCHPAD"

This paper proposes SKETCHPAD, a general framework for multimodal language models to draw sketches as an intermediate step in reasoning, which can then be leveraged for further reasoning. The figure below provides an example of how SKETCHPAD might work.

Upon input of a multimodal query, the SKETCHPAD agent generates a sketch plan (Thought) to address the query and then generates a program to create the sketch (Action). The generated sketch (Observation) serves as a visual representation of the reasoning process, which the model then analyzes to produce a final output for the query.

In the first step, Thought, the model analyzes the context (including the query, previous thoughts, actions, and observations) and generates a plan for the next action. For example, given the query "Find ∠EIC" in Figure (a) above, the model plans to draw an auxiliary line IX parallel to BD.

In the second step, Action, based on Thought, the model performs actions that manipulate both visual and textual content. In the geometry example, the model generates Python code that modifies the original geometric drawing to draw auxiliary lines. The generated code is compiled and executed.

In the third step, Observation, the SKETCHPAD environment returns a new observation based on the Action. In the geometry example, a new diagram with the added auxiliary lines is returned.

With this framework, multimodal language models can sketch immediately, without any fine-tuning or additional training.

This multi-turn interactive process continues until the model determines that it has collected enough information to answer the query. At this point, the model generates a special exit action and outputs the answer.
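The Thought → Action → Observation loop described above can be sketched as a minimal agent shell. This is an illustrative reconstruction, not the paper's implementation: `toy_model` and the `execute` callable are hypothetical stand-ins for the language model call and the sketch-rendering environment.

```python
# Minimal sketch of the SKETCHPAD agent loop (illustrative only).
# The model alternates Thought, Action, and Observation until it
# emits the special exit action and returns its answer.

def run_sketchpad_agent(query, model, execute, max_turns=10):
    """model(context) -> (thought, action); execute(action) -> observation.
    An action of the form ("ANSWER", text) terminates the loop."""
    context = [("Query", query)]
    for _ in range(max_turns):
        thought, action = model(context)           # Thought: plan next step
        context.append(("Thought", thought))
        if action[0] == "ANSWER":                  # special exit action
            return action[1]
        observation = execute(action)              # Action: run sketch code
        context.append(("Action", action))
        context.append(("Observation", observation))  # sketch fed back in

# Toy usage: a scripted "model" that draws once, then answers.
def toy_model(context):
    if not any(role == "Observation" for role, _ in context):
        return "Draw auxiliary line", ("DRAW", "line IX parallel to BD")
    return "Angles now readable", ("ANSWER", "found via parallel-line properties")

answer = run_sketchpad_agent("Find ∠EIC", toy_model, lambda a: "diagram+" + a[1])
print(answer)  # found via parallel-line properties
```

The key design point, mirroring the description above, is that observations are images (here just strings) appended back into the model's context, so later turns can condition on the sketch.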

Unlike traditional research, where language models generate and manipulate primarily text-based observations and actions, SKETCHPAD allows models to manipulate both visual and textual content. This lets models use the sketches they draw to plan and reason, improving their problem-solving abilities.

At the heart of SKETCHPAD is the sketching feature, which allows language models to generate programs that create sketches. These programs are executed by calling various specialized vision models and Python plotting packages. Similar to the recently reported ViperGPT and VPD, SKETCHPAD allows language models to sketch through code generation. Detailed descriptions of the tools that allow language models to generate multimodal content are provided through prompts (example prompts can be found in the supplementary material of the paper).

SKETCHPAD uses a variety of tools to produce sketches, depending on the task. For math tasks, common Python packages such as matplotlib and networkx are used for plotting; for image tasks, the language model leverages vision models during sketching. These include detection tools that draw bounding boxes on the image, and segmentation and marking tools that draw colorful masks and label each segment with a number.

Sketching in Mathematical Tasks

Here we use SKETCHPAD to tackle four complex mathematical tasks (geometry, functions, graph algorithms, and game strategies). We show that incorporating sketching capabilities into a language model significantly improves performance on mathematical problems and achieves new state-of-the-art results.

First, we have geometry problems. In this area, drawing auxiliary lines can be very helpful in solving problems. As we saw earlier in Figure (a), the question is "find ∠EIC". In this case, the language model plans to draw an auxiliary line IX parallel to BD, thereby using the properties of parallel lines to find ∠EIC.

To evaluate the effectiveness of SKETCHPAD, we use problems from the Geometry3K dataset. SKETCHPAD takes as input a geometry figure and the corresponding matplotlib code, proposes modifications to the code that generate auxiliary lines, runs the modified code, and visualizes the updated diagram with the auxiliary lines added.
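The kind of matplotlib edit SKETCHPAD generates for this can be sketched as follows. This is not the paper's actual code: the point names follow the ∠EIC example, but the coordinates are made up for illustration.

```python
# Illustrative only: the sort of matplotlib code SKETCHPAD might generate
# to add an auxiliary line to an existing geometry figure. Point names
# (B, D, I, X) follow the ∠EIC example; coordinates are invented.
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

B, D, I = (0.0, 0.0), (4.0, 1.0), (2.0, 3.0)

fig, ax = plt.subplots()
ax.plot([B[0], D[0]], [B[1], D[1]], "k-", label="BD")  # existing segment

# Auxiliary line IX through I, parallel to BD (same direction vector).
dx, dy = D[0] - B[0], D[1] - B[1]
X = (I[0] + dx, I[1] + dy)
ax.plot([I[0], X[0]], [I[1], X[1]], "r--", label="IX ∥ BD")

ax.legend()
fig.savefig("sketch.png")  # the updated diagram returned as the Observation
```

The saved image plays the role of the Observation in the agent loop: the model inspects it and uses parallel-line angle properties to answer.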

Next are function problems. Functions are important in a variety of applications in science, engineering, and economics. Here we focus on two tasks from the IsoBench dataset: even-odd classification and convex-concave determination. Even-odd classification determines whether a function is an even function, an odd function, or neither: an even function satisfies f(-x) = f(x) for all x, and an odd function satisfies f(-x) = -f(x). Convex-concave determination decides whether a function is convex or concave.

While traditional language models analyze a function and attempt to prove its properties, SKETCHPAD can efficiently solve the problem by visually sketching the function. To determine the convexity of the function in Figure (b) below, SKETCHPAD uses matplotlib to plot the function and visually verify its overall shape.
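SKETCHPAD judges these properties by looking at the plotted curve; as a programmatic stand-in (illustrative, not the paper's code), the sketch below plots the function and also checks the same properties numerically, using second differences for convexity and symmetry for parity.

```python
# Plot a function as SKETCHPAD would, and check convexity and parity
# numerically as a stand-in for the model's visual inspection.
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def classify(f, xs):
    ys = [f(x) for x in xs]
    plt.plot(xs, ys)                  # the sketch the model would inspect
    plt.savefig("function.png")
    # Convex iff second differences are non-negative on the sample grid.
    second_diffs = [ys[i-1] - 2*ys[i] + ys[i+1] for i in range(1, len(ys)-1)]
    convex = all(d >= 0 for d in second_diffs)
    even = all(abs(f(x) - f(-x)) < 1e-9 for x in xs)   # f(-x) = f(x)
    odd = all(abs(f(x) + f(-x)) < 1e-9 for x in xs)    # f(-x) = -f(x)
    return convex, ("even" if even else "odd" if odd else "neither")

xs = [i / 10 for i in range(-30, 31)]
print(classify(lambda x: x * x, xs))   # x^2 is convex and even
```

A numeric check on a sample grid is of course only evidence, not a proof, which mirrors the visual judgment the model makes from the plot.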

Next are graph algorithm problems. Many real-world problems involving computer networks and transportation systems can be formulated as graph algorithm problems, and we evaluate SKETCHPAD on three graph algorithm tasks from IsoBench: connectivity, maximum flow, and isomorphism. Graph connectivity determines whether a path exists between two vertices in a graph. Maximum flow determines the maximum amount of flow that can be sent from a source vertex to a sink vertex in a network whose edges are subject to capacity constraints. Graph isomorphism determines whether two graphs are structurally equivalent.

Given a graph adjacency matrix like the one in Figure (b) below, SKETCHPAD uses Python's networkx library to draw the actual graph structure, allowing direct visual inference of graph properties and relationships.
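A minimal sketch of this step (illustrative, not the paper's code): build a networkx graph from a toy adjacency matrix, draw it as SKETCHPAD would, and reason about connectivity.

```python
# Turn an adjacency matrix into a drawn graph with networkx,
# then answer a connectivity query on it. Toy matrix, illustrative only.
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import networkx as nx

adj = [  # toy undirected adjacency matrix: vertex 3 is isolated
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]

G = nx.Graph()
G.add_nodes_from(range(len(adj)))
for i, row in enumerate(adj):
    for j, v in enumerate(row[i + 1:], start=i + 1):  # upper triangle
        if v:
            G.add_edge(i, j)

nx.draw(G, with_labels=True)   # the sketch the model would observe
plt.savefig("graph.png")

print(nx.has_path(G, 0, 2))  # True: 0 and 2 are connected
print(nx.has_path(G, 0, 3))  # False: 3 is isolated
```

For maximum flow, the same pattern applies with a directed graph carrying edge capacities (e.g. `nx.maximum_flow`); the drawn structure is what lets the model reason visually rather than over the raw matrix.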

Finally, there is game strategy. Chess games can be represented in a variety of formats, including visual board states and textual move notation. Even given only textual move notation, SKETCHPAD draws the chess board, analyzes positions, and formulates strategy. We evaluate SKETCHPAD on the winner identification task from the IsoBench dataset, which asks for the outcome of a chess game (white win, black win, or draw) based on the final board state. To create the graphical board, SKETCHPAD uses the Python chess library to draw the board from its Forsyth-Edwards Notation (FEN).
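With the python-chess library, the board-drawing step can be sketched as below. This is an illustrative reconstruction, not the paper's code; the position used is the well-known fool's mate, in which black has just checkmated white.

```python
# Render a board from FEN with python-chess and read off the game outcome.
# Illustrative sketch; the FEN is the final position of the fool's mate.
import chess
import chess.svg

fen = "rnb1kbnr/pppp1ppp/8/4p3/6Pq/5P2/PPPPP2P/RNBQKBNR w KQkq - 0 3"
board = chess.Board(fen)
svg = chess.svg.board(board)   # the drawn board (SVG markup) for the model

if board.is_checkmate():
    # the side to move is checkmated, so the other side wins
    winner = "black" if board.turn == chess.WHITE else "white"
elif board.is_stalemate():
    winner = "draw"
else:
    winner = "undecided"
print(winner)  # black
```

In SKETCHPAD the model inspects the rendered board rather than calling `is_checkmate()` directly; the programmatic check here just confirms what the drawn position shows.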

We evaluate SKETCHPAD using multimodal language models with API access (gpt-4-turbo-2024-04-09 and gpt-4o-2024-05-13). These results are compared to baselines without SKETCHPAD, major closed-source models such as Claude 3 and Gemini Pro, and open-source models such as Mistral and LLaMA-2 70B.

As shown in the table below, SKETCHPAD consistently improves the performance of the base model across all tasks, averaging 18.8% for GPT-4o and 13.5% for GPT-4 Turbo.

In particular, significant improvements are seen in graph algorithms such as graph connectivity (Connectivity) and maximum flow (Maxflow). For example, GPT-4o with SKETCHPAD achieves 66.3% accuracy on Maxflow, a 41.3% improvement over the base model. Similarly, on the function tasks, GPT-4 Turbo achieves over 90% accuracy and GPT-4o over 88%, with significant improvements on the convexity and even-odd classification tasks. In addition, there is an improvement of about 20% on game strategy, indicating that the drawn game board improves reasoning about strategy. These results demonstrate that SKETCHPAD is an effective means of enhancing the reasoning capabilities of multimodal language models in a variety of areas.

Sketching in Computer Vision Tasks

Here, we use SKETCHPAD to tackle complex visual reasoning tasks. Recent research (BLINK) has shown that many current multimodal language models still lack core visual recognition capabilities. Dedicated computer vision models, on the other hand, have such capabilities. Furthermore, SoM research has shown that drawing segmentation masks on images can elicit the strong visual grounding capabilities of GPT-4V. In this paper, we generalize these ideas in SKETCHPAD so that language models can sketch using specialized vision models.

We experiment with SKETCHPAD on three complex visual inference tasks (VBench, MMVP, and BLINK). BLINK is a benchmark of visual recognition tasks that are easy for humans but challenging for multimodal language models. Specifically, it includes relative depth, spatial inference, jigsaw puzzle, visual correspondence, and semantic correspondence tasks.

In SKETCHPAD, the language model uses several modules (detection, segmentation, depth estimation, visual search with sliding windows, and other image manipulation modules) to sketch and manipulate images. These modules are implemented as Python functions that the language model can call.

The detection module takes an image and a text query (e.g., "cat") as input, runs the Grounding-DINO open-vocabulary object detection model, and plots the detected bounding boxes (with number labels) on the image. It also returns the coordinates of the bounding boxes.

The segmentation module takes an image as input and returns an image with colorful segmentation masks drawn on it; each mask is labeled with a number. The base segmentation models are SegmentAnything and Semantic-SAM. The depth estimation module takes an image as input and returns a depth map; the base model is DepthAnything.

The sliding-window visual search module mimics the way humans search for small items in an image. It takes a text query as input and runs a sliding window over the image; the window size is 1/3 of the image size and the step size is 2/9 of the image size. It returns the sequence of detected image patches.
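The window geometry just described (window = 1/3 of the image, step = 2/9) can be sketched as a coordinate generator in pure Python. This is an illustrative reconstruction of the arithmetic only, not the module's actual code.

```python
# Generate sliding-window boxes with window size 1/3 of the image
# and step size 2/9 of the image, per axis. Illustrative only.

def sliding_windows(width, height):
    win_w, win_h = width // 3, height // 3
    step_w, step_h = 2 * width // 9, 2 * height // 9
    boxes = []
    y = 0
    while y + win_h <= height:  # keep every window fully inside the image
        x = 0
        while x + win_w <= width:
            boxes.append((x, y, x + win_w, y + win_h))
            x += step_w
        y += step_h
    return boxes

# On a 900x900 image: window 300, step 200, starts 0/200/400/600 per axis.
boxes = sliding_windows(900, 900)
print(len(boxes))  # 16
```

With these parameters, adjacent windows overlap by a third of the window size, so a small object near a window boundary still falls fully inside some window.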

Other image manipulation modules include zoom-in/crop (takes an image and a bounding box as input and returns the image patch inside the box) and overlay (takes two images and an alpha value as input and returns an overlaid image).

SKETCHPAD takes full advantage of these modules to dramatically improve the visual inference capability of multimodal language models. This provides a new approach to effectively solving complex visual tasks.

Here we experiment with multimodal language models on complex visual reasoning tasks, comparing performance with and without SKETCHPAD and against leading multimodal language models (Gemini, Claude 3, LLaVA-1.5, and LLaVA-NeXT). As shown in the table below, SKETCHPAD consistently improves the performance of the base model on all tasks; in particular, GPT-4o with SKETCHPAD achieves new state-of-the-art results across the board.

On VBench, SKETCHPAD outperforms SEAL, improving accuracy by 18.5% for GPT-4 Turbo and 14.3% for GPT-4o. On BLINK, SKETCHPAD improves absolute accuracy by an average of 6.6% for GPT-4 Turbo and 9.0% for GPT-4o.

Even though SKETCHPAD's modules operate on a single image, significant improvements are seen on multi-image tasks (jigsaw puzzles, visual correspondence, semantic correspondence, etc.). GPT-4o, with stronger multimodal capabilities than GPT-4 Turbo, benefited more from SKETCHPAD. Overall, SKETCHPAD proved to be an effective way to improve the performance of multimodal language models on visual reasoning tasks.

Summary

In this paper, we propose SKETCHPAD, a new framework that lets multimodal language models generate intermediate sketches. This framework significantly improves performance on complex mathematical reasoning tasks by visualizing auxiliary lines, mathematical functions, graphs, and games.

For visual reasoning tasks, we added vision experts to SKETCHPAD. The language model calls these experts during reasoning to visualize their predictions, for example bounding boxes from object detection models and masks from segmentation models. The language model then observes these predictions and performs further planning and reasoning.

Experimental results showed that SKETCHPAD achieves new state-of-the-art results, improving the performance of the language model on all tasks. By leveraging the complementary strengths of language and vision to tackle increasingly complex inference challenges, SKETCHPAD is expected to be an important step toward more human-like multimodal AI.

