Giving LLMs A Whiteboard To Write Down Their Reasoning Process Greatly Improves Their Visual Reasoning Ability!
3 main points
✔️ Proposed Whiteboard-of-Thought (WoT), a new prompting technique that elicits LLMs' visual reasoning capabilities
✔️ Experiments using ASCII art to compare with existing methods such as CoT
✔️ Experimental results confirm significant performance improvements from the use of WoT
Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
written by Sachit Menon, Richard Zemel, Carl Vondrick
(Submitted on 20 Jun 2024)
Comments: Project website: this http URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence(cs.AI); Computer Vision and Pattern Recognition(cs.CV)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Recently, Large Language Models (LLMs), such as ChatGPT, have been known to produce superior results in arithmetic and symbolic reasoning by expressing intermediate reasoning steps in text through Chain-of-Thought (CoT) prompting.
On the other hand, even with extensive multimodal pre-training, these models still fail on textual queries that humans easily resolve through visual reasoning, a problem that has long troubled researchers.
Against this background, this paper proposes Whiteboard-of-Thought, a simple prompting technique that elicits the visual reasoning ability of LLMs by providing them with a whiteboard on which to write down their reasoning steps as images, and demonstrates its effectiveness on visual reasoning benchmarks including ASCII art. This article introduces that paper.
Introduction
"Which lowercase letter is a circle with a vertical line touching it to the right going down?"
When you read this question, you probably first draw a circle in your mind, then add the line, and finally picture the letter "q".
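This mental picture can be made explicit in code. As a minimal sketch (not the paper's code), drawing the circle and the descending line with Matplotlib produces an image resembling a lowercase "q":

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# First the circle, then a vertical line touching its right side going down,
# mirroring the mental steps described above.
fig, ax = plt.subplots(figsize=(2, 3))
ax.add_patch(plt.Circle((0, 0), 1, fill=False, linewidth=3))
ax.plot([1, 1], [0, -2.5], linewidth=3, color="black")
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-3, 1.5)
ax.set_aspect("equal")
ax.axis("off")
fig.savefig("letter.png")  # the saved shape resembles a lowercase "q"
```

This is exactly the kind of intermediate image that the method described below asks the model to produce for itself.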
Humans excel at this type of visual reasoning, easily interweaving verbal and image-based reasoning to solve problems and communicate ideas.
On the other hand, the authors of this paper tested whether the latest LLM, GPT-4o (OpenAI et al., 2023), can solve this problem; the result is shown on the left in the figure below.
GPT-4o incorrectly states that the answer is "b," and this result indicates that in tasks involving visual and spatial reasoning, even problems that are very easy for humans are difficult for LLMs.
To solve this problem, this paper proposes a method that elicits human-like visual reasoning by giving Multimodal Large Language Models (MLLMs) the ability to create, and reason over, explicit visuals, such as a whiteboard representing their intermediate thoughts.
Whiteboard-of-Thought
The goal of this paper is to give MLLMs the ability to create images and process them visually in order to handle tasks involving visual reasoning, such as the one described above; for this purpose, we propose a new prompting technique, Whiteboard-of-Thought (WoT).
The procedure for this method is shown on the right in the figure below.
This method uses common Python libraries such as Matplotlib and Turtle to create the images (Visualization Image in the figure) used for visual reasoning.
Specifically, the MLLM is given the following prompt:

"You write code to create visualizations using the {Matplotlib/Turtle} library in Python, which the user will run and provide as images. Do NOT produce a final answer to the query until considering the visualization."
The LLM then generates an image using the visualization library and a Python interpreter, and the image is fed back to the model, whose inherent multimodal reasoning capabilities produce the final answer.
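This two-round loop can be sketched as follows. Here `call_mllm` stands in for a real MLLM API (e.g., a GPT-4o chat endpoint) and is an assumption of this sketch; a stub is included so the control flow can be followed end to end:

```python
import re

SYSTEM_PROMPT = (
    "You write code to create visualizations using the Matplotlib library "
    "in Python, which the user will run and provide as images. Do NOT "
    "produce a final answer to the query until considering the visualization."
)

def extract_code(response: str) -> str:
    """Pull the first fenced code block out of the model's reply."""
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

def whiteboard_of_thought(query: str, call_mllm, image_path: str) -> str:
    # 1. Ask the MLLM for visualization code only (no answer yet).
    code = extract_code(call_mllm(system=SYSTEM_PROMPT, user=query))
    # 2. Run the code; it is expected to save an image to image_path.
    exec(compile(code, "<whiteboard>", "exec"), {"IMAGE_PATH": image_path})
    # 3. Hand the image back to the model and ask for the final answer.
    return call_mllm(system=SYSTEM_PROMPT,
                     user="Here is the visualization you wrote. " + query,
                     image=image_path)

# Stub MLLM for illustration: returns drawing code on the first call,
# then an answer once it "sees" an image on the second call.
def fake_mllm(system, user, image=None):
    if image is None:
        return "```python\nopen(IMAGE_PATH, 'wb').write(b'fake png bytes')\n```"
    return "q"
```

With a real API client, `call_mllm` would send the system and user messages (plus the image on the second call) to the model; the stub simply returns canned responses.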
Experiments
To demonstrate the effectiveness of Whiteboard-of-Thought (WoT), this paper presents an experiment measuring recognition accuracy on information represented as text graphics (ASCII art), drawn from the large-scale benchmark BIG-Bench.
ASCII Understanding
ASCII art highlights the sophisticated visual reasoning that we humans process unconsciously: characters that have a natural linguistic meaning (e.g., "=" as a symbol) must instead be interpreted in a visual context, attending to their placement and spatial relationships (e.g., "==" as a horizontal line).
Humans handle this series of processes unconsciously, but as noted above, it is very difficult for existing MLLMs. Measuring recognition accuracy on this task therefore demonstrates the effectiveness of WoT.
To begin this experiment, Python code was prepared to generate the ASCII art used in the experiment.
The code is then executed to draw three types of ASCII art, MNIST digits, words, and kanji, as shown in the figure below.
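While the generation code appears only as a figure, its basic idea, rendering a pixel grid as characters, can be sketched in a few lines (the grid and glyph below are illustrative, not the paper's data):

```python
def grid_to_ascii(grid, on="#", off=" "):
    """Render a binary pixel grid (e.g., a thresholded MNIST digit)
    as ASCII art, one character per pixel."""
    return "\n".join("".join(on if px else off for px in row) for row in grid)

# A tiny hand-made grid roughly shaped like the digit 1 (illustrative only).
one = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
]
print(grid_to_ascii(one))
```

Recognizing the digit from this character layout, trivial for a human glancing at the printed output, is precisely the task the models are evaluated on.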
In addition to the proposed WoT method, we also prepared Direct, inference with an ordinary prompt, and CoT, chain-of-thought inference, and compared their recognition accuracy. (GPT-4o was used as the MLLM for all methods.)
The results of this experiment are shown in the figure below.
The results show that normal prompting and step-by-step reasoning have little effect on the ASCII art recognition task.
On the other hand, the proposed method, WoT, achieved significant performance improvements on all tasks.
This is presumably because WoT provides the MLLM with a pseudo-whiteboard, allowing the model to examine the visualized information itself and drawing out its latent visual reasoning ability; the result demonstrates the effectiveness of WoT.
Summary
In this article, we introduced Whiteboard-of-Thought, a simple prompting technique that elicits the visual reasoning ability of LLMs by providing them with a whiteboard on which to write out their reasoning steps as images, and described the experiments demonstrating its effectiveness on benchmarks measuring visual reasoning ability, including ASCII art.
The experiments conducted in this paper demonstrate the effectiveness of Whiteboard-of-Thought on multiple tasks requiring visual and spatial reasoning, and as the performance of MLLMs continues to improve, the performance of WoT is expected to improve as well.
The authors state, "As computer vision advances, our method will only grow more useful."
Those interested in the details of Whiteboard-of-Thought and the experiments described here can find them in the paper.