[Set-of-Mark Visual Prompting] Prompting Technology To Enhance GPT-4V's Image Recognition Capability

Prompting Method 18/01/2024

3 main points
✔️ Prompting technology to enhance GPT-4V's image recognition capabilities
✔️ Simply segment & mark input images in advance
✔️ Capture relationships between objects in an image

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
written by Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao
(Submitted on 17 Oct 2023 (v1), last revised 6 Nov 2023 (this version, v2))
Comments: Published on arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Research Outline

GPT-4V, which can handle images, recently became available in the paid version of ChatGPT. ChatGPT users can now include an image in a prompt and ask a question about that image.

However, GPT-4V on its own often fails to capture the relationships between objects in an image.

Set-of-Mark Visual Prompting (SoM) was therefore developed by a research team from Microsoft and other institutions to improve the ability to capture such relationships among objects. For example, in the figure below, the right side shows the output of GPT-4V with SoM, and the left side shows the output of GPT-4V without SoM.

In the figure on the right, the objects are separated by segmentation and each one is marked. The GPT-4V conversations also show that the answers with SoM are correct, while the answers without SoM are incorrect.

Let's look at how SoM is achieved in the next section.

Set-of-Mark Visual Prompting Overview

The SoM mechanism is quite simple. The process is as follows.

Segment the objects in the image with an image segmentation model
Place a mark on each segment
Input the marked image to GPT-4V
Enter the prompt text as usual on the GPT-4V side

In short, the image is divided into regions by segmentation and each region is marked, which makes it easier for GPT-4V to recognize the positional relationships between objects. That is all there is to it.

The key point of this research is that GPT-4V's image recognition capability can be improved simply by modifying the image that is input to GPT-4V.

The figure below gives an overview of SoM.

The elements of the figure are as follows.

Prompt text: the question at the bottom center, "Can you count how many fruits and what are the categories in the basket?"
Input image: the image of apples near the center
Output of ordinary GPT-4V: the rightmost output
Output of GPT-4V with SoM: the leftmost output

Here, the same prompt text is used whether or not SoM is applied.

If the image showing multiple apples is input directly to GPT-4V, GPT-4V outputs the wrong answer, as shown on the right side of the figure.

On the other hand, if the image is first run through SoM for segmentation and marking, and the marked image is then input to GPT-4V, GPT-4V outputs the correct answer, as shown on the left side of the figure.

Now that we have a general idea of SoM, let's look at the image segmentation methods used in SoM.

Image Segmentation Methods

To use SoM prompting properly, the input image must be segmented into meaningful regions. To this end, the following models are used in this study.

MaskDINO
SEEM
SAM
Semantic-SAM

As shown in the figure above, the segmented regions differ depending on the segmentation model, so the models need to be compared.

Marking Method

Once the image has been segmented into meaningful regions, a mark is overlaid on each region. Two points need to be considered here.

Mark Type
Mark Location

Various types of marks are considered, including letters, numbers, boxes, and mask borders. These marks must, of course, be easy for GPT-4V to recognize.
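To make the box and mask-border mark types concrete, here is a minimal sketch of how both could be derived from a single binary mask. It is not taken from the paper's code: the function names bounding_box and mask_border are hypothetical, the mask is assumed to be a non-empty list of rows of 0/1 values, and the border is defined with a 4-neighborhood as an assumption.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, of the nonzero pixels.

    Assumes the mask contains at least one nonzero pixel.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def mask_border(mask):
    """Return the region pixels that touch the background or the image edge."""
    h, w = len(mask), len(mask[0])
    border = set()
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            # 4-neighborhood check (an assumption of this sketch).
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                outside = not (0 <= ny < h and 0 <= nx < w)
                if outside or not mask[ny][nx]:
                    border.add((y, x))
                    break
    return border


# A 3x3 object inside a 5x5 image.
mask = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
        for y in range(5)]
box = bounding_box(mask)     # corners for a box mark
outline = mask_border(mask)  # pixels for a mask-border mark
```

A drawing library would then render the box or the outline on top of the image. That rendering step is left out here.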

Also important is how the marks are placed in each region. The basic approach is to place the mark at the center of each region, but this may cause the marks of different regions to overlap.
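A small example shows how center placement can go wrong. The sketch below is illustrative only: it takes "center" to mean the centroid (mean pixel coordinate), which is an assumption, and the apple and plate masks are made-up toy data. The plate is a ring around the apple, so both centroids fall on the same pixel and the plate's mark does not even sit on the plate.

```python
def centroid(mask):
    """Return the rounded mean (y, x) of the nonzero pixels of a non-empty mask."""
    pts = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    cy = sum(y for y, _ in pts) / len(pts)
    cx = sum(x for _, x in pts) / len(pts)
    return round(cy), round(cx)


# Toy data: a 3x3 "apple" in the middle of a 5x5 image,
# and a "plate" that covers every other pixel (a ring around the apple).
apple = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
         for y in range(5)]
plate = [[1 - apple[y][x] for x in range(5)] for y in range(5)]

apple_mark = centroid(apple)
plate_mark = centroid(plate)

# Both marks land on the same pixel ...
marks_collide = apple_mark == plate_mark
# ... and that pixel belongs to the apple, not to the plate.
plate_mark_on_plate = bool(plate[plate_mark[0]][plate_mark[1]])
```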

To avoid such overlaps, the following mark placement algorithm has been proposed.

The algorithm begins by calculating the area of each mask and sorting the masks in ascending order of area. It then determines the optimal mark position for each region.

Here, the Find_Center function is used to find the center of a given region r. DT(r) performs a distance transform on the region r and calculates how far each pixel is from the region boundary. Then arg max(D) finds the position of the maximum value from the distance map D obtained by the distance transform. This is the center c of the region.

This operation is performed for each region so that the marks do not overlap: the center of each region is determined, and a mark is placed there.

By pre-processing the image in this way, you can interact with GPT-4V as follows.
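The placement procedure described above can be sketched as follows. This is a reading of the description, not the authors' implementation, and it rests on several assumptions: the names distance_transform, find_center, and allocate_marks are hypothetical; the distance transform uses city-block distance computed by breadth-first search instead of a Euclidean transform, so that only the standard library is needed; the image edge is treated as a region boundary; and "no overlap" is interpreted as excluding pixels already covered by smaller masks before the center is searched.

```python
from collections import deque

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def distance_transform(region):
    """City-block distance of each region pixel to the nearest outside pixel.

    Pixels outside the region keep distance 0. The image edge counts as
    outside (an assumption of this sketch).
    """
    h, w = len(region), len(region[0])
    dist = [[0] * w for _ in range(h)]
    queue = deque()
    # Seed the search with region pixels that lie on the boundary.
    for y in range(h):
        for x in range(w):
            if not region[y][x]:
                continue
            for dy, dx in NEIGHBORS:
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w) or not region[ny][nx]:
                    dist[y][x] = 1
                    queue.append((y, x))
                    break
    # Grow inward one step at a time.
    while queue:
        y, x = queue.popleft()
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w
                    and region[ny][nx] and dist[ny][nx] == 0):
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist


def find_center(region):
    """Return the (y, x) with the largest distance, or None for an empty region."""
    dist = distance_transform(region)
    best, center = 0, None
    for y, row in enumerate(dist):
        for x, d in enumerate(row):
            if d > best:  # ties keep the first pixel in row-major order
                best, center = d, (y, x)
    return center


def allocate_marks(masks):
    """Return one mark position per mask, in the original mask order."""
    h, w = len(masks[0]), len(masks[0][0])
    # Smallest masks are placed first.
    order = sorted(range(len(masks)), key=lambda i: sum(map(sum, masks[i])))
    covered = [[False] * w for _ in range(h)]
    centers = [None] * len(masks)
    for i in order:
        # Only pixels not claimed by smaller masks remain candidates.
        visible = [[bool(masks[i][y][x]) and not covered[y][x]
                    for x in range(w)] for y in range(h)]
        centers[i] = find_center(visible)
        for y in range(h):
            for x in range(w):
                covered[y][x] = covered[y][x] or bool(masks[i][y][x])
    return centers


# Toy data: a plate mask covering a 7x7 image and an apple mask on top of it.
plate = [[1] * 7 for _ in range(7)]
apple = [[1 if 2 <= y <= 4 and 2 <= x <= 4 else 0 for x in range(7)]
         for y in range(7)]
centers = allocate_marks([plate, apple])
```

In this toy case the apple is handled first and gets its mark at its own center, while the plate's mark is pushed onto the part of the plate that the apple does not cover. With real images, a library routine for the Euclidean distance transform would normally replace the breadth-first search used here.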

Experiment

Experimental Details

The following benchmarks were used in the comparative experiments of this study.

The benefits of the proposed Set-of-Mark (SoM) prompting are also validated by comparing it with the default GPT-4V baseline.

In addition, LLaVA-1.5, a state-of-the-art open-source LMM, is evaluated quantitatively, and MiniGPT-v2 is compared qualitatively. These models are trained on large amounts of data from the targeted visual tasks. This study is the first to compare closed-source and open-source LMMs on visual benchmarks.

For each segmentation task, GPT-4V is also compared with various models, including MaskDINO and OpenSeeD.

Result

The results of the quantitative evaluation are as follows.

The results show that applying SoM to GPT-4V achieves the best performance.

Summary

This study suggests that applying Set-of-Mark (SoM) prompting to GPT-4V, in which symbolic marks are overlaid on specific regions of an image, can bring out GPT-4V's image recognition capabilities.

SoM is expected to facilitate research on multimodal prompting in future LMMs and to pave the way toward multimodal artificial general intelligence (AGI).