[Segment Anything] Zero-shot Segmentation Model

Segmentation 18/06/2024

3 main points
✔️ Builds a model that can perform zero-shot segmentation
✔️ Provides a large segmentation dataset of 11 million images and 1.1 billion masks
✔️ Experiments show strong zero-shot performance on new image distributions and a variety of tasks

Segment Anything
written by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick
(Submitted on 5 Apr 2023)
Comments: Project web-page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Meta's Segment Anything project addresses three components (task, model, and data) with the goal of building a foundation model for image segmentation.

The project provides a large segmentation dataset, SA-1B, consisting of 11 million images and 1.1 billion masks. The Segment Anything Model (SAM) is designed and trained to be promptable, which allows zero-shot transfer to new image distributions and tasks.

Experiments on the tasks described below evaluate this potential and show that SAM's zero-shot performance is strong.

Task

In NLP, foundation models are pre-trained on a token prediction task and then solve a variety of downstream tasks through prompt engineering. To build a foundation model for image segmentation with similar capabilities, the paper proposes the promptable segmentation task.

The goal of the promptable segmentation task is to return a valid segmentation mask when given a segmentation prompt. Prompts here refer to information about what to segment in the image, such as foreground and background point combinations, bounding boxes, segmentation masks, and free-form text.

A mask counts as valid here if, even when the given prompt is ambiguous and could refer to multiple objects (for example the full body, upper body, or head in Figure 1), the output is a reasonable mask for at least one of those objects.

Figure 1 Mask output by SAM from a single point

A model that can accommodate a wide range of existing or new segmentation tasks through prompts differs substantially from previous multi-task systems. A multi-task system performs a fixed set of tasks, such as semantic, instance, and panoptic segmentation, and its training and test tasks are the same. A model trained for promptable segmentation, in contrast, can act as one component of a larger system and perform new, different tasks at inference time.

Model

As a model for promptable segmentation, we propose the Segment Anything Model (SAM), which has three components: an image encoder, a prompt encoder, and a mask decoder, as shown in Figure 2.

Figure 2 Overview of Segment Anything Model

The first component, the image encoder, uses a ViT pre-trained with MAE (Masked Autoencoders). The image encoder runs once per image and can be applied before the model is prompted.
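The "encode once, prompt many times" pattern can be sketched as follows. This is a minimal illustration with stand-in functions, not SAM's actual code: the class name, method names, and toy encoder and decoder are all hypothetical.

```python
class PromptableSegmenter:
    """Toy wrapper: encode the image once, then answer many prompts."""

    def __init__(self, image_encoder, mask_decoder):
        self.image_encoder = image_encoder  # expensive, runs once per image
        self.mask_decoder = mask_decoder    # cheap, runs once per prompt
        self._embedding = None

    def set_image(self, image):
        # Cache the image embedding so later prompts can reuse it.
        self._embedding = self.image_encoder(image)

    def predict(self, prompt):
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        return self.mask_decoder(self._embedding, prompt)


# Stand-ins for the real networks (hypothetical, for illustration only).
encoder_calls = []

def toy_image_encoder(image):
    encoder_calls.append(image)
    return [sum(row) for row in image]

def toy_mask_decoder(embedding, prompt):
    return (tuple(embedding), prompt)


segmenter = PromptableSegmenter(toy_image_encoder, toy_mask_decoder)
segmenter.set_image([[1, 2], [3, 4]])
results = [segmenter.predict(p) for p in [(0, 0), (1, 1), (0, 1)]]
```

Three prompts are answered here while the encoder runs only once, which is what makes interactive prompting cheap.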

The prompt encoder distinguishes sparse prompts (points, bounding boxes, and text) from dense prompts (segmentation masks). Points and boxes are represented by positional encodings, and text is encoded using CLIP's text encoder. Masks are embedded using convolutions and summed element-wise with the image embedding.
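To give a feel for how a point can be turned into a vector, here is a rough sketch of a sinusoidal positional encoding plus a per-type offset. The frequency scheme, the scalar type embeddings, and the function names are assumptions made for illustration and may differ from what SAM actually uses.

```python
import math

def encode_point(x, y, width, height, num_freqs=4):
    """Map a pixel coordinate to a sinusoidal positional encoding."""
    # Normalize to [0, 1] so the encoding does not depend on image size.
    u, v = x / width, y / height
    features = []
    for k in range(num_freqs):
        freq = (2 ** k) * math.pi  # assumed power-of-two frequencies
        for coord in (u, v):
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features

# Hypothetical stand-ins for learned per-type embeddings
# (a real model would learn a full vector per prompt type).
TYPE_EMBEDDING = {"foreground": 0.5, "background": -0.5}

def encode_point_prompt(x, y, label, width, height, num_freqs=4):
    """Positional encoding of the point plus an offset for its type."""
    position = encode_point(x, y, width, height, num_freqs)
    offset = TYPE_EMBEDDING[label]
    return [p + offset for p in position]

foreground = encode_point_prompt(10, 20, "foreground", 100, 100)
background = encode_point_prompt(10, 20, "background", 100, 100)
```

The same location yields different vectors for a foreground click and a background click, so the decoder can tell the two apart.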

The mask decoder generates masks from the image embedding, the prompt embeddings, and an output token. It uses a modified Transformer decoder block that updates all embeddings through prompt self-attention and cross-attention in two directions: from prompt to image embedding and vice versa. After upsampling the image embedding, an MLP (Multilayer Perceptron) maps the output token to a dynamic linear classifier, which computes the foreground probability of the mask at each image location.
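The final step, where a per-prompt linear classifier is applied at every image location, can be sketched in plain Python. The shapes, the 0.5 threshold, and the function names are illustrative assumptions; the attention layers and the MLP that would produce the classifier weights are omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_mask(embedding, classifier_weights, threshold=0.5):
    """Apply one linear classifier to every pixel of an H x W x C embedding.

    classifier_weights stands in for the vector that the MLP would
    derive from the output token (hypothetical values in this sketch).
    """
    probs = [
        [sigmoid(sum(w * f for w, f in zip(classifier_weights, pixel)))
         for pixel in row]
        for row in embedding
    ]
    mask = [[p > threshold for p in row] for row in probs]
    return probs, mask

# A 2 x 2 "upsampled embedding" with 2 channels per pixel.
embedding = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[0.0, 0.0], [3.0, 1.0]],
]
probs, mask = predict_mask(embedding, classifier_weights=[1.0, -1.0])
```

Because the classifier weights depend on the prompt, the same image embedding can yield a different mask for every prompt.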

If the model produced only a single output, an ambiguous prompt would cause it to average multiple valid masks. To avoid this, the model predicts multiple masks for a single prompt. In most cases three mask outputs are sufficient: whole, part, and subpart (see Figure 1).
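A toy example shows why averaging is a problem. Below, three equally valid one-dimensional "masks" for the same click are averaged, and the result matches none of them. The mask values and the confidence scores are invented for illustration.

```python
# Three valid answers for one ambiguous click (1 = foreground).
whole   = [1, 1, 1, 1, 1, 1]
part    = [0, 1, 1, 1, 1, 0]
subpart = [0, 0, 1, 1, 0, 0]
candidates = [whole, part, subpart]

def average_masks(masks):
    """What a single-output model would tend toward: a blurry mean."""
    n = len(masks)
    return [sum(values) / n for values in zip(*masks)]

def pick_best(masks, scores):
    """With several outputs, one mask can be chosen by a confidence score."""
    best = max(range(len(masks)), key=lambda i: scores[i])
    return masks[best]

averaged = average_masks(candidates)
# Hypothetical per-mask confidence scores.
chosen = pick_best(candidates, scores=[0.71, 0.93, 0.64])
```

The averaged result contains fractional values such as 1/3 and 2/3 and is not a valid mask of any object, whereas emitting the three candidates separately keeps each one intact.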

Dataset

Since segmentation masks are not abundant on the Internet, a data engine is built for data collection. The data engine annotates segmentation masks on images in three stages: a manual stage, a semi-automatic stage, and a fully automatic stage. The SA-1B dataset consists of 11 million images and 1.1 billion segmentation masks produced in the fully automatic stage.

In the first, manual stage, professional annotators use a browser-based interactive segmentation tool powered by SAM to label masks by clicking foreground and background object points. They also use pixel-precise brush and eraser tools to refine the masks.

At the beginning of this stage, SAM was trained on common public segmentation datasets. After sufficient annotation, the model was retrained on the newly annotated data to improve it.

The next, semi-automatic stage aims to increase the diversity of masks in order to improve the model's ability to segment anything. First, masks are detected automatically, and then the annotators additionally annotate the objects that were not detected. As in the first stage, the model is periodically retrained.

In the last stage, annotation is fully automatic. This is possible because enough masks had been collected in the previous stages and because the model had become able to predict valid masks even in ambiguous cases. The model is prompted with a 32x32 grid of points (1,024 points) and predicts, for each point, a set of masks that may correspond to valid objects. If a point lies on a part or subpart, the model outputs three masks: whole, part, and subpart. The model's IoU prediction module is used to select masks that are confident and stable. Finally, NMS (non-maximum suppression) is used to remove duplicates.
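The grid prompting, confidence filtering, and duplicate removal can be sketched as below. This is a simplified illustration, not the project's pipeline: the threshold values, the dictionary layout of a candidate, and the use of bounding boxes for NMS are assumptions, and the stability check is left out.

```python
def build_point_grid(points_per_side):
    """Evenly spaced points in [0, 1] x [0, 1], one per grid cell center."""
    step = 1.0 / points_per_side
    coords = [(i + 0.5) * step for i in range(points_per_side)]
    return [(x, y) for y in coords for x in coords]

def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_masks(candidates, min_predicted_iou=0.88, nms_iou=0.7):
    """Keep confident candidates, then greedily suppress near-duplicates."""
    confident = [c for c in candidates if c["predicted_iou"] >= min_predicted_iou]
    confident.sort(key=lambda c: c["predicted_iou"], reverse=True)
    kept = []
    for cand in confident:
        if all(box_iou(cand["box"], k["box"]) <= nms_iou for k in kept):
            kept.append(cand)
    return kept

grid = build_point_grid(32)  # 32 x 32 = 1,024 prompt points

# Hypothetical candidates, as if produced by prompting the model.
candidates = [
    {"name": "a", "box": (0, 0, 10, 10), "predicted_iou": 0.95},
    {"name": "b", "box": (0, 1, 10, 10), "predicted_iou": 0.90},  # near-duplicate of a
    {"name": "c", "box": (20, 20, 30, 30), "predicted_iou": 0.92},
    {"name": "d", "box": (40, 40, 50, 50), "predicted_iou": 0.50},  # low confidence
]
kept = select_masks(candidates)
```

In this toy run, candidate d is dropped for low confidence and candidate b is suppressed as a near-duplicate of a, leaving a and c.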

The images and masks used for each stage are shown in the table below.

Table 1 Number of images and number of annotated masks for each stage

Stage	Number of images	Number of masks
Manual stage	120,000	4.3 million
Semi-automatic stage	180,000	5.9 million
Fully automatic stage	11 million	1.1 billion
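As a quick sanity check on scale, the ratio of masks to images can be computed from the table. The inputs are the rounded figures above, so the results are rough estimates rather than exact per-image statistics.

```python
# (number of images, number of masks) per stage, taken from Table 1.
stages = {
    "manual": (120_000, 4_300_000),
    "semi-automatic": (180_000, 5_900_000),
    "fully automatic": (11_000_000, 1_100_000_000),
}

masks_per_image = {
    name: masks / images for name, (images, masks) in stages.items()
}

for name, ratio in masks_per_image.items():
    print(f"{name}: about {ratio:.1f} masks per image")
```

The fully automatic stage works out to roughly 100 masks per image, which illustrates how densely each image is annotated.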

Experiment

The paper evaluates SAM on five different tasks.

Evaluation of masks inferred from a single point

This experiment evaluates object segmentation from a single foreground point on 23 datasets with diverse images, and compares the results with RITM. Because the ground truth of most datasets does not enumerate all possible masks, the standard mIoU metric (the mean of all IoUs between predicted and ground-truth masks) is supplemented by a human study in which annotators rate mask quality. In terms of mIoU, SAM outperforms RITM on 16 of the 23 datasets.
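For reference, IoU and mIoU on binary masks can be computed as in the following minimal sketch. Masks are represented as nested lists of 0 and 1, and the convention of returning 1.0 when both masks are empty is an assumption of this example.

```python
def mask_iou(pred, gt):
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks are treated as a perfect match (assumed convention).
    return inter / union if union else 1.0

def mean_iou(pairs):
    """Mean IoU over (predicted mask, ground-truth mask) pairs."""
    return sum(mask_iou(pred, gt) for pred, gt in pairs) / len(pairs)

pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # IoU = 1 / 2
    ([[0, 1], [0, 1]], [[0, 1], [0, 1]]),  # IoU = 1
]
score = mean_iou(pairs)
```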

Figure 3 Comparison of SAM and RITM on 23 different data sets

Edge detection

Using the BSDS500 dataset, SAM is evaluated on the edge detection task. SAM is prompted with a 16x16 grid of foreground points, which yields 768 predicted masks (3 per point), and redundant masks are removed by NMS. Edge maps are then computed by applying a Sobel filter to the unthresholded mask probability maps, followed by edge NMS. Figure 4 shows that even though SAM is not trained for edge detection, it is able to detect edges, including some that are not in the ground truth.
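The Sobel step can be sketched in plain Python as follows. This computes the gradient magnitude of a single probability map and leaves the border at zero; combining many masks and the edge NMS post-processing are omitted, and the function name is hypothetical.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(prob_map):
    """Gradient magnitude of a 2D map; border pixels are left at 0."""
    height, width = len(prob_map), len(prob_map[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    value = prob_map[y + dy][x + dx]
                    gx += SOBEL_X[dy + 1][dx + 1] * value
                    gy += SOBEL_Y[dy + 1][dx + 1] * value
            out[y][x] = math.hypot(gx, gy)
    return out

# A mask probability map with a vertical boundary between columns 1 and 2.
step_map = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edges = sobel_magnitude(step_map)
```

The response is largest along the mask boundary and zero in flat regions, which is why mask probability maps can be turned into edge maps without any edge-specific training.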

Figure 4 Results of Zero-Shot Edge Detection (Input Image, Ground Truth, SAM)

Object proposal

SAM is also evaluated on object proposal generation, which plays an important role in object detection. A slightly modified version of the automatic mask generation pipeline is run, and its masks are output as proposals. Average recall is computed on the LVIS v1 dataset and compared with ViTDet-H in Table 2. Unlike SAM, ViTDet-H is trained on LVIS and therefore has higher overall performance. Even so, SAM outperforms ViTDet-H on medium and large objects, as well as on rare and common objects.
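Average recall can be sketched as follows, assuming the COCO-style convention of averaging recall over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The input is, for each ground-truth object, the best IoU reached by any proposal; the example values are invented.

```python
def average_recall(best_ious, thresholds=None):
    """Mean recall over IoU thresholds.

    best_ious[i] is the highest IoU between ground-truth object i
    and any proposal (0.0 if nothing overlaps it).
    """
    if thresholds is None:
        # 0.50, 0.55, ..., 0.95 (assumed COCO-style thresholds).
        thresholds = [t / 100 for t in range(50, 100, 5)]
    recalls = []
    for threshold in thresholds:
        hits = sum(1 for iou in best_ious if iou >= threshold)
        recalls.append(hits / len(best_ious))
    return sum(recalls) / len(recalls)

# Three ground-truth objects: one matched perfectly, one loosely, one missed.
ar = average_recall([1.0, 0.62, 0.3])
```

A proposal only counts as a hit at the thresholds its IoU clears, so loosely localized proposals contribute to recall at low thresholds but not at high ones.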

Table 2 Results of object proposals for LVIS dataset v1

Instance Segmentation

SAM performs instance segmentation by taking the bounding boxes produced by an object detector (ViTDet) as prompts. The COCO and LVIS datasets are used, and SAM's masks are compared with the masks predicted by ViTDet. Table 3 shows that SAM's mask AP is below that of ViTDet. However, as shown in Figure 5, SAM tends to produce higher-quality masks with cleaner boundaries than ViTDet. Figure 6 also shows that SAM consistently outperforms ViTDet in human evaluation.

Table 3 Instance Segmentation Results

Figure 5 Instance segmentation results (Ground Truth, ViTDet, SAM)

Figure 6 Evaluation of masks by humans

Mask generation from text

The last task is segmenting objects from free-form text. The same SAM is used in all the experiments up to this point, but for this task SAM's training is modified so that it becomes text-aware without requiring new text annotations. For each mask with an area larger than 100x100 pixels, the CLIP image embedding is extracted. During training, this CLIP image embedding is given to SAM as its first prompt. Because CLIP's image embeddings are trained to align with its text embeddings, SAM can be trained with image embeddings and use text embeddings for inference. That is, at inference time the text is run through CLIP's text encoder, and the resulting embedding is given to SAM as a prompt.
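The idea of swapping image embeddings for text embeddings relies on the two living in a shared space. The toy sketch below uses invented 3-dimensional vectors in place of real CLIP embeddings, and `make_prompt` is a hypothetical helper, to show that an aligned text embedding lands close to the image embedding of the matching region.

```python
import math

def l2_normalize(vector):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def make_prompt(embedding):
    """Turn an embedding (image or text) into a unit-length prompt vector."""
    return l2_normalize(embedding)

# Invented stand-ins for CLIP embeddings (not real model outputs).
image_embedding_wheel_crop = [0.9, 0.1, 0.2]  # used as the prompt in training
text_embedding_wheel = [0.8, 0.2, 0.1]        # used as the prompt in inference
text_embedding_cat = [0.1, 0.9, 0.3]

train_prompt = make_prompt(image_embedding_wheel_crop)
test_prompt = make_prompt(text_embedding_wheel)

same_concept = cosine_similarity(image_embedding_wheel_crop, text_embedding_wheel)
other_concept = cosine_similarity(image_embedding_wheel_crop, text_embedding_cat)
```

When the text prompt at inference is close to the image prompt seen in training, the decoder receives a similar input, which is why no text annotations are needed during training.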

The results in Figure 7 show that segmentation is possible from simple text such as "a wheel" and from phrases such as "beaver tooth grille." Even when SAM fails to segment correctly from the text prompt alone, an additional point prompt allows it to correct its prediction.

Figure 7 Result of mask generation from text

Summary

The main contributions of the Segment Anything project are a new task (promptable segmentation), a new model (SAM), and a new dataset (SA-1B) that together enable a leap forward for image segmentation. As a limitation, although SAM can perform many tasks zero-shot, it is not clear how to design simple prompts that implement semantic and panoptic segmentation.

The paper states that whether SAM attains the status of a foundation model will depend on how it is used in the community. Since its publication it has already been incorporated into various tools, and we believe this trend will continue.