
Mask R-CNN: Efficient Detection Of Objects In Images


Computer Vision

3 main points
✔️ The authors propose Mask R-CNN, a multi-task learning model for object detection and instance segmentation.
✔️ The model predicts bounding boxes, segmentation masks, and keypoints simultaneously with high accuracy and outperforms other methods on the COCO dataset.
✔️ The flexibility of Mask R-CNN brings significant progress in detection and segmentation tasks while allowing fast and effective training.

Mask R-CNN
written by Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick
(Submitted on 20 Mar 2017 (v1), last revised 24 Jan 2018 (this version, v3))
Comments: open source; appendix on more results
Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Mask R-CNN extends Faster R-CNN with a framework for simultaneous object detection and high-quality segmentation. Easy to train, fast to run, and applicable to other tasks, it shows top results on different tracks of the COCO suite and outperforms other models even without extra features. As a simple and effective approach, it can serve as a foundation for future research.


Mask R-CNN is an effective and flexible framework for simultaneous object detection and high-quality instance segmentation. It extends Faster R-CNN by adding a mask prediction branch and introduces a new layer, RoIAlign, that provides precise spatial alignment. Mask R-CNN is simple and fast to run, outperforms earlier models on the COCO instance segmentation task, and also shows superior performance in object detection. Its flexibility and accuracy make it a promising framework for future research and for extension to more complex tasks.

Related Research

Region-based CNN (R-CNN) approaches, such as Faster R-CNN, have evolved to provide flexible and robust performance in object detection. In instance segmentation, earlier methods tended to be slow and less accurate, whereas Mask R-CNN predicts masks and class labels in parallel, achieving high performance with a simple and flexible method. Unlike segmentation-first methods, which start from per-pixel classification and then try to separate pixels of the same category into instances, Mask R-CNN follows an instance-first strategy. The authors expect that a deeper combination of the two strategies will be studied in the future.

Masking the R-CNN

Mask R-CNN adds an object mask output branch to Faster R-CNN, predicting an object mask alongside the class label and bounding box offset for each candidate object. The idea is conceptually simple, but the mask output requires a much finer spatial layout of the object than the class and box outputs do. Providing it depends on pixel-to-pixel alignment, which is the main piece missing from Fast/Faster R-CNN.

Faster R-CNN

Faster R-CNN consists of two stages. In the first stage, the Region Proposal Network (RPN) proposes bounding boxes for candidate objects. In the second stage, which is essentially Fast R-CNN, features are extracted from each candidate box and used for classification and bounding box regression. The two stages can share features to speed up inference.

Mask R-CNN

Mask R-CNN uses the same first stage (RPN) as Faster R-CNN. In the second stage, it outputs a binary mask for each RoI in parallel with the class and box offset predictions. Unlike systems in which classification depends on the mask prediction, Mask R-CNN decouples the two: the mask branch predicts one binary mask per class, so masks for different classes do not compete during training, and the class branch decides which mask is used. The authors identify this decoupling as key to good instance segmentation results.
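As a compact summary of the second-stage training objective, the per-RoI loss in the paper is the sum of a classification loss, a box regression loss, and a mask loss, where the mask loss is the average per-pixel binary cross-entropy computed only on the mask of the ground-truth class. The notation below is chosen here for illustration and is not taken from the paper: z^k denotes the m×m logits of the k-th mask, y the binary ground-truth mask, k the ground-truth class, and σ the sigmoid.

```latex
L = L_{cls} + L_{box} + L_{mask},
\qquad
L_{mask} = -\frac{1}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m}
  \Bigl[\, y_{ij} \log \sigma\bigl(z^{k}_{ij}\bigr)
        + (1 - y_{ij}) \log\bigl(1 - \sigma\bigl(z^{k}_{ij}\bigr)\bigr) \Bigr]
```

Because only the k-th mask enters the loss, the outputs for the other classes receive no gradient from this RoI, which is what keeps the classes from competing.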

Mask representation

A mask encodes the spatial layout of an object. Unlike class labels or box offsets, which fully connected (fc) layers collapse into short output vectors, masks are predicted with convolutions, which preserve pixel-to-pixel correspondence and capture spatial structure naturally. Compared to fc-based mask prediction, this fully convolutional representation needs fewer parameters and is more accurate. Because this pixel-to-pixel behavior requires RoI features that are well aligned with the input, the authors introduce the RoIAlign layer.


RoIPool extracts a small feature map from each RoI, quantizing the RoI coordinates and the bin boundaries in the process. This quantization misaligns the extracted features with the RoI, which harms per-pixel mask prediction. The proposed RoIAlign layer avoids quantization and uses bilinear interpolation to compute feature values at exact sampling locations, allowing for more precise mask prediction. RoIAlign provides a significant improvement over RoIPool.
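The idea of sampling without quantization can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes a single-channel feature map, an RoI already expressed in feature-map coordinates as floats, pixel centers at integer coordinates, and a 2×2 grid of sample points per bin that are averaged. The function names `bilinear` and `roi_align` and their parameters are hypothetical.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at a continuous point (y, x)."""
    h, w = feat.shape
    # Clamp to the valid range so border samples stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0, x0] * (1 - dx) + feat[y0, x1] * dx
    bottom = feat[y1, x0] * (1 - dx) + feat[y1, x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feat, roi, out_size=2, samples=2):
    """Pool an RoI (y1, x1, y2, x2) into out_size x out_size bins.

    The RoI coordinates are never rounded: each bin is covered by a
    samples x samples grid of points whose values are interpolated and averaged.
    """
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, y, x))
            out[i, j] = np.mean(vals)
    return out
```

A RoIPool-style layer would instead round the RoI corners and bin edges to integers before pooling, so a fractional RoI such as (1.2, 1.5, 5.2, 5.5) would be shifted by up to half a feature-map cell.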

Network architecture

The authors instantiate Mask R-CNN with multiple architectures, distinguishing between the convolutional backbone, which extracts features over the whole image, and the network head, which is applied to each RoI. In addition to ResNet and ResNeXt backbones of depth 50 or 101, a Feature Pyramid Network (FPN) is also used; FPN backbones improve both accuracy and speed. The network head extends the existing box heads with a fully convolutional mask prediction branch. The head on the ResNet-C4 backbone includes the compute-intensive fifth stage of ResNet, whereas with FPN the backbone already includes that stage, which allows a more efficient head.

・Implementation Details

The authors set hyperparameters following existing Fast/Faster R-CNN work and train with image-centric training, sampling RoIs with a 1:3 ratio of positives to negatives. During training, an RoI is considered positive if its IoU with a ground-truth box is at least 0.5, and the mask loss is defined only on positive RoIs. During inference, the mask branch is applied only to the highest-scoring detection boxes after non-maximum suppression, which speeds up inference and improves accuracy.
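The positive/negative labeling rule can be illustrated with a short sketch. It assumes boxes in (x1, y1, x2, y2) format and treats an RoI as positive when its best overlap with any ground-truth box reaches the 0.5 threshold. The helper names `box_iou` and `label_rois` are hypothetical and do not come from any particular library.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_rois(rois, gt_boxes, thresh=0.5):
    """Return one boolean per RoI: True if it counts as a positive example."""
    return [
        max((box_iou(r, g) for g in gt_boxes), default=0.0) >= thresh
        for r in rois
    ]
```

Only the RoIs flagged True would contribute to the mask loss; the others still contribute to the classification loss as background examples.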

Instance Segmentation

Mask R-CNN is compared thoroughly with other state-of-the-art methods on the COCO dataset, using the standard COCO metrics (AP, AP50, AP75, APS, APM, and APL). Models are trained on the union of the 80,000 training images and a 35,000-image subset of the validation images (trainval35k), and ablations are reported on the remaining 5,000 validation images (minival). Results are also reported on test-dev.
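For instance segmentation, these AP metrics are computed from the overlap between predicted and ground-truth masks: AP50 and AP75 count a prediction as correct when the mask IoU is at least 0.5 or 0.75, and AP averages over a range of IoU thresholds. The overlap itself is simple to compute; the sketch below assumes two binary masks of the same shape, and the function name `mask_iou` is hypothetical.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0
```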

Main Results

In Table 1, Mask R-CNN is compared to state-of-the-art methods in instance segmentation, showing that Mask R-CNN with the ResNet-101-FPN backbone outperforms other models. An example of visual results is also shown, highlighting that Mask R-CNN performs better under challenging conditions and has fewer artifacts than other methods.

・Ablation Experiments

A number of ablations are performed to analyze Mask R-CNN. The results are shown in Table 2 and discussed in more detail below.


Table 2(a) compares Mask R-CNN with different backbones. Deeper networks and advanced designs (such as FPN and ResNeXt) contribute to performance gains, although the authors point out that not all frameworks automatically benefit from these elements.

Multinomial and independent masks

Mask R-CNN decouples mask and class prediction: because the existing box branch predicts the class label, the mask branch can generate a mask for each class without competition among classes, using a per-pixel sigmoid and a binary loss. Table 2(b) compares this with the alternative of a per-pixel softmax and a multinomial loss, which couples the mask and class prediction tasks and reduces mask AP. This suggests that once the instance has been classified as a whole by the box branch, it is sufficient to predict a binary mask without concern for the categories, which makes the model easier to train.
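The difference between the two losses can be made concrete with a small NumPy sketch. It is illustrative only: the layout of the logits as (K+1, m, m) with channel 0 as background, and the way the softmax target is built from the binary mask, are assumptions made here rather than details given in the article. Both function names are hypothetical.

```python
import numpy as np

def sigmoid_binary_loss(logits, gt_mask, gt_class):
    """Per-pixel sigmoid + binary cross-entropy on the ground-truth class only."""
    z = logits[gt_class]
    # Numerically stable form of BCE with logits.
    per_pixel = np.maximum(z, 0) - z * gt_mask + np.log1p(np.exp(-np.abs(z)))
    return per_pixel.mean()

def softmax_multinomial_loss(logits, gt_mask, gt_class):
    """Per-pixel softmax across all channels + multinomial cross-entropy."""
    # Assumed target: the class index inside the object, background (0) elsewhere.
    target = np.where(gt_mask > 0, gt_class, 0)
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    picked = np.take_along_axis(log_probs, target[None], axis=0)[0]
    return -picked.mean()
```

With the sigmoid loss, changing the logits of an unrelated class leaves the loss for this RoI untouched. With the softmax loss, the same change alters the loss because every class competes for probability mass at each pixel.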

Class-specific and class-agnostic masks

The default instantiation predicts class-specific masks, i.e., one m×m mask per class. Interestingly, Mask R-CNN with class-agnostic masks (i.e., predicting a single m×m output regardless of class) is nearly as effective: it reaches 29.7 mask AP, compared to 30.3 for the class-specific counterpart on ResNet-50-C4. This further highlights the division of labor in the approach, which largely decouples classification and segmentation.


Table 2(c) evaluates the authors' proposed RoIAlign layer. In this experiment, using a ResNet-50-C4 backbone with stride 16, RoIAlign improves AP by about 3 points over RoIPool, with much of the gain coming at high IoU (AP75). RoIAlign is insensitive to the choice of max or average pooling. It also compares favorably to RoIWarp, which likewise uses bilinear sampling but still quantizes the RoI. RoIAlign is additionally evaluated with a ResNet-50-C5 backbone, which has an even larger stride of 32. There it improves mask AP significantly, which suggests that RoIAlign largely resolves the long-standing difficulty of using large-stride features for detection and segmentation.

Finally, RoIAlign shows a 1.5-point mask AP and 0.5-point box AP improvement when combined with FPNs, taking advantage of finer multi-level strides. Especially when fine alignment is required, as in keypoint detection, RoIAlign shows significant accuracy gains even when using FPN (Table 6).

Mask branch

Because segmentation is a pixel-to-pixel task, the mask branch is designed as a fully convolutional network (FCN). Table 2(e) compares it with a multi-layer perceptron (MLP) head on a ResNet-50-FPN backbone: the FCN yields a mask AP improvement of 2.1 points over the MLP. This backbone was chosen so that the conv layers of the FCN head are not pre-trained, for a fair comparison with the MLP.

・Bounding box detection results

Table 3 compares Mask R-CNN with state-of-the-art COCO bounding-box object detectors. Mask R-CNN with ResNet-101-FPN outperforms the base variants of previous state-of-the-art models, and with ResNeXt-101-FPN it achieves a 3.0-point box AP improvement over the previous best single-model entry. As a further comparison, a version of Mask R-CNN trained without the mask branch (Faster R-CNN with RoIAlign) performs better than the corresponding model without RoIAlign, yet still falls slightly short of the full Mask R-CNN in box AP, a gap attributed to the benefit of multi-task training. Finally, the gap between Mask R-CNN's mask AP and box AP is small, which suggests that the authors' approach largely closes the gap between object detection and the more challenging instance segmentation task.


For timing, the authors train a ResNet-101-FPN model that shares features between the RPN and Mask R-CNN stages, following the 4-step training of Faster R-CNN. This model runs at about 195 ms per image on an Nvidia Tesla M40 GPU. Training is also fast enough for rapid prototyping: in the authors' 8-GPU implementation, ResNet-50-FPN takes 32 hours on COCO trainval35k and ResNet-101-FPN takes 44 hours. The authors hope that this rapid training will lower the barrier to research in this area and encourage more people to work on it.

Mask R-CNN for human pose estimation

The authors' framework can also be extended to human pose estimation. A keypoint's location is modeled as a one-hot mask, and Mask R-CNN is used to predict one such mask for each keypoint type. The experiment requires minimal domain knowledge of human pose and mainly serves to demonstrate the flexibility of the Mask R-CNN framework. For each keypoint, the training target is a one-hot m×m binary mask in which a single pixel is labeled as foreground, and training minimizes the cross-entropy loss over an m²-way softmax output. The models are trained on COCO trainval35k images that contain annotated keypoints, and inference uses a single image scale of 800 pixels.
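The keypoint loss described above can be sketched as a softmax over all m² locations of one keypoint's output map. This is a minimal illustration that assumes the ground-truth keypoint has already been mapped to an integer cell of the m×m grid. The function name `keypoint_loss` and its arguments are hypothetical.

```python
import numpy as np

def keypoint_loss(logits, kp_y, kp_x):
    """Cross-entropy over an m*m-way softmax for a single keypoint.

    logits: (m, m) output map for one keypoint type.
    (kp_y, kp_x): grid cell of the ground-truth keypoint, i.e. the single
    foreground pixel of the one-hot target mask.
    """
    flat = logits.reshape(-1)
    shifted = flat - flat.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[kp_y * logits.shape[1] + kp_x]
```

Unlike the per-pixel sigmoid used for object masks, the softmax here makes all locations compete, which encourages the network to commit to a single point for each keypoint.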

Main results and ablation

The authors evaluate person keypoint AP (APkp) with a ResNet-50-FPN backbone. Their method achieves an APkp of 62.7, which is 0.9 points higher than the COCO 2016 keypoint detection winner, while being simpler and faster. Furthermore, a unified model can predict boxes, segments, and keypoints simultaneously, and adding the segment branch improves APkp to 63.1.

Adding the mask branch to the box-only or keypoint-only version consistently improves these tasks, but adding the keypoint branch slightly reduces box/mask AP. This suggests that while keypoint detection benefits from multi-task training, it does not in turn help the other tasks. Nevertheless, learning all three tasks jointly enables a unified system that efficiently predicts all outputs simultaneously.

The authors also investigate the effect of RoIAlign on keypoint detection (Table 6).

Although the ResNet-50-FPN backbone has finer strides, RoIAlign still shows a significant improvement over RoIPool, increasing APkp by 4.4 points. This indicates that keypoint detection is especially sensitive to localization accuracy. The authors expect Mask R-CNN to be an effective framework for extracting object bounding boxes, masks, and keypoints.


This article covered Mask R-CNN, a model that has attracted considerable attention in the field of object detection. Mask R-CNN performs pixel-level segmentation in addition to object localization, making it suitable for complex tasks. The method performs detection, segmentation, and keypoint estimation simultaneously and performs well across a wide variety of applications. It is an example of a model that has contributed to the evolution of object recognition and segmentation and has performed well in real-world computer vision tasks.

Mask R-CNN has successfully integrated object detection and segmentation and has demonstrated excellent performance in a wide variety of applications. Future work is expected to improve model efficiency and move toward real-time processing. At the same time, it is important to develop models that are robust under domain adaptation and that can be trained with fewer labels. Advances in object detection technology are expected to lead to practical applications in a variety of domains, including self-driving cars, medical image analysis, and environmental monitoring.
