SpineNet, An AI-discovered Backbone Model With Outstanding Detection Accuracy
3 main points
✔️ Discovery of the ideal architecture by using NAS
✔️ 2.9% better average precision (AP) than ResNet-50-FPN on object detection tasks
✔️ SpineNet is not tied to one task: it generalizes across detection, classification, and segmentation
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization
written by Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song
(Submitted on 10 Dec 2019 (v1), last revised 17 Jun 2020 (this version, v3))
Comments: Accepted at CVPR2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Introduction
Many people have been surprised by the rapid development of deep learning in recent years. If one technology has supported this development, it is the deep convolutional neural network (DCNN). Many models have grown out of convolutional networks, and most of them have evolved around a single idea: make the architecture deeper and wider, and accuracy will increase.
However, despite all this progress in accuracy, one thing has hardly changed since the invention of convolutional neural networks: the meta-architecture.
What this means is that essentially every design gradually reduces the resolution of the input image while encoding it into intermediate features. Despite all the progress made so far, this idea has not changed. But while it is valid for the classification task, is it also valid for the detection task? On this point, Yann LeCun has remarked that while high resolution may be needed to detect the presence of a feature, its exact location need not be determined with equally high precision. In other words, the scale-decreasing design suits classification but was never intended for detection. Indeed, DetNet has demonstrated that lowering the resolution in this way is unsuitable for detection.
Then look at the image below.
This image shows the architecture of YOLOv4, a famous detection model. Notice something odd here: backbones such as VGG16 and ResNet are used in the Backbone part of the figure, even though we just described this design as unsuitable for detection. The Neck part was born from the idea that if lowering the resolution is harmful, a decoder can restore it afterwards. However, even if the decoder restores the resolution, the positional information lost in the backbone is not recovered. The authors therefore reasoned that detection accuracy could be improved by designing a backbone that does not discard this information in the first place.
Proposed method
The authors propose a new meta-architecture called the scale-permuted model, which has the following two advantages:
- The resolution of the intermediate feature maps can be increased or decreased at any depth → the model retains spatial information even when it is deep.
- Feature maps can be connected across resolutions, enabling multi-scale feature fusion.
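As a rough illustration of these two properties (the block names, levels, and connections below are hypothetical, not taken from the paper), a scale-permuted backbone can be written as a small graph whose feature levels are not monotonic and whose parents may sit at any scale:

```python
# Hypothetical sketch of a scale-permuted backbone specification: each block
# records its feature level (resolution = input_size / 2**level) and its
# parent blocks. Unlike a conventional backbone, the levels do NOT increase
# monotonically, and a block may fuse parents from different scales.
blocks = [
    # (name, level, parents)
    ("b1", 2, []),
    ("b2", 4, ["b1"]),        # resolution drops ...
    ("b3", 3, ["b1", "b2"]),  # ... then rises again (scale permutation)
    ("b4", 5, ["b2", "b3"]),  # cross-scale fusion of level-4 and level-3 maps
]

levels = [level for _, level, _ in blocks]
print(levels)  # [2, 4, 3, 5] -- not sorted, unlike a standard backbone
```

A conventional scale-decreasing backbone would be the special case where `levels` is sorted and each block's only parent is its predecessor.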
It is somewhat similar to HRNet, but while HRNet uses a regular parallel multi-scale design, SpineNet adopts a much freer multi-scale design, as shown in the figure below.
On the left is the conventional meta-architecture and on the right is SpineNet. You can see that it is a fairly complex design. Clearly, the search space for a design in which even the network connections and block positions are unconstrained can be enormous. Therefore, this paper uses neural architecture search (NAS) to perform the search.
In other words, the paper argues that, in theory, such a scale-permuted model could be designed to work well; however, because the search space for such a model is huge, the search is left to the AI.
NAS
As in NAS-FPN, the search is built on RetinaNet. The difference is that in this paper the encoder and decoder are combined into one and searched jointly, trained directly on the COCO detection dataset.
(1) Reordering of resolutions (scale permutation)
The order of the building blocks in the network is important. Using ResNet-50 as the baseline, the bottleneck blocks are rearranged to find the best ordering. By permuting the intermediate and output blocks separately, the scale-permutation search space is defined with size (N−5)!·5!, where N is the number of blocks and 5 of them are output blocks.
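As a sanity check of how large this permutation space is, the count (N−5)!·5! can be computed directly. The value N = 16 below (the number of bottleneck blocks in ResNet-50) is our assumption for illustration, not a figure quoted from the paper:

```python
import math

def permutation_search_space(n_blocks: int, n_outputs: int = 5) -> int:
    # Intermediate and output blocks are permuted separately:
    # (N - 5)! orderings of intermediate blocks times 5! orderings of outputs.
    return math.factorial(n_blocks - n_outputs) * math.factorial(n_outputs)

# Assuming N = 16 blocks (ResNet-50 has 16 bottleneck blocks):
print(permutation_search_space(16))  # 11! * 5! = 4790016000
```

Roughly 4.8 billion orderings for the permutation alone, before cross-scale connections are even chosen, which is why the search is handed to NAS.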
(2) Cross-scale connections
In the search space, each block is given two input connections, as can be seen in the figure. This is easy to say, but because the scales of the connected blocks differ, a resampling step is needed. The proposed cross-scale connection is shown in the figure.
To increase the scale (upper path)
- To reduce the computational cost of resampling, a scaling factor α (= 0.5) is applied to the channel dimension
- Up-sampling by nearest-neighbor interpolation
To reduce the scale (lower path)
- The same channel-scaling factor α (= 0.5) is applied
- A 3×3 convolution with stride 2
- Max pooling
Fusion is element-wise addition. Since the resampling must be recomputed every time a connection changes, exploring this space by hand would be impossible.
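The resampling and fusion steps above can be sketched minimally in NumPy. This is a simplification under stated assumptions: channels, the α channel scaling, and the 3×3 stride-2 convolution are omitted, and the spatial sizes and factor of 2 are illustrative choices, not values from the paper:

```python
import numpy as np

def nearest_upsample(x: np.ndarray, factor: int) -> np.ndarray:
    # Nearest-neighbor up-sampling: repeat each pixel along H and W.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def max_pool_downsample(x: np.ndarray, factor: int) -> np.ndarray:
    # Max pooling with stride = factor (the 3x3 stride-2 convolution that
    # precedes it in the paper is omitted in this sketch).
    h, w = x.shape
    x = x[: h - h % factor, : w - w % factor]
    return x.reshape(h // factor, factor, w // factor, factor).max(axis=(1, 3))

# Two parent feature maps at different scales, fused into an 8x8 target:
hi_res = np.random.rand(16, 16)  # finer parent  -> down-sample by 2
lo_res = np.random.rand(4, 4)    # coarser parent -> up-sample by 2
fused = max_pool_downsample(hi_res, 2) + nearest_upsample(lo_res, 2)
print(fused.shape)  # (8, 8): fusion is element-wise addition
```

The point of the sketch is only the shape bookkeeping: both parents must be resampled to the target resolution before the element-wise addition is possible.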
Experiment
The table below shows the evaluation of SpineNet in COCO.
The computational complexity of the model is almost the same as ResNet-50, since it only rearranges the order of ResNet-50's building blocks. The average precision (AP) is 2.9% better than ResNet-50-FPN on the object detection task. Furthermore, by adjusting the residual and bottleneck blocks used in the ResNet model, the authors reduced FLOPs by a further 10%.
In addition, SpineNet can be transferred to classification tasks as well, achieving a top-1 accuracy improvement of approximately 5% on the iNaturalist image classification dataset.
Not only that, the authors also showed improved accuracy on the instance segmentation task.
Summary
The paper builds a meta-architecture that combines the functions of encoder and decoder into one, which is somewhat similar to the idea of HRNet, but more flexible. Experiments also demonstrated the generalizability of SpineNet, with good results on detection, instance segmentation, and classification tasks. There is still work to be done; how the NAS search space is defined affects the quality of the resulting ordering.
In the future, searching for the optimal network with NAS may become the norm. Until now we have worked with fixed network forms, but once the search space allows all the network blocks developed so far to be connected freely, it would not be surprising if architectures with strong generalization performance emerge. The era in which AI automatically optimizes parameters, automatically creates model architectures, and increasingly builds optimal AI itself seems to be approaching.
But at that point, it is no longer clear at all what was important or what contributed to the result. Connections appear and disappear in ways that are impossible for humans to follow, making the resulting architecture hard to interpret.