ConvNeXt V2: Improving And Scaling ConvNets With Masked Autoencoders

Image Recognition 03/04/2024

3 main points
✔️ ConvNeXt, a state-of-the-art CNN, was designed for supervised learning, but its performance could be improved further by combining it with self-supervised learning such as masked autoencoders (MAE)
✔️ Experiments show that simply combining the two gives subpar performance
✔️ The paper proposes a fully convolutional masked autoencoder (FCMAE) and a new Global Response Normalization (GRN) layer that can be added to ConvNeXt, which together significantly improve ConvNeXt performance

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
written by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie
(Submitted on 2 Jan 2023)
Comments: Code and models available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Innovations in neural network architectural design have played an important role in the field of image recognition. Convolutional neural networks (ConvNets), an alternative to manual feature engineering, have had a major impact on computer vision research by providing a generic feature learning method for a variety of visual recognition tasks.

The Transformer architecture was originally developed for natural language processing, but once applied to image recognition it quickly came to dominate the field, outperforming ConvNets in both accuracy and scalability.

More recently, ConvNeXt (Liu et al., 2022) modernized traditional ConvNets, achieving state-of-the-art accuracy on image recognition tasks and showing that pure convolutional models can also be a scalable architecture.

To explore better neural network designs, a promising approach is to combine ConvNeXt not only with supervised learning but also with self-supervised learning such as masked autoencoders (MAE). However, there are two challenges in combining the two.

One is that MAE has a specific encoder-decoder design optimized for the sequence processing capabilities of transformers, which may make it incompatible with standard ConvNets. The other is that previous studies have shown that training ConvNets with mask-based self-supervised learning is difficult.

The paper covered in this article proposes a fully convolutional masked autoencoder (FCMAE) and a new Global Response Normalization (GRN) layer that can be added to ConvNeXt. Together they significantly improve overall ConvNeXt performance and achieve SOTA results on the ImageNet dataset.

Fully convolutional mask autoencoder (FCMAE)

Figure 1: Overview of the proposed FCMAE

Masking

FCMAE employs random masking with a masking ratio of 0.6. Because the convolutional model has a hierarchical design, its features are downsampled at different stages.

The mask is therefore generated at the last (coarsest) stage and recursively upsampled to the finest resolution. Specifically, 60% of the 32x32 patches are randomly removed from the original input image. Data augmentation is minimal and involves only random resized cropping.
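The masking procedure can be sketched in a few lines of NumPy. This is only an illustration: the 224x224 input size, the function name `make_patch_mask`, and the use of nearest-neighbour repetition for upsampling are assumptions made for the example, not details taken from the article.

```python
import numpy as np

def make_patch_mask(image_size=224, patch_size=32, mask_ratio=0.6, seed=0):
    """Return (coarse_mask, pixel_mask); 1 marks a masked (removed) position."""
    rng = np.random.default_rng(seed)
    grid = image_size // patch_size             # patches per side (assumed 224 // 32 = 7)
    num_patches = grid * grid
    num_masked = int(num_patches * mask_ratio)  # number of patches to remove

    # Pick the masked patches uniformly at random at the coarsest resolution.
    flat = np.zeros(num_patches, dtype=np.uint8)
    flat[rng.permutation(num_patches)[:num_masked]] = 1
    coarse_mask = flat.reshape(grid, grid)

    # Upsample to pixel resolution by repeating each entry patch_size times
    # along both axes (nearest-neighbour upsampling).
    pixel_mask = coarse_mask.repeat(patch_size, axis=0).repeat(patch_size, axis=1)
    return coarse_mask, pixel_mask

coarse, pixel = make_patch_mask()
print(coarse.shape, pixel.shape, int(coarse.sum()))
```

Generating the mask on the coarse grid first guarantees that every masked region lines up with a whole 32x32 patch at every intermediate resolution.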

Encoder Design

In this paper, ConvNeXt is used as the encoder. One of the challenges of effective masked image modeling is to ensure that the model does not learn shortcuts that let it simply copy and paste information from masked regions. This is harder to prevent with ConvNets than with transformers, because the 2D image structure must be preserved and the masked patches cannot simply be dropped from the input.

A naive solution is to introduce learnable mask tokens on the input side, but this creates an inconsistency between training and testing, since no mask tokens are present at test time.

To solve this issue, the standard convolution layers in the encoder are replaced with sparse convolutions during pre-training, as shown in Figure 1, so that the encoder operates only on the visible pixels. The sparse convolution layers can be converted back to standard convolutions in the fine-tuning phase without additional processing.
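The effect of a sparse convolution can be emulated on dense arrays by applying the binary mask before and after an ordinary convolution. The sketch below is an illustration under simplifying assumptions (one channel, an odd-sized kernel, a naive loop); `conv2d_same` and `masked_conv2d` are hypothetical helper names, not functions from any sparse convolution library.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Dense single-channel 2D convolution (cross-correlation) with zero padding."""
    kh, kw = kernel.shape  # assumed odd, so the output keeps the input size
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def masked_conv2d(x, kernel, visible):
    """Emulate a sparse convolution: zero masked inputs, convolve, re-mask outputs."""
    x_visible = x * visible               # masked pixels contribute nothing
    out = conv2d_same(x_visible, kernel)
    return out * visible                  # no features are produced at masked positions

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))
visible = np.ones((8, 8))
visible[:, 4:] = 0                        # hide the right half of the feature map

y = masked_conv2d(x, kernel, visible)

# Changing the hidden content must not change the result.
x_changed = x.copy()
x_changed[:, 4:] += 100.0
y_changed = masked_conv2d(x_changed, kernel, visible)
print(np.allclose(y, y_changed))
```

Because the hidden pixels never influence the output, the encoder has no way to copy information out of the masked regions, and no mask token is needed at the input.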

Decoder Design

The decoder is a single lightweight, plain ConvNeXt block. Because the encoder is heavier and hierarchical, the overall architecture is asymmetric. More complex decoders were also considered, but the single ConvNeXt block decoder achieved good fine-tuning accuracy while considerably reducing pre-training time (see Table 1). The decoder dimension is 512.

Global Response Normalization (GRN)

Figure 2: Visualization of feature activation

Figure 3: Feature cosine distance analysis

The motivation for GRN is illustrated in Figures 2 and 3. As shown in Figure 2, several feature maps of the earlier ConvNeXt (ConvNeXt V1) are dead or saturated, making activations redundant across channels.

In addition, as shown in Figure 3, the features extracted by ConvNeXt V1 become more similar to one another in the deeper layers. This problem is especially acute when ConvNeXt V1 is combined with the proposed FCMAE, and GRN is introduced as a way to diversify features during training and prevent feature collapse.

GRN consists of three steps: 1) global feature aggregation, 2) feature normalization, and 3) feature calibration. First, the spatial feature map Xi of each channel i is aggregated into a vector gx by a global function G(·):

G(X) := X ∈ R^(H×W×C) → gx ∈ R^C

This can be viewed as a simple pooling layer. As shown in Table 2a, the authors experimented with a number of different functions and found that the widely used feature aggregator, global average pooling, did not perform well. Instead, norm-based feature aggregation with the L2 norm improved performance, giving gx = {||X1||, ||X2||, ..., ||XC||}.

Next, the response normalization function N(·) is applied to the aggregated values. Specifically, standard divisive normalization is used:

N(||Xi||) := ||Xi|| / Σj ||Xj||,  j = 1, ..., C

As with other forms of normalization, this step creates feature competition across channels through mutual inhibition. Table 2b also considers other normalization functions and shows that simple divisive normalization is most effective, although standardization gives similar results when applied to the same L2-norm aggregated values.

Finally, the computed feature normalization scores are used to calibrate the original input responses:

Xi = Xi ∗ N(G(X)i)

To facilitate optimization, two additional learnable parameters, γ and β, are added and initialized to zero. A residual connection is also added between the input and output of the GRN layer.

The resulting final GRN is: Xi = γ ∗ Xi ∗ N(G(X)i) + β + Xi. This setup allows the GRN layer to initially perform the identity function and gradually adapt during training. The importance of the residual connection is illustrated in Table 2c.
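The three steps and the final formula translate almost line by line into NumPy. The following is a minimal sketch rather than the authors' implementation: it assumes a single feature map of shape (H, W, C), follows the sum-based divisive normalization described here, and adds a small `eps` term for numerical safety, which is an assumption of this example.

```python
import numpy as np

def grn(x, gamma, beta, eps=1e-6):
    """Global Response Normalization on a feature map of shape (H, W, C)."""
    # 1) Global feature aggregation: L2 norm of each channel over the spatial dims.
    gx = np.sqrt(np.sum(x ** 2, axis=(0, 1), keepdims=True))   # shape (1, 1, C)
    # 2) Divisive normalization: each channel norm divided by the sum over channels.
    #    eps is an assumption added here to avoid division by zero.
    nx = gx / (gx.sum(axis=-1, keepdims=True) + eps)
    # 3) Calibration with learnable gamma and beta, plus the residual connection.
    return gamma * (x * nx) + beta + x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 3))

# With gamma and beta initialized to zero, the layer is the identity function.
out_init = grn(x, gamma=np.zeros(3), beta=np.zeros(3))
print(np.allclose(out_init, x))

# After training, non-zero gamma rescales each channel by its relative response.
out_scaled = grn(x, gamma=np.ones(3), beta=np.zeros(3))
print(out_scaled.shape)
```

Dividing by the channel mean instead of the channel sum would change the scores only by a constant factor of C, which the learnable γ can absorb.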

The effectiveness of GRN is also illustrated in Figures 2 and 3. The visualization in Figure 2 and the cosine distance analysis in Figure 3 show that ConvNeXt V2 with GRN effectively mitigates the feature collapse problem. The consistently high cosine distance values confirm that feature diversity is maintained across layers.
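A cosine distance analysis of this kind can be sketched as follows. The metric assumed here is the average of (1 - cos(Xi, Xj)) / 2 over all pairs of channels, with each channel flattened to a vector; the exact definition behind Figure 3 is an assumption of this example, and `mean_pairwise_cosine_distance` is a hypothetical helper name.

```python
import numpy as np

def mean_pairwise_cosine_distance(x, eps=1e-12):
    """Average of (1 - cos(Xi, Xj)) / 2 over all channel pairs of an (H, W, C) map."""
    h, w, c = x.shape
    flat = x.reshape(h * w, c).T                      # (C, HW): one row per channel
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, eps)              # guard against all-zero channels
    cos = unit @ unit.T                               # (C, C) cosine similarities
    return float(np.mean((1.0 - cos) / 2.0))

rng = np.random.default_rng(0)

# Collapsed features: every channel is a copy of the same map, so the distance is about 0.
collapsed = np.tile(rng.normal(size=(4, 4, 1)), (1, 1, 8))
print(mean_pairwise_cosine_distance(collapsed))

# Diverse features: independent random channels give a clearly larger distance.
diverse = rng.normal(size=(4, 4, 8))
print(mean_pairwise_cosine_distance(diverse))
```

A value near zero means the channels are redundant, while larger values indicate that the channels encode different information.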

Experiments on the ImageNet dataset

Importance of co-designing the architecture and self-supervised learning

The results in Table 3 show the importance of the proposed approach: using the FCMAE framework without modifying the model architecture has only a limited impact on representation learning quality.

Similarly, the proposed GRN layer has little impact on performance under the supervised setting. The combination of the two, on the other hand, significantly improves fine-tuning performance.

Table 3. Importance of co-designing the architecture and self-supervised learning

Model Scaling

In this study, eight models of varying sizes were evaluated, ranging from the low-capacity 3.7M-parameter Atto model to the high-capacity 650M-parameter Huge model. These models were pre-trained with the proposed FCMAE framework, and their fine-tuning results were then compared with those of their fully supervised counterparts.

The results in Figure 4 show strong model scaling behavior, with the pre-trained models consistently outperforming the supervised baseline across all model sizes. This is the first time that both the effectiveness and the efficiency of masked image modeling have been demonstrated over such a wide range of model sizes.

Comparison with conventional methods

In this experiment, the proposed method is compared with previous masked autoencoder methods designed for transformer-based models. The results are summarized in Table 4.

The proposed method outperforms the Swin transformer pre-trained with SimMIM for all model sizes. Compared with the plain ViT pre-trained with MAE, it performs similarly up to the large model regime despite having far fewer parameters (198M vs. 307M).

However, in the huge model regime, the accuracy of the proposed method is slightly lower than that of the previous methods. This may be because huge ViT models benefit more from self-supervised pre-training. As the next experiment shows, additional intermediate fine-tuning may make up for this difference.

Table 4. Comparison with conventional methods

Intermediate fine-tuning on ImageNet-22K

Table 5 shows the results of intermediate fine-tuning on ImageNet-22K. The training process consists of three stages: 1) FCMAE pre-training, 2) ImageNet-22K fine-tuning, and 3) ImageNet-1K fine-tuning. Images with a resolution of 384x384 are used for pre-training and fine-tuning. The results are compared with state-of-the-art architectures, including convolution-based, transformer-based, and hybrid designs, and the proposed method achieves the highest accuracy among them.

Table 5. ImageNet-1K fine-tuning results using IN-21K labels

Experiments in transfer learning

In this experiment, we benchmark transfer learning performance. First, we compare ConvNeXt V1 + supervised learning with ConvNeXt V2 + FCMAE to verify the effectiveness of the proposed method. We also compare the proposed approach with Swin transformer models pre-trained with SimMIM.

Object detection and segmentation on the COCO dataset

Table 6. Object detection and instance segmentation results on COCO

Mask R-CNN (He et al., 2017) was fine-tuned on the COCO dataset, and the detection mAP^box and segmentation mAP^mask are reported on the COCO val2017 set. Results are presented in Table 6.

Performance improves step by step as the proposed components are applied: moving from V1 to V2, performance improves with the introduction of GRN. The model benefits further from better initialization when moving from supervised to FCMAE-based self-supervised learning.

The best performance is achieved when both are applied together. In addition, ConvNeXt V2 pre-trained with FCMAE outperforms the Swin transformer counterparts for all model sizes, with the largest gap in the huge model regime.

Semantic Segmentation on ADE20K

Table 7. Semantic segmentation results on ADE20K

Finally, the UperNet framework (Xiao et al., 2018) was used for a semantic segmentation task on ADE20K. The results show a trend similar to the object detection experiments, with the final model improving significantly over the V1 supervised baseline.

It performs on par with the Swin transformer in the base and large model regimes, and outperforms Swin in the huge model regime.

Conclusion

This article introduced ConvNeXt V2, a new ConvNet model family designed to cover a wider range of complexity. The changes to the architecture are minimal, but it is designed to be particularly well suited to self-supervised learning.

Fully convolutional masked autoencoder pre-training significantly improves the performance of pure ConvNets across a variety of downstream tasks, including ImageNet classification, COCO object detection, and ADE20K segmentation.