
Can The Robustness Gained From ImageNet Training Be Used For Downstream Tasks In Transfer Learning?


3 main points
✔️ Architectural differences matter for whether robustness transfers
✔️ When all layers are re-trained, the Transformer architecture is more effective than CNNs with data augmentation
✔️ Transferring robustness from ImageNet is more difficult for image classification than for object detection or semantic segmentation

Does Robustness on ImageNet Transfer to Downstream Tasks?
written by Yutaro Yamada, Mayu Otani
(Submitted on 8 Apr 2022)
Comments: CVPR 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In recent years, there has been a great deal of research on image recognition models.
In this field, ImageNet serves as an important benchmark, and many models and training techniques have been developed based on it.

Accuracy on ImageNet has been regarded as a proxy for the progress of machine learning systems, but a lack of robustness has been pointed out: the accuracy of a model drops significantly when noise is added to the image.

One way to improve the robustness of a model and address this problem is to use data augmentation (ANT, AugMix, DeepAug, etc.).
Data augmentation aims to improve the robustness of a model by training it on additional data created by artificially corrupting the original training data with some transformation.
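As a concrete illustration, here is a minimal sketch (not the paper's exact pipeline) of how a corruption-oriented augmentation such as AugMix can be inserted into a standard training pipeline with torchvision; the dataset path is a placeholder.

```python
import torch
from torchvision import datasets, transforms

# Minimal sketch: corruption-oriented augmentation (here torchvision's AugMix,
# requires torchvision >= 0.13) is applied on-the-fly, so the model also sees
# artificially distorted samples during training.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.AugMix(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "path/to/imagenet/train" is a placeholder for the actual ImageNet directory.
train_set = datasets.ImageFolder("path/to/imagenet/train", transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)
```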

On the other hand, in the actual use of models, it is common to perform transfer learning, which exploits the feature extraction ability of models pre-trained on a large-scale dataset such as ImageNet.
In transfer learning, the higher the accuracy of the original image recognition model, the stronger its feature extraction ability and the higher the accuracy after transfer. However, there has been little research on how robustness behaves under transfer learning.

Therefore, in this research, we investigate the question: if a model is highly robust on a major image recognition dataset, is it necessarily robust on downstream tasks?

In the following chapters, we briefly explain transfer learning and ViT (Vision Transformer) as background knowledge, and then describe the experiments and their results.

What is transfer learning?

In transfer learning, we reuse a trained model by taking over the parameters of a model that has already been trained on some data in advance.
Here, the data learned in advance is called the source data, and the model pre-trained on the source data is called the source model.

The data to be learned next is called the target data, and the model to be trained on it is called the target model.

Through training, the source model becomes able to detect features of the source data. By reusing the source model, training can start from a state in which features shared with the source data can already be detected in the target data, which makes it possible to build a highly accurate model with only a small amount of training.

As shown in the figure below, there are two types of transfer learning: one keeps the parameters taken over from the source model fixed and trains only the newly added layers on the target data, and the other re-trains all layers on the target data.
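To make the two settings concrete, here is a minimal PyTorch sketch assuming a torchvision ResNet50 as the source model (an illustration, not the paper's exact training code): the fixed-feature setting freezes the transferred parameters and trains only a new classification head, while the full fine-tuning setting re-trains every layer.

```python
import torch.nn as nn
from torchvision import models

def build_model(num_classes: int, finetune_all_layers: bool) -> nn.Module:
    # Source model: ResNet50 pre-trained on ImageNet (torchvision weights).
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

    if not finetune_all_layers:
        # Fixed-feature setting: freeze every parameter taken over from the source model.
        for param in model.parameters():
            param.requires_grad = False

    # Replace the classification head; only this head (fixed-feature setting)
    # or the whole network (full fine-tuning setting) is trained on the target data.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

fixed_feature_model = build_model(num_classes=10, finetune_all_layers=False)
full_finetune_model = build_model(num_classes=10, finetune_all_layers=True)
```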

What is ViT (Vision Transformer)?

ViT (Vision Transformer) is an architecture for image recognition proposed in "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale", which applies the Transformer used in natural language processing to image recognition tasks.

ViT achieves state-of-the-art accuracy on the ImageNet / ImageNet-ReaL tasks while reducing the computational cost to about 1/15 of that of the previous SoTA model, and it has attracted much attention in recent years.
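The core idea can be sketched in a few lines of PyTorch: the image is cut into fixed-size patches, each patch is flattened and linearly projected to a token, and the token sequence is processed by a standard Transformer encoder. This sketch omits the class token and positional embeddings used in the actual ViT.

```python
import torch

image = torch.randn(1, 3, 224, 224)   # dummy input batch
patch_size = 16

# Cut the image into 14x14 = 196 patches of size 16x16.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)   # (1, 3, 196, 16, 16)
tokens = patches.permute(0, 2, 1, 3, 4).flatten(2)                      # (1, 196, 768)

# Linear patch embedding followed by a standard Transformer encoder
# (12 layers / 12 heads, roughly ViT-Base sized, for illustration only).
embed = torch.nn.Linear(3 * patch_size * patch_size, 768)
encoder_layer = torch.nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=12)
features = encoder(embed(tokens))                                        # (1, 196, 768)
```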


Previous Studies and Research Objectives

A previous study using the Swin Transformer, a type of ViT, compared the accuracy of the Swin Transformer and a CNN on ImageNet-C (ImageNet images with corruptions applied; see the reference image below) and reported that the Swin Transformer is superior to the CNN.

This result suggests that ViT is more tolerant to noise than CNNs.
On the other hand, it has also been reported that when data augmentation is applied to the data the CNNs are trained on, their accuracy becomes comparable to that of the Swin Transformer.

From these results, the experiments in this paper were conducted to confirm the following two points.

  • Can CNNs transfer robustness gained through data augmentation to downstream tasks during transfer learning?
  • Which matters more for the transfer of robustness: data augmentation or architectural differences?

Experimental conditions and contents

Experimental conditions

As source models, we used two CNNs pre-trained on ImageNet-1k with data augmentation, one with ANT and the other with DeepAug and AugMix, as well as a Swin Transformer pre-trained on ImageNet-1k.
Note that no data augmentation was used when pre-training the Swin Transformer.

To keep the model sizes comparable, we used ResNet50 (25M parameters) as the CNN and Swin Transformer-Tiny (28M parameters) as the Swin Transformer.
We also used Mask-RCNN for the object detection task and UperNet as the head for the semantic segmentation task.
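As a rough illustration of this setup, the sketch below loads the two backbones via the timm library and compares their parameter counts; the weights loaded here are timm's standard ImageNet weights, not the ANT or DeepAug+AugMix checkpoints used in the paper.

```python
import timm

# The two backbones compared in the paper, chosen so their sizes are roughly matched.
resnet50 = timm.create_model("resnet50", pretrained=True)                       # ~25M params
swin_tiny = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True)  # ~28M params

def count_params(model):
    return sum(p.numel() for p in model.parameters()) / 1e6

print(f"ResNet50:  {count_params(resnet50):.1f}M parameters")
print(f"Swin-Tiny: {count_params(swin_tiny):.1f}M parameters")
```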

Experimental details

We measured how robust ImageNet classification models remain on downstream tasks after transfer.
To do so, we used the 15 types of image corruption introduced in ImageNet-C, grouped into four categories: "noise", "blur", "weather", and "digital". Robustness was measured by how much the accuracy of a transferred model degrades when it is evaluated on data corrupted in these 15 ways.
Specifically, we defined the average performance degradation and the relative performance of the model and used them to evaluate robustness.
We also used MS-COCO for object detection, ADE20K for semantic segmentation, and CIFAR-10 for image classification as the datasets used for the downstream tasks.
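The exact definitions of these two quantities are given in the paper; as an illustration only, the sketch below computes one plausible version of them from per-corruption scores (all numbers are placeholders, not results from the paper).

```python
import numpy as np

# Hypothetical per-corruption scores of a transferred model, e.g. mAP for
# detection or mIoU for segmentation (placeholder values).
clean_score = 0.46
corrupted_scores = {  # 15 corruption types in 4 categories
    "gaussian_noise": 0.31, "shot_noise": 0.30, "impulse_noise": 0.29,
    "defocus_blur": 0.35, "glass_blur": 0.33, "motion_blur": 0.34, "zoom_blur": 0.32,
    "snow": 0.36, "frost": 0.35, "fog": 0.38, "brightness": 0.43,
    "contrast": 0.37, "elastic_transform": 0.39, "pixelate": 0.40, "jpeg_compression": 0.41,
}

mean_corrupted = np.mean(list(corrupted_scores.values()))

# Average performance degradation: how far, on average, the corrupted score
# falls below the clean score (illustrative definition; see the paper for the exact equations).
avg_degradation = clean_score - mean_corrupted

# Relative performance: corrupted score expressed as a fraction of the clean score.
relative_performance = mean_corrupted / clean_score

print(f"Average degradation:  {avg_degradation:.3f}")
print(f"Relative performance: {relative_performance:.3f}")
```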

Results and Discussion

First, we show the experimental results for the fixed-feature transfer learning setting, in which the parameters taken over from the source model are not updated and only the remaining parameters are trained on the target data.
The upper figure shows the accuracy degradation for the object detection task, and the lower figure is a table summarizing the accuracy degradation for the semantic segmentation task.
Here, each method has the following settings.

  • Regular: ResNet50 pre-trained on ImageNet (clean images) as the source model
  • ANT: ResNet50 pre-trained on data transformed by the data augmentation method ANT as the source model
  • DeepAug+: ResNet50 pre-trained on data transformed by DeepAug+ (DeepAug combined with AugMix) as the source model
  • Swin-T: Swin Transformer-Tiny pre-trained on ImageNet (clean images) as the source model

These results confirm that the CNNs with data augmentation (DeepAug+, ANT) show less accuracy degradation, i.e. higher robustness, than Regular.
In addition, Swin-T achieves higher accuracy than the CNNs with data augmentation for some types of noise.
Since Swin-T sometimes outperforms CNNs that use data augmentation, the difference in architecture may be related to how robustness transfers.

Second, we show the experimental results for the transfer learning setting in which all layers are re-trained on the target data.
Re-training all layers may cause the robustness of the source model to be lost. Since Swin-T outperformed the CNNs with data augmentation under some conditions in the first experiment, it can also be considered that the Transformer architecture is more effective than CNNs with data augmentation.

In this setting as well, the CNNs with data augmentation (DeepAug+, ANT) show less accuracy degradation, i.e. higher robustness, than Regular.
We also confirmed that Swin-T shows slightly higher robustness than ANT.
Furthermore, Swin-T achieved the best performance in both object detection and semantic segmentation.
This indicates that DeepAug+ and ANT transfer ImageNet-C robustness to downstream tasks less well than Swin-T, and confirms that, under conditions where all layers are re-trained, the Transformer architecture is more effective than CNNs with data augmentation.

We also tested whether ImageNet-C robustness transfers to CIFAR-10 and found that the CNNs using data augmentation do not outperform Regular.
This suggests that transferring robustness from ImageNet to image classification is more difficult than transferring it to object detection or semantic segmentation.

Summary

In this study, we investigated whether a model that is highly robust on a major image recognition dataset is necessarily robust on downstream tasks when transfer learning is performed using its pre-trained parameters.

Experimental results show that, for fixed-feature transfer learning, the robustness of the ImageNet backbone is partially preserved in the downstream tasks.
However, for the more practical transfer learning setting, where all layers are re-trained, we found that the contribution of the Transformer architecture is more important than the effect of data augmentation on the CNN.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.

Contact Us