Self-supervised Learning Improves Self-supervised Learning!!!
3 main points
✔️ Proposed Hierarchical PreTraining (HPT) for hierarchical self-supervised learning
✔️ Validation experiments on as many as 16 diverse datasets
✔️ HPT enables 80x faster learning and improved robustness
Self-Supervised Pretraining Improves Self-Supervised Pretraining
written by Colorado J. Reed, Xiangyu Yue, Ani Nrusimha, Sayna Ebrahimi, Vivek Vijaykumar, Richard Mao, Bo Li, Shanghang Zhang, Devin Guillory, Sean Metzger, Kurt Keutzer, Trevor Darrell
(Submitted on 23 Mar 2021 (v1), last revised 25 Mar 2021 (this version, v2))
Comments: WACV 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Self-supervised learning (SSL) is known to be effective for various image recognition tasks. However, fully exploiting its capabilities requires abundant data and computational power.
Therefore, in most cases, a model pretrained on ImageNet is used, as shown in the figure below.
However, when a model that has been self-supervised pretrained on ImageNet is transferred to a new image recognition task, performance is known to degrade if the task's images (e.g., medical images or aerial photographs) have different characteristics from those in ImageNet.
In this paper, Hierarchical PreTraining (HPT), a hierarchical self-supervised learning method, is proposed to solve this problem, as shown in the figure below. In HPT, SSL is performed sequentially: first on the Base data, then on the Source data, and finally on the Target data, gradually approaching the target task.
Here, Base data refers to a large dataset (ImageNet), Source data refers to a dataset that is relatively large and has characteristics similar to the Target data, and Target data refers to the dataset of the target task. In this paper, the effectiveness of HPT is confirmed through validation experiments on as many as 16 diverse datasets. Below, we describe the experiments validating the effectiveness of HPT.
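The HPT schedule described above can be sketched as a sequence of SSL stages, each initialized from the weights of the previous one. This is a minimal sketch: `ssl_pretrain` is a hypothetical placeholder for the actual MoCo-v2 update loop, and the step counts are illustrative assumptions, not the paper's exact values.

```python
import torch.nn as nn

def ssl_pretrain(model: nn.Module, loader, steps: int) -> nn.Module:
    """Placeholder for one self-supervised pretraining stage (e.g. MoCo-v2)."""
    for _, batch in zip(range(steps), loader):
        pass  # the contrastive update for this batch would go here
    return model

def hierarchical_pretrain(model, base_loader, source_loader, target_loader):
    """HPT: SSL on Base data, then (optionally) Source data, then Target data."""
    model = ssl_pretrain(model, base_loader, steps=200_000)    # Base (ImageNet)
    if source_loader is not None:                              # Source (optional)
        model = ssl_pretrain(model, source_loader, steps=5_000)
    model = ssl_pretrain(model, target_loader, steps=5_000)    # Target
    return model
```

In practice, the Base stage is amortized: a single ImageNet-pretrained checkpoint can be reused, so only the short Source/Target stages need to be run per task.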
A total of 16 datasets spanning 5 domains are used in the experiments.
Four methods with different SSL procedures, including HPT, are compared, with MoCo-v2 used as the SSL method.
- Base: SSL using ImageNet (only the Batch Normalization layer is updated using Target data)
- Target: SSL with Target data
- HPT (proposed method): SSL using ImageNet -> (SSL using Source data) -> SSL using Target data
- HPT-BN: SSL using ImageNet -> (SSL using Source data) -> SSL using Target data, updating only Batch Normalization layer
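The BatchNorm-only updates used by the Base baseline and the HPT-BN variant can be implemented in PyTorch by freezing every parameter except those of the BatchNorm layers. A minimal sketch (the helper name is ours, not from the paper):

```python
import torch.nn as nn

def train_only_batchnorm(model: nn.Module) -> nn.Module:
    """Freeze all parameters except those of BatchNorm layers (HPT-BN-style)."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            for p in m.parameters():
                p.requires_grad = True  # re-enable gradients for BN affine params
    return model
```

Since BN layers hold only a tiny fraction of a network's parameters, this makes the final adaptation stage very cheap.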
The features learned by SSL are evaluated by training a linear classifier on top of them.
A linear classifier that takes the SSL features as input is trained with labels. Since a linear classifier by itself has limited capacity, better classification performance indicates that SSL has extracted better features.
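This linear evaluation protocol amounts to freezing the SSL backbone and training only a linear layer on its features. A minimal PyTorch sketch, where the toy backbone stands in for the actual MoCo-v2 encoder:

```python
import torch
import torch.nn as nn

def linear_eval_head(backbone: nn.Module, feat_dim: int, num_classes: int) -> nn.Module:
    """Freeze the SSL backbone and attach a trainable linear classifier."""
    for p in backbone.parameters():
        p.requires_grad = False  # features stay fixed; only the head is trained
    backbone.eval()
    return nn.Sequential(backbone, nn.Linear(feat_dim, num_classes))

# Example with a toy backbone standing in for the MoCo-v2 encoder
backbone = nn.Sequential(nn.Flatten(), nn.Linear(12, 16))
classifier = linear_eval_head(backbone, feat_dim=16, num_classes=5)
logits = classifier(torch.rand(2, 3, 2, 2))  # shape (2, 5)
```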
The experimental results are shown in the figure above. The Target data is shown above each graph; the horizontal axis represents the number of update steps of the linear classifier and the vertical axis its performance (Accuracy or AUROC).
In this experiment, the Source-data SSL stage was skipped for HPT and HPT-BN; that is, SSL on ImageNet was followed directly by SSL on the Target data. The following points were confirmed from the experimental results.
- HPT converged on 15 out of 16 datasets with performance equal to or better than Base and Target.
- HPT converged 80 times faster than Base and Target. (HPT converged in 5k steps, while Base and Target converged in 400k steps)
- In DomainNet quickdraw, the performance of HPT was inferior to that of Target, and the reason was thought to be the large feature difference between ImageNet and DomainNet quickdraw.
We verify the performance of each method when semi-supervised learning is performed.
After self-supervised pretraining, fine-tuning is performed using 1000 labeled examples randomly selected from the Target data, with the selection constrained so that at least one example from each class is included.
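The class-covering random selection described above could look like the following sketch; the paper's exact sampling procedure is not specified here, so treat this as an assumption:

```python
import random
from collections import defaultdict

def balanced_subset(labels, budget, seed=0):
    """Pick `budget` indices so that every class appears at least once,
    then fill the rest uniformly at random."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # One guaranteed example per class...
    chosen = [rng.choice(idxs) for idxs in by_class.values()]
    # ...then top up to the label budget at random.
    remaining = [i for i in range(len(labels)) if i not in set(chosen)]
    rng.shuffle(remaining)
    chosen += remaining[: budget - len(chosen)]
    return sorted(chosen)
```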
The above figure shows the experimental results, where B represents Base and T represents Target.
As in the previous experiment, the Source-data SSL stage is skipped for HPT and HPT-BN; SSL on ImageNet is followed directly by SSL on the Target data.
The following are confirmed from the experimental results.
- HPT converged to perform better than Base and Target on 15 out of 16 datasets (excluding DomainNet quickdraw).
- HPT-BN did not exceed the performance of HPT.
Sequential pretraining transferability
We verify the performance of each method when transfer learning is performed.
The above figure shows the experimental results, where B represents SSL with Base data, S represents SSL with Source data, and T represents SSL with Target data. For example, B+S represents SSL with ImageNet -> SSL with Source data, and B+S+T represents SSL with ImageNet -> SSL with Source data -> SSL with Target data.
In addition, the Source and Target data used are displayed at the top of each graph. For example, the left graph shows that ImageNet is used as the Base data, CheXpert as the Source data, and Chest-X-ray-kids as the Target data. The following is what we could confirm from the experimental results.
- The best performance was obtained when SSL was performed with B+S+T (i.e., HPT).
We test robustness to reduced data augmentation during SSL. SSL is performed with fewer data augmentations, and the features obtained after training are evaluated with a linear classifier.
The data augmentations used are RandomResizedCrop, ColorJitter, Grayscale, GaussianBlur, and RandomHorizontalFlip.
Again, in this experiment the Source-data SSL stage is skipped for HPT; SSL on ImageNet is followed directly by SSL on the Target data.
The figures above show the experimental results. In each graph, the number of augmentation types used decreases toward the right. The following points were confirmed from the experimental results.
- Compared to Target, HPT maintained higher performance as the data augmentations used were reduced.
- When using the CheXpert data (right panel), HPT's performance decreased as the number of data augmentations was reduced, but it never fell below Target's performance.
Pretraining data robustness
We test how robust each method is to the amount of Target data used in self-supervised pretraining.
The above figures show the experimental results. In each graph, the amount of Target data used in SSL increases toward the right. The following is what we could confirm from the experimental results.
- HPT outperformed the other methods when less data was available.
- HPT-BN was superior to the other methods when fewer than 5k examples were available.
In this article, we introduced HPT, a hierarchical SSL method. The validation experiments showed that HPT is a simple yet powerful technique. Since it is a practical method that is easy to implement and saves both data and computation, we expect further development of HPT.