How Successful Is The Self-Supervised Model In Downstream Tasks?

Self-supervised Learning

3 main points
✔️ Compare various self-supervised learning methods
✔️ Compare performance on downstream tasks such as Few-Shot image recognition, object detection, and dense prediction tasks
✔️ Discover a variety of information, including correlations with performance in ImageNet

How Well Do Self-Supervised Models Transfer?
written by Linus Ericsson, Henry Gouk, Timothy M. Hospedales
(Submitted on 26 Nov 2020 (v1), last revised 29 Mar 2021 (this version, v2))
Comments: CVPR 2021.

Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Self-supervised learning models in computer vision, such as SimCLR and MoCo, have made remarkable progress in recent years and have shown performance comparable or superior to supervised learning models.

So how do models pre-trained on ImageNet with various supervised and self-supervised learning methods perform on a variety of downstream tasks? Does self-supervised learning always perform better than supervised learning, or does it lag behind on certain tasks? And does the relative ranking of the self-supervised learning methods vary depending on the task and dataset?

To answer these questions, the paper presented in this article evaluates the self-supervised and supervised learning methods proposed to date on a variety of downstream tasks. Let's take a look at some of the results obtained.

About the experiment

About the method to evaluate

The self-supervised learning methods evaluated in the experiments are as follows.

Contrastive learning method (Contrastive)

Clustering method (Clustering)

For these methods, we use a pre-trained ResNet50 (1x) model as the backbone feature extractor for the downstream task. We also use the pre-trained ResNet50 model available from PyTorch as a supervised learning baseline for comparison.

As for the model, the number of parameters in the backbone is 23.5M, and it is trained on the ImageNet training set consisting of 1.28 million images.

Regarding the settings at the time of pre-training, there are the following differences in terms of training time and data augmentation.

When evaluating a downstream task, we add task-specific heads to the backbone and perform label prediction on the target task.

Depending on the setting, either only the heads are optimized while the backbone is kept fixed, or the entire network is fine-tuned.

About the downstream tasks in the experiment

The tasks used in the experiment can be divided into four main categories.

  • Many-Shot recognition (sufficient amount of labeled data is available on the target domain)
  • Few-Shot recognition (only a few examples of labeled data are available on the target domain)
  • Object detection
  • Dense prediction tasks (surface normal estimation and semantic segmentation)

Note that the first two categories include benchmarks with a large domain shift from the source data, ImageNet. The latter two categories differ from the setting used at pre-training time, so the optimal features may differ from those for image recognition.

About Many-Shot Recognition

Experimental setup

The data set used in the experiment is as follows.

We evaluate these data sets in two different settings: linear and fine-tuning.

In the Linear setting, we fit multinomial logistic regression to the features extracted by the backbone.
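To make the Linear setting concrete, here is a minimal sketch of multinomial logistic regression fit on fixed features. It is an illustration only: the 2-D "features", the learning rate, the number of epochs, and the use of plain batch gradient descent are assumptions made for this example, not the configuration used in the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_logits(weights, biases, x):
    """One linear score per class: w_c . x + b_c."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]

def fit_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit multinomial logistic regression on fixed features (batch gradient descent)."""
    dim = len(features[0])
    n = len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(class_logits(weights, biases, x))
            for c in range(num_classes):
                # Gradient of the cross-entropy loss w.r.t. the logit of class c.
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err / n
                for j in range(dim):
                    grad_w[c][j] += err * x[j] / n
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c]
            for j in range(dim):
                weights[c][j] -= lr * grad_w[c][j]
    return weights, biases

def predict(weights, biases, x):
    """Return the index of the class with the highest score."""
    scores = class_logits(weights, biases, x)
    return max(range(len(scores)), key=lambda c: scores[c])

# Hypothetical 2-D stand-ins for features extracted by a frozen backbone.
train_x = [[2.0, 0.1], [1.8, -0.2], [-2.0, 0.0], [-1.9, 0.3], [0.0, 2.1], [0.2, 1.9]]
train_y = [0, 0, 1, 1, 2, 2]
W, b = fit_linear_probe(train_x, train_y, num_classes=3)
train_preds = [predict(W, b, x) for x in train_x]
```

The backbone never appears in the training loop: only the per-class weights and biases are updated, which is what distinguishes this setting from fine-tuning.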

In the fine-tuning setting, the whole network is trained for 5,000 steps using SGD with Nesterov momentum.
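For readers unfamiliar with Nesterov momentum, the update rule can be sketched as follows. This uses the look-ahead formulation on a toy quadratic objective; the learning rate, momentum value, step count, and the objective itself are assumptions for illustration and are not taken from the paper.

```python
def nesterov_sgd(grad_fn, theta, lr=0.05, momentum=0.9, steps=300):
    """Minimise a function with gradient descent plus Nesterov momentum (look-ahead form)."""
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        # Evaluate the gradient at the look-ahead point theta + momentum * velocity,
        # rather than at theta itself as classical momentum would.
        lookahead = [t + momentum * v for t, v in zip(theta, velocity)]
        grad = grad_fn(lookahead)
        velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
        theta = [t + v for t, v in zip(theta, velocity)]
    return theta

def toy_grad(theta):
    """Gradient of f(x, y) = (x - 3)^2 + 2 * (y + 1)^2, a stand-in for a training loss."""
    x, y = theta
    return [2.0 * (x - 3.0), 4.0 * (y + 1.0)]

solution = nesterov_sgd(toy_grad, [0.0, 0.0])
```

In real fine-tuning the gradient would come from a mini-batch of the downstream dataset and `theta` would hold all backbone and head parameters; deep learning frameworks may implement an algebraically rearranged version of the same rule.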

Experimental results

The results are as follows. Bolded text indicates the first place result and underlined text indicates the second-place result.


With the Linear setting, the following results were obtained.

  • For all downstream tasks except the Pets task, the best self-supervised learning method outperformed the supervised pre-training model in ImageNet.
  • On ImageNet itself, supervised learning performed best. Combined with the point above, this suggests that the self-supervised learning methods learn more generic feature representations.
  • DeepCluster-v2, BYOL, and SwAV generally rank high on the list.


With the fine-tuning setting, the following results were obtained.

  • Supervised learning came out on top in three of the downstream tasks, a better showing than in the Linear setting.
  • DeepCluster-v2, SwAV, and SimCLR generally performed well, and overall the best self-supervised learning methods outperformed supervised learning.

About Few-Shot Recognition

Experimental setup

The datasets used in the experiment are as follows.

  • The same data set as for Many-Shot recognition, except for Pascal VOC2007.
  • Broader Study of Cross-Domain Few-Shot Learning (CD-FSL: a benchmark consisting of the following four datasets)
  •   CropDiseases
  •   EuroSAT
  •   ISIC2018
  •   ChestX

The four datasets included in the CD-FSL consist of images with low similarity to natural images.

In the experiments, Prototypical Networks are applied to the features extracted by the backbone.
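The core idea of Prototypical Networks can be sketched in a few lines: each class is represented by the mean of its support features, and a query is assigned to the nearest prototype. The sketch below assumes Euclidean distance and uses made-up 2-D features and class names; it illustrates the technique and is not the paper's evaluation code.

```python
import math

def class_prototypes(support_features, support_labels):
    """Average the support features of each class to get one prototype per class."""
    sums, counts = {}, {}
    for feat, label in zip(support_features, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feat)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def classify(query, prototypes):
    """Assign the query to the class whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))

# Hypothetical 2-way 2-shot episode with 2-D stand-ins for backbone features.
support_x = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
support_y = ["cat", "cat", "dog", "dog"]
prototypes = class_prototypes(support_x, support_y)
prediction = classify([0.7, 0.3], prototypes)
```

A 5-way 20-shot episode works the same way, with five classes and twenty support features per class.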

Experimental results

The experimental results for the 5-way 20-shot configuration (excluding CD-FSL) are shown in the table below.

  • For all datasets except DTD/Flowers, the supervised learning model showed the top results.
  • In Aircraft/Cars, supervised learning showed the top results by a particularly large margin.
  • Among the self-supervised learning methods, BYOL and DeepCluster-v2 are the best, with SwAV/SimCLR-v2 in the next best position.

On the other hand, the results in CD-FSL are as follows.

  • For all four data sets, several self-supervised learning models outperformed the supervised learning models.
  • For CropDiseases, which is the most similar to ImageNet, the same models that did well in Many-Shot recognition performed well.
  • PCL-v1 was consistently the worst performer.
  • The results from ISIC show a very different ranking of each method compared to the other datasets.

About object detection

Experimental setup

We use Pascal VOC as the dataset and Faster R-CNN with a Feature Pyramid Network as the detector, with the pre-trained model serving as its backbone.

We also experiment with two different settings: freezing the backbone (all but the last residual block) (Frozen) and fine-tuning all layers end-to-end (Fine-tune).

Experimental results

The results are as follows.

  • The best self-supervised learning methods outperformed the supervised learning models.
  • However, the models that showed superior results were quite different from those in the Many/Few-Shot recognition tasks.
  • SimCLR showed the best results in the Frozen setting and BYOL in the Fine-tune setting.

About dense prediction tasks

Experimental setup

NYUv2 was used as the dataset for surface normal estimation, and PSPNet was trained using ResNet50 as the backbone.

We also used ADE20K as the data set for semantic segmentation and trained it using UPerNet.

Experimental results

The results are as follows.

  • For both tasks, the best self-supervised learning methods outperformed the supervised learning models.
  • SimCLR-v2 and BYOL performed better for surface normal estimation, while PCL-v1 performed better for semantic segmentation.
  • We found little correlation between the performance of self-supervised learning in semantic segmentation and its performance on ImageNet.

Does the performance improvement in ImageNet lead to performance improvement in downstream tasks?

Performance on ImageNet has been the main benchmark in the evaluation of self-supervised learning. So, does performance on ImageNet have a clear correlation with performance on downstream tasks?

To answer this question, the correlation between performance on ImageNet and performance on each target task was examined. The results are as follows.

("Kornblith" in the figure denotes the datasets used for Many-Shot recognition.)

A plot of the performance of each method for each data set is also shown below.

From these results, we can see that

  • For Many-Shot recognition, the correlation between ImageNet and downstream performance was high.
  • In Few-Shot recognition, the correlation was higher when the domain shift was smaller and weaker when the domain shift was larger.
  • For object detection, AP50 showed the highest correlation, with the Frozen setting showing a stronger correlation than the Fine-tune setting.
  • Weak correlations were consistently found for surface normal estimation.
  • For semantic segmentation, the correlations were generally weak and no correlation was found for the ranking of each method.
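As a rough illustration of how such correlations can be computed, the sketch below implements Pearson correlation on raw scores and Spearman correlation on rankings. All numbers are invented for the example, and the exact correlation procedure used in the paper may differ.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def ranks(values):
    """Ranks starting at 1 for the smallest value; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical scores for four pre-trained models (not values from the paper).
imagenet_acc = [60.0, 65.0, 70.0, 75.0]
task_a = [70.0, 74.0, 79.0, 83.0]   # tracks ImageNet accuracy closely
task_b = [40.0, 38.0, 41.0, 39.0]   # ordering unrelated to ImageNet accuracy

r_a = pearson(imagenet_acc, task_a)
rho_a = spearman(imagenet_acc, task_a)
rho_b = spearman(imagenet_acc, task_b)
```

In this toy data, `task_a` plays the role of a task where picking the best ImageNet model is a safe choice, while `task_b` plays the role of a task where the ImageNet ranking carries no information about the downstream ranking.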

In general, the practical findings can be summarized as follows.

  • For recognition tasks that do not have a large domain shift from ImageNet, it is effective to use methods that show excellent results in ImageNet for downstream tasks, regardless of whether they are Many/Few-Shot, and there is a possibility that self-supervised learning will surpass supervised learning.
  • For object detection and dense prediction tasks, there are self-supervised learning methods such as SimCLR-v2 and BYOL that can show excellent results, but the correlation between ImageNet and downstream tasks is not always high, so the best model in ImageNet is not necessarily effective for downstream tasks.
  • For datasets with large domain shifts from ImageNet (including unstructured images and textures), there is no clear rationale to choose a self-supervised learning method and a task-by-task comparison should be made.

The ranking of each method in the various downstream tasks was as follows.

It can be seen that although there are methods that show relatively good results overall, a generic method that shows the best results for all downstream tasks has not yet been realized.
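One simple way to summarise rankings across tasks is the average rank of each method. The sketch below assumes higher scores are better and ignores ties; the method names, task names, and scores are hypothetical placeholders, and the paper's own ranking procedure may differ. For error metrics where lower is better, the scores would need to be negated first.

```python
def average_ranks(scores_by_task):
    """Map each method to its mean rank over tasks (rank 1 = best score on a task)."""
    collected = {}
    for task_scores in scores_by_task.values():
        ordered = sorted(task_scores, key=task_scores.get, reverse=True)
        for position, method in enumerate(ordered, start=1):
            collected.setdefault(method, []).append(position)
    return {method: sum(r) / len(r) for method, r in collected.items()}

# Hypothetical scores for three methods on three tasks.
scores = {
    "recognition": {"method_a": 0.90, "method_b": 0.80, "method_c": 0.70},
    "detection": {"method_a": 0.60, "method_b": 0.70, "method_c": 0.50},
    "segmentation": {"method_a": 0.80, "method_b": 0.70, "method_c": 0.90},
}

mean_rank = average_ranks(scores)
best_overall = min(mean_rank, key=mean_rank.get)
winners = {task: max(s, key=s.get) for task, s in scores.items()}
```

Here `method_a` has the best average rank, yet each task has a different winner, mirroring the observation that a good overall method is not necessarily the best on every task.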


Summary

In this article, we have presented a paper that evaluates various self-supervised learning methods on a variety of downstream tasks.

The results clearly show that the best current self-supervised representation learning methods can outperform supervised learning. Correlations between performance on ImageNet and performance on various downstream tasks were also checked, and it was found that depending on the task and the distribution of the dataset, there may be no clear correlation or a weak correlation.

However, the paper does not include evaluation experiments on domain-specific self-supervised learning methods trained on each target dataset, which is left as future work.
