Revisiting Weakly Supervised Pre-training In Visual Recognition Models

Image Recognition

3 main points
✔️ Validates weakly supervised learning with hashtag supervision
✔️ Compares weakly supervised learning with supervised and self-supervised learning
✔️ Significantly outperforms self-supervised learning in various transfer learning settings

Revisiting Weakly Supervised Pre-Training of Visual Perception Models
written by Mannat Singh, Laura Gustafson, Aaron Adcock, Vinicius de Freitas Reis, Bugra Gedik, Raj Prateek Kosaraju, Dhruv Mahajan, Ross Girshick, Piotr Dollár, Laurens van der Maaten
(Submitted on 20 Jan 2022 (v1), last revised 2 Apr 2022 (this version, v2))
Comments: CVPR 2022

Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Pre-training is a crucial element in computer vision tasks. Among pre-training approaches, supervised pre-training on ImageNet has become the de facto standard, but recent studies have shown that large-scale weakly supervised pre-training can outperform supervised methods.

In the paper presented in this article, the authors measure the performance of their hashtag-based weakly supervised pre-training method using state-of-the-art networks and large datasets, and compare it with existing methods.

As a result, the weakly supervised pre-training model significantly outperforms self-supervised learning models in various transfer learning settings, demonstrating the effectiveness of weakly supervised pre-training.

Weakly supervised pre-training with hashtags

The weakly supervised pre-training method tested in the paper is based on hashtag supervision: the task is to predict the hashtags that an image's contributor attached to it.

This task differs from the general image classification task in the following ways:

  • Hashtags are inherently noisy.
  • Hashtag usage follows a Zipfian distribution.
  • Hashtags are inherently multi-labeled, typically with multiple hashtags for a single image.

Collecting Hashtag Datasets

The dataset used for training was built by collecting photos and hashtags published on Instagram.

This procedure consists of four steps.

  • Select and normalize frequently used hashtags to build a hashtag vocabulary.
  • Collect public images tagged with at least one hashtag from the vocabulary.
  • Combine the collected images and their associated hashtags into labeled examples usable for pre-training.
  • Resample the examples to obtain the desired hashtag distribution.

The fourth step, resampling, reduces the proportion of frequent hashtags and increases the proportion of infrequent ones. This is achieved by sampling with probability proportional to the inverse square root of hashtag frequency (so an image with a low-frequency hashtag may appear multiple times in one epoch).
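The inverse-square-root resampling can be sketched as follows. This is a minimal toy illustration; the function name and counts below are made up, not taken from the paper:

```python
import numpy as np

def resampling_weights(hashtag_counts):
    """Sampling weight per hashtag, proportional to 1/sqrt(frequency)."""
    counts = np.asarray(hashtag_counts, dtype=float)
    weights = 1.0 / np.sqrt(counts)
    return weights / weights.sum()  # normalize to a probability distribution

# Toy Zipf-like counts: one very frequent tag, one common tag, one rare tag.
counts = [1_000_000, 10_000, 100]
w = resampling_weights(counts)
# The rare tag's sampling probability is boosted relative to its raw share,
# so images carrying it can appear multiple times in one epoch.
```

The square root makes the reweighting milder than full inverse-frequency sampling: rare tags are boosted, but frequent tags still dominate the epoch.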

This procedure resulted in a large dataset of 3.6 billion images; the authors name the full-size dataset IG-3.6B.

Pre-training

Among the architectures considered in preliminary experiments (ResNeXt, RegNetY, DenseNet, EfficientNet, and ViT), the experiments in this study focus on RegNetY and ViT, which performed best.

During pre-training, a linear classifier over $|C| \approx 27k$ classes is attached to the model's output, and the model is trained to minimize the cross-entropy between the softmax output probabilities and the target distribution. (See Sec. 3.2 of the original paper for hyperparameters and other details.)
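A minimal sketch of this pre-training loss, assuming a target distribution that is uniform over the image's hashtags (a common choice for hashtag supervision; the exact target construction is in Sec. 3.2 of the paper, and the function names here are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def hashtag_xent(logits, positive_idx):
    """Cross-entropy between softmax predictions and a target distribution
    that is uniform over the image's hashtags (multi-label soft target)."""
    target = np.zeros_like(logits)
    target[positive_idx] = 1.0 / len(positive_idx)
    p = softmax(logits)
    return -np.sum(target * np.log(p + 1e-12))

logits = np.array([2.0, 0.5, -1.0, 0.1])   # toy head over |C| = 4 hashtags
loss = hashtag_xent(logits, [0, 1])        # image tagged with hashtags 0 and 1
```

Spreading the target mass uniformly over all of an image's hashtags is one way to handle the multi-label nature of the task while still using a single softmax head.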

Experimental Setup

In the experiments, different types of transfer learning in image classification are considered.

Specifically, these are (1) transfer learning with linear classifiers, (2) transfer learning with fine-tuning, (3) zero-shot transfer learning, and (4) few-shot transfer learning. Experiments are also performed comparing the weakly supervised method with fully supervised and self-supervised learning.
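Setting (1), the linear probe, can be sketched as follows: the pre-trained backbone is frozen, and only a linear classifier is trained on its output features. Random vectors stand in for backbone embeddings here; nothing below comes from the paper's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen backbone features (e.g. RegNetY / ViT embeddings).
X = rng.normal(size=(200, 16))            # 200 samples, 16-dim features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary downstream labels

# Linear probe: only W and b are trained; the features X never change.
W = np.zeros((16, 2))
b = np.zeros(2)
for _ in range(300):                      # plain softmax regression by GD
    logits = X @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = (p - np.eye(2)[y]) / len(X)    # d(loss)/d(logits), averaged
    W -= 0.1 * X.T @ grad
    b -= 0.1 * grad.sum(axis=0)

acc = ((X @ W + b).argmax(axis=1) == y).mean()
```

Because only the linear head is optimized, this setting directly measures how linearly separable the downstream classes are in the pre-trained feature space.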


The pre-training dataset was described earlier. The transfer learning experiments use the following downstream datasets.

  • ImageNet1k
  • ImageNet5k
  • iNaturalist 2018
  • Places365-Standard
  • Caltech-UCSD Birds-200-2011(CUB-2011)

Please refer to Sec. 4.1 of the original paper for the fine-tuning hyperparameters and other details.

Experimental Results

Comparison with Supervised Pre-training

To begin with, the comparison with supervised pre-trained models (EfficientNet and ViT) is as follows.

As shown in the table, the weakly supervised models performed well, ranking first or second on all five downstream datasets. The trade-off between throughput and classification accuracy is illustrated graphically in the following results.

Comparing supervised pre-trained EfficientNet against RegNetY and ViT weakly supervised pre-trained on the IG-3.6B dataset, ViT achieves the highest classification accuracy, while RegNetY shows the best trade-off between accuracy and throughput.

Comparison with Self-Supervised Pre-training

Weakly supervised pre-training on billions of images is found to yield performance comparable to supervised learning. This result raises the question of whether weakly supervised learning has an advantage over self-supervised learning, which is easier to scale up.

To answer this question, the authors compare against SimCLRv2, SEER, and BEiT. SEER is a particularly important comparison for the training paradigm, as it is also trained on Instagram images.

The results on ImageNet-1k are as follows.

As shown in the table, the weakly supervised model performs significantly better than state-of-the-art self-supervised learning, especially when the number of labeled samples is small (1% and 10%). (Note that these results are taken from the literature, and the observed gap may change as the pre-training dataset size of the compared models is increased.)

Zero-shot Transfer Learning

Weakly supervised models have the advantage of observing a wide variety of visual concepts during pre-training. Based on this, experiments on zero-shot transfer learning test their ability to rapidly recognize novel visual concepts. The results are as follows.
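One way such zero-shot transfer can work is by mapping downstream class names to hashtags in the pre-training vocabulary and restricting predictions to the mapped classes. The mapping and names below are a hypothetical illustration, not the paper's exact procedure:

```python
import numpy as np

# Hypothetical hand-made mapping from downstream class names to hashtag
# indices in the pre-training head (e.g. "cat" -> "#cat", "dog" -> "#dog").
class_to_tag = {"cat": 0, "dog": 1}

def zero_shot_predict(tag_logits, class_to_tag):
    """Pick the downstream class whose mapped hashtag scores highest,
    ignoring all hashtags with no corresponding downstream class."""
    classes = list(class_to_tag)
    scores = [tag_logits[class_to_tag[c]] for c in classes]
    return classes[int(np.argmax(scores))]

# Logits from a toy pre-trained 4-hashtag head for one image.
pred = zero_shot_predict(np.array([0.2, 1.5, -0.3, 0.9]), class_to_tag)
# -> "dog", since index 1 carries the largest logit among the mapped tags
```

No downstream training happens here: the pre-trained hashtag head is reused as-is, which is what makes the setting zero-shot.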

While it is important to note that many factors are different, the proposed weakly supervised model performs very well, suggesting that weakly supervised learning methods offer a promising path toward open-world visual recognition models.


Summary

The paper compares weakly supervised pre-training in image recognition with supervised and self-supervised learning, and demonstrates the superiority of weakly supervised learning.

However, several limitations make it difficult to conduct controlled experiments that isolate the effect of a given variable: complex training procedures, uniquely collected datasets, and the enormous computational cost of replicating existing studies. The paper also cites the difficulty of comparing methods as a challenge, along with factors that common metrics cannot capture, such as the possibility that weakly supervised learning may reflect harmful stereotypes.

Overall, the results show that weakly supervised learning methods can perform very well in image recognition, although comparisons between different methods have some limitations.
