
Vision Transformers Are The Future.


3 main points
✔️ A new self-supervised learning algorithm for computer vision
✔️ Highly compatible with vision transformers (and CNNs)
✔️ Outperforms supervised learning and other SSL algorithms on several datasets

Emerging Properties in Self-Supervised Vision Transformers
written by Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
(Submitted on 29 Apr 2021 (this version), latest version 24 May 2021 (v2))
Comments: accepted by arXiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)



Transformers have performed very well even in the field of computer vision. The performance of the Vision Transformer (ViT) is comparable to state-of-the-art CNN models. However, ViTs are computationally more expensive and more data-hungry, so they have yet to prove that they are a better choice than CNNs. Transformers in NLP are trained on large corpora using self-supervised methods, whereas vision models are usually trained on supervised data. We believe that image-level supervision reduces the visual information contained in an image to a single concept and limits the ability of the transformers.

In this paper, we introduce vision transformers trained using self-supervised methods, which have useful properties that are absent in supervised ViTs and CNNs. The images above were produced by a vision transformer trained with no supervision! Our model automatically learns class-specific features, which amounts to unsupervised object segmentation. The features produced by the self-supervised transformer also perform well with a simple k-NN classifier, without any fine-tuning or data augmentation: this simple classifier achieves 78.3% top-1 accuracy on ImageNet.


The name of our method, DINO, stands for "distillation with no labels". Knowledge distillation is the process of training a student network (gθs) using the predictions generated by a teacher network (gθt) as training labels. For any input image x, both networks generate a probability distribution over K dimensions, denoted by Ps and Pt. These distributions are obtained by normalizing the network outputs with a softmax function with temperature T. The objective is to minimize, with respect to the student parameters θs, the cross-entropy between the teacher and student distributions:

$$\min_{\theta_s} H(P_t(x), P_s(x))$$

where H(a, b) = -a log(b).
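
As a minimal sketch of the objective described above, the snippet below computes a temperature softmax and the cross-entropy H(a, b) for a single image in plain Python. The logits and the two temperature values (0.04 for the teacher, 0.1 for the student) are illustrative assumptions, not figures taken from this article.

```python
import math

def softmax(logits, temperature):
    """Turn raw network outputs into a probability distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p_teacher, p_student):
    """H(a, b) = -sum_k a_k * log(b_k), summed over the K dimensions."""
    return -sum(a * math.log(b) for a, b in zip(p_teacher, p_student))

# Hypothetical K = 3 outputs of the two networks for one image x.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.5]

p_t = softmax(teacher_logits, temperature=0.04)  # assumed teacher temperature
p_s = softmax(student_logits, temperature=0.1)   # assumed student temperature
loss = cross_entropy(p_t, p_s)
print(loss)
```

The loss is smallest when the student distribution matches the teacher distribution, which is what gradient descent on θs pushes toward.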

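First, a set V of several distorted views/crops is generated from an image x. V contains two global views {x1g, x2g} (each covering more than 50% of the image area) and several smaller local views. All views are passed through the student, while only the global views are passed through the teacher. Therefore, we minimize

$$\min_{\theta_s} \sum_{x \in \{x_1^g, x_2^g\}} \; \sum_{x' \in V,\; x' \neq x} H(P_t(x), P_s(x'))$$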
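
The multi-crop objective can be sketched as follows, assuming every view is fed to the student and only the global views are fed to the teacher. The logits, the temperatures, and the choice to average over pairs instead of summing are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(a, b):
    return -sum(ai * math.log(bi) for ai, bi in zip(a, b))

def multicrop_loss(teacher_global, student_all, t_teacher=0.04, t_student=0.1):
    """Average H(P_t(x), P_s(x')) over pairs with x global and x' != x.

    teacher_global: teacher logits for the global views only.
    student_all: student logits for every view; its first
        len(teacher_global) entries correspond to the same global views.
    """
    total, pairs = 0.0, 0
    for i, t_logits in enumerate(teacher_global):
        p_t = softmax(t_logits, t_teacher)
        for j, s_logits in enumerate(student_all):
            if i == j:  # skip pairs where both networks see the same view
                continue
            total += cross_entropy(p_t, softmax(s_logits, t_student))
            pairs += 1
    return total / pairs

# Hypothetical logits (K = 3) for 2 global views and 2 local views.
teacher_global = [[2.0, 0.5, -1.0], [1.8, 0.4, -0.8]]
student_all = [[1.5, 0.7, -0.5], [1.6, 0.6, -0.6],   # global views
               [1.0, 0.9, -0.2], [0.8, 1.0, 0.1]]    # local views
loss = multicrop_loss(teacher_global, student_all)
print(loss)
```

With 2 global views and 4 views in total, the loop covers 2 × 3 = 6 teacher/student pairs.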

The teacher and student networks have the same architecture but different parameters. In standard knowledge distillation, the teacher model is usually trained on supervised data, but we only use unlabeled data in our case. Through experiments, we found that updating the teacher's parameters as an exponential moving average of the student's parameters works quite well: θt ← λθt + (1 − λ)θs. Here λ follows a cosine schedule that increases from 0.996 to 1 during training.
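
A minimal sketch of this update rule, with parameters represented as flat Python lists. The function names and the toy values are hypothetical; only the update formula and the 0.996 to 1 range come from the text.

```python
import math

def momentum_at(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule rising from `base` at step 0 to `final` at the last step."""
    progress = step / total_steps
    return final - (final - base) * (math.cos(math.pi * progress) + 1) / 2

def ema_update(teacher_params, student_params, lam):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, element-wise."""
    return [lam * t + (1 - lam) * s
            for t, s in zip(teacher_params, student_params)]

# Toy run: the student is held fixed so the teacher's drift is easy to see.
teacher = [1.0, 2.0]
student = [0.0, 0.0]
total_steps = 100
for step in range(total_steps):
    teacher = ema_update(teacher, student, momentum_at(step, total_steps))
print(teacher)
```

Because λ approaches 1, the teacher changes more and more slowly as training progresses, and it stops moving entirely when λ reaches 1.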


Each network consists of a backbone f, which is either a ViT or a ResNet, followed by a projection head. The features generated by the backbone are passed through a 3-layer MLP with a hidden dimension of 2048, followed by L2 normalization and a weight-normalized fully connected layer with K output dimensions. Unlike standard CNNs, ViT does not use batch normalization (BN), and neither does the MLP head. Therefore, when DINO is applied to a ViT, the whole model is BN-free.
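
The last two steps of the head (L2 normalization, then a weight-normalized fully connected layer) can be sketched in plain Python as below. The 3-layer MLP is omitted for brevity, the bias term is left out, and the feature values, direction vectors, and gains are hypothetical.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def weight_norm_linear(x, directions, gains):
    """Linear layer whose k-th weight is gains[k] * directions[k] / ||directions[k]||."""
    out = []
    for v, g in zip(directions, gains):
        w = [g * c for c in l2_normalize(v)]
        out.append(sum(wi * xi for wi, xi in zip(w, x)))
    return out

features = [3.0, 4.0]                               # stand-in for the MLP output
directions = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # K = 3 weight directions
gains = [1.0, 1.0, 1.0]                             # per-output scale

logits = weight_norm_linear(l2_normalize(features), directions, gains)
print(logits)
```

With unit gains, each output is the cosine similarity between the normalized feature and one weight direction, so the outputs stay within [-1, 1].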

Avoiding Collapse

Models trained using self-supervised algorithms are prone to collapse, i.e. to "cheating" with a trivial solution. For example, both models could predict a uniform output across all dimensions, or the outputs could be dominated by just one dimension. SSL methods use contrastive losses, clustering constraints, and other techniques to prevent such a collapse. Although those methods can also be applied to DINO, a simpler approach of just centering and sharpening the momentum teacher outputs works well.


As shown in the algorithm above, only the teacher network's outputs are centered and sharpened, and the center 'c' is updated as an exponential moving average of the teacher's outputs. Sharpening is obtained by using a low temperature (tpt) in the teacher's softmax. Centering prevents collapse to a single dominant dimension, but it pushes the output toward a uniform distribution across all dimensions. Sharpening, on the other hand, prevents uniform predictions but encourages one dimension to dominate. The two operations therefore complement each other.
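
A minimal sketch of centering and sharpening, applied to teacher outputs only. The temperature (0.04), the center momentum (0.9), and the toy logits are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_probs(logits, center, temperature=0.04):
    """Center the teacher logits, then sharpen with a low-temperature softmax."""
    centered = [z - c for z, c in zip(logits, center)]
    return softmax(centered, temperature)

def update_center(center, batch_logits, momentum=0.9):
    """EMA of the batch-mean teacher output: c <- m * c + (1 - m) * mean."""
    n = len(batch_logits)
    batch_mean = [sum(col) / n for col in zip(*batch_logits)]
    return [momentum * c + (1 - momentum) * b
            for c, b in zip(center, batch_mean)]

logits = [5.0, 1.0]
sharp_only = teacher_probs(logits, center=[0.0, 0.0])   # one dimension dominates
fully_centered = teacher_probs(logits, center=logits)   # uniform output
new_center = update_center([0.0, 0.0], [[1.0, 0.0], [3.0, 0.0]])
print(sharp_only, fully_centered, new_center)
```

The two extreme cases show the opposing effects: sharpening alone concentrates nearly all probability on one dimension, while a center equal to the logits flattens the output to uniform.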

Experiment and Evaluation

All the models have been trained on the ImageNet dataset. We primarily use three types of models: ViT, ResNet (RN), and DeiT. Following the standard protocols for self-supervised learning, we test the trained models either by freezing the model parameters and training a linear classifier, or by fine-tuning the model on downstream tasks. Both of these methods were found to be sensitive to hyperparameter changes. Therefore, we also evaluate the frozen models with a k-NN classifier with k=20.
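
The k-NN evaluation on frozen features can be sketched as below. This version uses cosine similarity and a plain majority vote, which are assumptions (the exact voting scheme is not described here), and the features and labels are toy values.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, train_features, train_labels, k=20):
    """Label a query by majority vote among its k most similar training features."""
    scored = sorted(
        ((cosine(query, f), label)
         for f, label in zip(train_features, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Toy frozen features; in practice these would come from the trained backbone.
train_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["cat", "cat", "dog", "dog"]

prediction = knn_predict([1.0, 0.2], train_features, train_labels, k=3)
print(prediction)
```

No parameters are trained in this evaluation, which is why it avoids the hyperparameter sensitivity of linear probing and fine-tuning.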

The above table shows the results of models trained with various SSL algorithms. DINO is effective with both transformers (DeiT, ViT) and CNNs (RN). The table at the bottom shows that reducing the patch size to 8x8 gives the best result (80.1%). Smaller patches do not add parameters, but they do increase the cost of a forward pass. Even so, ViT-B/8 is 1.4x faster and has 10x fewer parameters than the previous state of the art (SCLRv2). DINO also surpasses supervised training and other SSL methods on tasks such as video object segmentation, copy detection, image retrieval, and transfer learning. Please refer to the original paper for more details on those experiments.


The paper shows that DINO is an effective method to train vision transformers. It is a flexible algorithm that works well with all types of models, data augmentation, and collapse prevention techniques. The pre-trained BERT model can be fine-tuned for a variety of tasks in NLP. DINO could help build a BERT-like model for computer vision using large amounts of raw image data, enabling a single vision transformer to perform well on a small dataset across a large range of tasks.

Thapa Samrat
I am a second year international student from Nepal who is currently studying at the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning. So I write articles about them in my spare time.
