Why Are Vision Transformers So High Performance?


3 main points
✔️ ViT has a more uniform representation (features) in all layers. In other words, the representation in each layer is similar.
✔️ ViT can aggregate global information early due to self-attention.
✔️ ViT propagates representations strongly from lower to higher layers.

Do Vision Transformers See Like Convolutional Neural Networks?
written by Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy
(Submitted on 19 Aug 2021 (v1), last revised 3 Mar 2022 (this version, v2))
Comments: Published on arxiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Convolutional neural networks (CNN) have been the de facto model for visual data. However, in recent years, Vision Transformer (ViT) has dominated the field of image classification tasks.

Why did ViT perform so well on image tasks? In this paper, we analyzed the internal representation of ViT and CNN to search for differences between the two architectures.

We found that the self-attention mechanism of ViT enables early consolidation of global representations and that residual connections (skip connections), which strongly propagate representations from lower to higher layers, play an important role.


CNNs have been dominant in image tasks for the past few years. This is largely because convolution preserves spatial features from layer to layer. CNNs were also readily available for transfer learning and could be used to obtain general-purpose visual representations. However, recent studies have shown that ViT outperforms CNNs in image classification tasks, as it can aggregate global features by using the self-attention mechanism of the Transformers used in natural language processing. This is in contrast to CNNs, which aggregate information through the inductive bias of convolution, and differs from previously reported ways to improve CNNs.

In this paper, we analyze how ViT performs image tasks as follows.

  • ViT has different overall features than ResNet because it captures global features at lower layers.
  • However, capturing local features in the lower layers is still important, and even ViT learns this local attention in its lower layers when trained on large datasets.
  • The skip connections of ViT are more influential than those of ResNet and have a significant impact on the performance and the representations of the model.
  • Considering the use of ViT for object detection instead of classification tasks, we analyzed the extent to which the input spatial information is preserved.
  • The effect of dataset size on transfer learning is investigated using the linear probe method, and its importance for high-quality intermediate representations is revealed.

Related Work

The development of Transformers for image tasks is an active research area. Prior work has analyzed how attention captures local features by combining CNNs and attention or by reducing the image size. However, there has been little research on devising ViT architectures, and even less work on comparing ViT with CNNs (although this has been reported for text tasks).

Background and experimental setting

In this study, we compare CNN and ViT to see if there is a difference in the way they solve image tasks. Here we use ResNet as the CNN. Unless otherwise specified, the dataset is JFT-300M. See Appendix A for details.

Representation similarity and CKA

For hidden-layer representation analysis we use centered kernel alignment (CKA). The inputs are the representations (activation matrices) X and Y of two layers for the same images. We define K and L as follows.

$$K = XX^\top, \qquad L = YY^\top$$

These are called Gram matrices. CKA is calculated as follows.

$$\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}}$$

HSIC is a measure called the Hilbert-Schmidt independence criterion.
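As a rough illustration of the CKA computation described above, here is a minimal NumPy sketch. It assumes linear-kernel Gram matrices (K = XX^T, L = YY^T) and the standard centered HSIC estimator; the function names `gram_linear`, `hsic`, and `cka` are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def gram_linear(acts: np.ndarray) -> np.ndarray:
    """Linear-kernel Gram matrix for activations of shape (m examples, p features)."""
    return acts @ acts.T


def hsic(gram_x: np.ndarray, gram_y: np.ndarray) -> float:
    """Empirical HSIC of two (m, m) Gram matrices (assumed estimator)."""
    m = gram_x.shape[0]
    # Centering matrix H = I - (1/m) * 1 1^T
    h = np.eye(m) - np.ones((m, m)) / m
    centered_x = h @ gram_x @ h
    centered_y = h @ gram_y @ h
    # Inner product of the centered Gram matrices, normalized by (m - 1)^2
    return float(np.sum(centered_x * centered_y) / (m - 1) ** 2)


def cka(acts_x: np.ndarray, acts_y: np.ndarray) -> float:
    """CKA between two layers' activations computed on the same m inputs."""
    k = gram_linear(acts_x)
    l_mat = gram_linear(acts_y)
    return hsic(k, l_mat) / np.sqrt(hsic(k, k) * hsic(l_mat, l_mat))
```

Under these assumptions, a layer compared with itself scores 1, and the score does not change when one representation is rotated or uniformly rescaled, which is what makes it usable for comparing layers of different widths or different architectures.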

Figure 1 plots the CKA between layers. We use two types of ViT, ViT-L/16 and ViT-H/14, and likewise two types of ResNet. Let's look at the diagonals first. The diagonals are white for both ViT and ResNet because CKA is maximized when a layer is compared with itself.

In ViT, you can see that the overall color is orange to white. This means that the outputs of the lower and upper layers are similar. On the other hand, when we look at ResNet, the plot looks like a checkered pattern of orange and purple. This is because the similarity between layers is uneven, with some pairs of layers having dissimilar representations (purple) and others having similar representations (orange).

This reveals that the representations acquired by ViT are similar in the lower and upper layers. Neural networks generally learn local features first and build up global features as they move toward the upper layers. Therefore, ViT can be considered to capture global features from the beginning; conversely, it can be said that ViT does not acquire very different representations as layers are stacked.

In Figure 2, unlike Figure 1, we plot the similarity of the representations with ViT and ResNet as the two axes. The similarity appears to be high between layers 0-20 of ViT and layers 0-60 of ResNet. The same is true for layers 40-60 of ViT and layers 80-120 of ResNet. Conversely, there is no similarity in the upper layers of the two models.

Taken together with Figure 1, this suggests that ViT and ResNet have different styles of image abstraction.

Local and global information in layer representation

The self-attention layer is structurally very different from a CNN layer. It consists of multi-head attention, and for each head we can compute the distance between the query position and the positions it attends to. This gives us an idea of the extent to which the self-attention layer aggregates local and global information.

Above is a plot of this distance averaged over 5,000 data points. There are two groups of entries, the lower layers (block0, block1) and the upper layers (block22, block23). It can be seen that the lower layers contain heads with small mean distances as well as heads with large ones, i.e., both local and global information is obtained, while the upper layers attend almost entirely to global information.
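The mean attention distance of a single head can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the attention matrix covers only the image patches (any class token removed), its rows sum to 1, and distances are measured in pixels between patch positions. The names `patch_coordinates` and `mean_attention_distance` are hypothetical.

```python
import numpy as np


def patch_coordinates(grid_size: int, patch_size: int) -> np.ndarray:
    """Pixel (row, col) position of each patch on a grid_size x grid_size grid."""
    rows, cols = np.meshgrid(
        np.arange(grid_size), np.arange(grid_size), indexing="ij"
    )
    return np.stack([rows.ravel(), cols.ravel()], axis=1) * float(patch_size)


def mean_attention_distance(
    attn: np.ndarray, grid_size: int, patch_size: int
) -> float:
    """Attention-weighted query-to-key distance, averaged over all queries.

    attn has shape (num_patches, num_patches); row i holds the weights that
    query patch i assigns to every key patch.
    """
    coords = patch_coordinates(grid_size, patch_size)
    # Pairwise Euclidean distances between patch positions, shape (N, N)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Expected distance under each query's attention distribution
    per_query = np.sum(attn * dist, axis=1)
    return float(per_query.mean())
```

A head that attends only to its own patch gets a distance of 0 (purely local), while a head that spreads attention uniformly over the image gets a large value (global). Averaging this quantity over many images gives one point per head in a plot like the one described above.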

Taken together, these results suggest that (i) the lower layers of ViT have different representations than the lower layers of ResNet, (ii) ViT propagates representations strongly between the lower and upper layers, and (iii) the highest layers of ViT have completely different representations than those of ResNet.

Also, as shown in the above figure, when we reduce the training dataset for ViT (using only ImageNet), the model clearly fails to learn local attention even in the lower layers. The performance of the model is then degraded and it performs worse than the CNN. In other words, in this setting the CNN performs better because it learns local features.

The above figure compares ResNet and ViT and shows that the similarity between ResNet and ViT decreases monotonically as the average distance (horizontal axis) increases. This indicates that the ViT upper layer and the ResNet lower layer learn quantitatively different features when the distance is large.

Propagation of Representations via Skip Connections

It has been shown that the representation of ViT is uniform, but how does the representation propagate?

As an experiment, we trained ViT without skip connections. As shown in the figure above, the similarity of the representations disappeared (purple to black), and the performance of the model also decreased, demonstrating the importance of skip connections in ViT.
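The intuition can be illustrated with a toy, untrained example. The sketch below is not the paper's experiment: it stacks random blocks of the assumed form `x + f(x)` or `f(x)` alone, with a deliberately small branch `f`, and measures how similar the final representation is to the input. All names and the choice of `tanh` are assumptions made only for illustration.

```python
import numpy as np


def block(x: np.ndarray, w: np.ndarray, use_skip: bool) -> np.ndarray:
    """One toy block: a small nonlinear branch, optionally added to the input."""
    branch = 0.1 * np.tanh(x @ w)
    return x + branch if use_skip else branch


def mean_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Average row-wise cosine similarity between two (n, dim) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))


def similarity_after_depth(
    depth: int, use_skip: bool, dim: int = 64, n: int = 128, seed: int = 0
) -> float:
    """Cosine similarity between the input and the output of `depth` blocks."""
    rng = np.random.default_rng(seed)
    x0 = rng.normal(size=(n, dim))
    x = x0
    for _ in range(depth):
        # Random, untrained weights scaled so that x @ w keeps roughly unit variance
        w = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        x = block(x, w, use_skip)
    return mean_cosine(x0, x)


with_skip = similarity_after_depth(depth=4, use_skip=True)
without_skip = similarity_after_depth(depth=4, use_skip=False)
print(f"with skip: {with_skip:.3f}, without skip: {without_skip:.3f}")
```

In this toy setting the identity path carries the input forward, so the output stays close to the input, whereas without it each block replaces the representation entirely. This only mirrors the qualitative trend in the figure; the paper's result comes from actually trained models.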

Spatial information and localization

The above figure shows how well ViT and ResNet retain spatial information. The feature maps of the arbitrary layers are used to calculate the CKA similarity: ViT (the two top rows) has yellow or blue color at roughly similar locations, and ResNet also seems to retain information at roughly similar locations but shows a wide range of spatial similarity. In other words, there are similarities even in unrelated positions.

This suggests that spatial information is also retained in the upper layers of ViT, which is useful in the object detection task.

Scale Effects

We have shown that the features learned by ViT vary with the size of the dataset.

The above shows the performance change of the models when the size of the dataset is varied. On the left is a comparison between a model trained on JFT-300M (solid line) and a model trained on ImageNet (dashed line). The task is a classification task on ImageNet.

There is no significant difference in the final result (1.0 on the horizontal axis), but the intermediate layer representation (around 0.5 on the horizontal axis) shows that ViT is more accurate. This suggests that ViT acquires a higher quality intermediate representation when trained on large-scale data (JFT-300M).

On the right is a comparison with ResNet. In the final result, there is not much difference between ViT and ResNet, but ViT is superior in the intermediate representation. This is also true for CIFAR-10 and CIFAR-100.
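A linear probe of this kind can be sketched as follows: features from a frozen layer are fed to a linear classifier, and the test accuracy indicates how much class information that layer holds. The version below fits the classifier in closed form with L2-regularized least squares on one-hot targets; the function names and the `l2` value are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np


def add_bias(features: np.ndarray) -> np.ndarray:
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_linear_probe(
    features: np.ndarray, labels: np.ndarray, num_classes: int, l2: float = 1e-3
) -> np.ndarray:
    """Closed-form ridge regression from frozen features to one-hot labels."""
    x = add_bias(features)
    y = np.eye(num_classes)[labels]  # one-hot targets, shape (n, num_classes)
    a = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(a, x.T @ y)


def probe_accuracy(
    weights: np.ndarray, features: np.ndarray, labels: np.ndarray
) -> float:
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = np.argmax(add_bias(features) @ weights, axis=1)
    return float(np.mean(predictions == labels))
```

Repeating this for the features of each layer, with the network itself kept fixed, yields an accuracy-versus-depth curve like the ones described above.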


CNNs were a breakthrough in computer vision, so it is remarkable that the Transformer, which originated in natural language processing, performs as well as CNNs on image tasks.

In this paper, we compare CNN (ResNet) and ViT and show that there is a surprisingly clear difference in their internal structures: we found that ViT aggregates global information from the beginning and propagates it strongly to the upper layers using skip connections. We also found that ViT requires a large pre-training dataset, which allows it to obtain a high-quality intermediate representation. These findings answer questions about the differences between ViT and CNNs and point to new directions for future research.
