What Are The Attractive Characteristics Of Vision Transformers?

3 main points
✔️ Compares vision transformers (ViT) and CNNs
✔️ Investigate properties related to robustness to occlusions and perturbations, and shape bias
✔️ Investigate the effectiveness of features in downstream tasks

Intriguing Properties of Vision Transformers
written by Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang
(Submitted on 21 May 2021 (v1), last revised 25 Nov 2021 (this version, v3))
Comments: NeurIPS 2021

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Vision transformers (ViTs) have shown excellent performance in a variety of computer vision tasks. The paper presented in this article conducts a detailed study of the differences between CNNs and ViTs in terms of robustness and generalization, covering three transformer families (ViT, DeiT, and T2T), and uncovers several attractive properties of ViTs. Let's have a look at them below.

On the Robustness of Vision Transformers to Occlusion

First, to investigate the robustness of ViT against occlusions (i.e., parts of the image being blocked), we perform experiments in which part of the image is missing. (In the following, unless otherwise mentioned, we refer to all variants, ViT, DeiT, T2T, etc., collectively as ViT.) Here we use simple masking as the form of occlusion.

First, consider an input image $x$ with label $y$, where $x$ consists of $N$ patches (a ViT typically splits a 224x224 image into $N=196$ patches of 16x16 pixels, i.e., a 14x14 grid). We then create an occluded image $x'$ in which $M$ of the patches ($M<N$) have their pixel values set to zero (a technique named PatchDrop).

The image generated here will look like this

(In this illustration the dropped patches are drawn in translucent black, but in practice they are filled with solid black.)

Three main masking strategies are used:

  1. Random PatchDrop: Selects and drops a random set of $M$ patches.
  2. Salient (foreground) PatchDrop: Uses DINO to drop the set of patches containing the top $Q$% of the foreground information in the image. (As in the example image, this $Q$% does not necessarily correspond to $Q$% of the pixels.)
  3. Non-Salient (background) PatchDrop: Uses DINO to drop the set of patches containing the bottom $Q$% of the foreground information in the image. (Again, this $Q$% does not necessarily correspond to $Q$% of the pixels.)
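As a rough illustration, Random PatchDrop can be sketched in NumPy as follows. The 224x224 image size, 16-pixel patches, and 50% drop ratio are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def random_patch_drop(image, patch_size=16, drop_ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping patches (Random PatchDrop)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape
    gh, gw = h // patch_size, w // patch_size      # patch grid, e.g. 14x14 for 224x224
    n = gh * gw                                    # total number of patches N
    m = int(round(drop_ratio * n))                 # number of patches M to drop
    drop = rng.choice(n, size=m, replace=False)    # pick M distinct patch indices
    out = image.copy()
    for idx in drop:
        r, c = divmod(idx, gw)
        out[r * patch_size:(r + 1) * patch_size,
            c * patch_size:(c + 1) * patch_size, :] = 0
    return out

x = np.ones((224, 224, 3), dtype=np.float32)       # dummy all-ones image
x_occ = random_patch_drop(x, drop_ratio=0.5, rng=np.random.default_rng(0))
```

With `drop_ratio=0.5`, the Information Loss $\frac{M}{N}$ defined below is exactly 0.5, i.e., half of the image is zeroed out.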

We investigate whether the model still makes a correct prediction $\mathrm{argmax}\, f(x') = y$ for the occluded image $x'$ created in this way.

If we define Information Loss as the fraction of missing patches $\frac{M}{N}$, then the robustness of ViT to occlusion is as follows.

Comparing ViT with ResNet50, a strong baseline among CNN models (left figure), the ViT models show considerably higher robustness than the CNN models.

For example, when Random PatchDrop removes 50% of the image, ResNet50 (23M parameters) achieves an accuracy of only 0.1%, whereas DeiT-S (22M parameters) achieves 70%. These results are consistent across the ViT architectures studied, indicating that ViT exhibits excellent robustness to random, foreground, and background masking.

Further analysis

To investigate ViT's behavior under occlusion in more detail, visualizations of the attention maps at each layer are shown below.

Here, the masked area is shown in the figure below.

In this figure, attention is spread across the entire image in the early layers, whereas in deeper layers it increasingly concentrates on the unoccluded regions.

Furthermore, correlation coefficients are computed between the CLS tokens and features of ViT with and without occlusion, to quantify how much the representation changes.

The table shows the correlation coefficients for CLS tokens and the figure shows the correlation coefficients for each superclass.

In general, we find that ViT does not change its representation significantly in the presence of occlusion; that is, its features are more robust.

Can ViT capture shape and texture?

Next, we investigate ViT's ability to grasp shape and texture.

Learning without local texture

We first investigate the case where the ViT model is trained on a dataset where no information about the local texture is available.

In this section, a dataset (SIN) is created from ImageNet with local texture information removed, and the ViT model is trained on it. No data augmentation is used here, so that the shape information is not altered.

The results of our analysis of the model's shape bias (the proportion of correct judgments based on the shape of the object) are shown below.

The left figure shows the shape/texture bias trade-off.

We can see that ViT tends to show a higher shape bias than the CNN model, while the model trained on the regular dataset is biased towards texture.

Among the models trained on SIN, ViT shows a shape bias quite close to human judgments, suggesting ViT's strong ability to capture shape. The right figure shows the shape bias of the various models; here too, ViT exhibits a higher shape bias than ResNet.
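The shape-bias metric mentioned above, the proportion of decisions made according to shape on images where shape and texture cues disagree, can be sketched as follows. The function name and label lists are illustrative, not the paper's evaluation code:

```python
def shape_bias(preds, shape_labels, texture_labels):
    """Fraction of cue-conflict decisions made according to shape.

    Only predictions matching either the shape label or the texture label
    count as decisions; everything else is ignored.
    """
    shape_hits = texture_hits = 0
    for p, s, t in zip(preds, shape_labels, texture_labels):
        if p == s:
            shape_hits += 1
        elif p == t:
            texture_hits += 1
    total = shape_hits + texture_hits
    return shape_hits / total if total else 0.0

# Toy example: 2 of 3 decisions follow the shape cue -> shape bias = 2/3.
bias = shape_bias(["cat", "dog", "car"],
                  ["cat", "dog", "cat"],     # shape labels
                  ["dog", "car", "car"])     # texture labels
```

A shape bias of 1.0 would mean every decision follows the object's shape, as humans largely do.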

Further properties of ViT with high shape bias

An attractive property of ViT models with an enhanced shape bias is that, by attending strongly to foreground objects in the image, they can automatically perform foreground segmentation.

Here, (Distilled) in the figure refers to a ViT distilled with an added shape token (shape distillation; see the original paper for details). The Jaccard coefficients between the ground-truth and the obtained segmentation maps are summarized in the following table.
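The Jaccard coefficient (intersection over union) between a predicted and a ground-truth binary segmentation mask can be computed as in this minimal sketch (not the paper's evaluation code):

```python
import numpy as np

def jaccard(pred_mask, gt_mask):
    """Jaccard index (IoU) between two binary segmentation masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention: two empty masks are considered a perfect match.
    return float(inter) / float(union) if union else 1.0

# Toy 2x2 masks: overlap of 1 pixel, union of 2 pixels -> IoU = 0.5.
pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
iou = jaccard(pred, gt)
```

A higher value means the automatically attended foreground region matches the ground-truth segmentation more closely.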

These results show that ViT can have an excellent shape bias that is close to human capabilities.

On Robustness to Natural and Adversarial Perturbations

Next, we investigate robustness to natural perturbations such as rain, fog, snow, and noise. The mCE (mean Corruption Error) of the ViT and CNN models is as follows (lower is better).
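For reference, mCE is conventionally computed by summing a model's classification errors over the severities of each corruption, normalizing by a baseline model's errors (AlexNet in the ImageNet-C convention), and averaging over corruption types. A minimal sketch with illustrative numbers:

```python
def corruption_error(model_errs, baseline_errs):
    """CE for one corruption type: errors summed over severity levels,
    normalized by the baseline model's summed errors."""
    return sum(model_errs) / sum(baseline_errs)

def mean_corruption_error(model_errs_by_corruption, baseline_errs_by_corruption):
    """mCE: average CE over all corruption types (lower is better)."""
    ces = [corruption_error(m, b)
           for m, b in zip(model_errs_by_corruption, baseline_errs_by_corruption)]
    return sum(ces) / len(ces)

# Toy numbers: two corruption types, two severity levels each.
model = [[0.2, 0.4], [0.3, 0.3]]       # hypothetical model errors
baseline = [[0.4, 0.8], [0.6, 0.6]]    # hypothetical baseline (AlexNet) errors
mce = mean_corruption_error(model, baseline)
```

An mCE below 1.0 means the model is more robust to corruptions than the baseline.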

The results for adversarial patch attacks and sample-specific attacks are shown below.

In general, ViT was shown to be more robust to natural and adversarial perturbations than CNN.

On the effectiveness of ViT for feature extraction

Finally, we investigate the effectiveness of using ViT as a backbone for feature extraction.

Specifically, the CLS tokens from each block of the ViT are concatenated, and a linear classifier is trained on top of them.
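This linear-probe setup can be sketched as follows. The random features stand in for real DeiT-S CLS tokens (12 blocks of dimension 384 are illustrative), and a least-squares fit to one-hot labels stands in for the trained linear classifier:

```python
import numpy as np

# Hypothetical per-block CLS tokens: one CLS vector per transformer block
# for each image (12 blocks of dimension 384, as in a DeiT-S-like model).
num_images, num_blocks, dim = 8, 12, 384
rng = np.random.default_rng(0)
cls_tokens = rng.normal(size=(num_images, num_blocks, dim))

# Concatenate the CLS tokens across blocks into one feature vector per image.
features = cls_tokens.reshape(num_images, num_blocks * dim)   # (8, 4608)

# A linear classifier on top; least squares against one-hot labels is a
# stand-in for training a real linear layer.
labels = np.eye(2)[rng.integers(0, 2, size=num_images)]       # one-hot, 2 classes
W, *_ = np.linalg.lstsq(features, labels, rcond=None)
preds = features @ W                                          # class scores
```

Only the linear classifier is trained; the ViT backbone is frozen and used purely as a feature extractor.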

The results on image classification and few-shot learning benchmarks are as follows.

DeiT-S (ensemble) in the figure uses the CLS tokens from the last four blocks. In general, we found that ViT is also effective as a feature extractor for downstream tasks.


Summary

The Transformer-vs-CNN debate has been active ever since transformers were introduced for visual tasks.

In this article, we discussed a paper that showed the advantages of ViT over CNNs, including robustness to occlusions and perturbations, properties related to shape bias, and effectiveness as a feature extractor. Recently, there have been examples of MLP-based models showing superior performance, and we look forward to further discussions on architectural differences in visual tasks.
