The Embarrassingly Simple Vision Transformer

3 main points
✔️ The heart of ViT is a meta-structure called MetaFormer
✔️ Proposed PoolFormer which uses a Pooling layer without parameters
✔️ PoolFormer achieves higher accuracy than comparison methods with fewer parameters

MetaFormer is Actually What You Need for Vision
written by Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan
(Submitted on 22 Nov 2021 (v1), last revised 29 Nov 2021 (this version, v2))
Comments: Published on arXiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

One of the hot topics of 2021 was the Vision Transformer (ViT) (finally, ViT came to the field of video recognition), which kept setting new state-of-the-art (SOTA) results on various tasks. Since then, there has been a lot of analysis of why ViT does so well.

Look at the architecture on the right of the reference figure. In general, the Attention module of the Transformer structure is considered important because it mixes information between tokens. Because the computational complexity of Attention, O(N^2) in the number of tokens N, is a bottleneck, a lot of research has gone into reducing it.

However, in May 2021, researchers at Google revealed that it is possible to achieve SOTA-like accuracy using only MLPs, without Attention (does Transformer work without Attention?). After that, research on Transformer-like models built from MLPs became more and more popular. This further deepened the mystery of what is intrinsically important about ViT.

This article describes a study that sought to answer the question 'Is the architecture used for token information mixing the key to ViT's success?'

The result is shocking: the specific architecture used for token information mixing turned out not to matter much, as long as information can be shared between tokens. The effectiveness of PoolFormer, which uses an embarrassingly simple Pooling layer as its token mixer (the authors themselves describe it as embarrassingly simple), is shown through several experiments.

In the following, we first introduce the proposed method and then present the key experiments.

Proposed method


In the paper, the authors propose a new concept called MetaFormer, shown in Figure (a).

MetaFormer is an architectural concept, not a concrete architecture. In other words, every model that follows the meta-structure shown in (a) is called a MetaFormer. This includes the Transformer using the Attention structure that has been studied so far, the MLP-like models that use an MLP for token mixing, and the PoolFormer using the Pooling layer proposed in this research.

MetaFormer is very simple. First, Patch Embedding is performed on input I to obtain X (Equation 1).

Next, X is normalized and fed into the Token Mixer, which mixes information between the patches of X. A residual connection is then added to obtain Y (Equation 2).

Furthermore, Y is normalized and passed through a two-layer MLP with an activation function, and the input and output are again joined by a residual connection (Equation 3).
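Since the equation images are not reproduced in this text, the three steps described above can be written out as follows. This is a reconstruction based on the description here and on my reading of the paper. The names InputEmb, Norm, and TokenMixer, the activation σ, and the MLP weight matrices W_1 and W_2 are notation assumed for this sketch, with Z denoting the block output.

```latex
\begin{align}
X &= \mathrm{InputEmb}(I) \tag{1} \\
Y &= \mathrm{TokenMixer}(\mathrm{Norm}(X)) + X \tag{2} \\
Z &= \sigma\big(\mathrm{Norm}(Y)\,W_1\big)\,W_2 + Y \tag{3}
\end{align}
```

Only the TokenMixer term in Equation 2 differs between an Attention-based Transformer, an MLP-like model, and PoolFormer; everything else is shared.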


MetaFormer is analogous to an abstract class in Python: PoolFormer implements the TokenMixer of Equation 2 with a Pooling layer and is just one instantiation of MetaFormer.
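To make the abstract-class analogy concrete, here is a minimal pure-Python sketch. It is an illustration only, not the authors' code: the class names `MetaFormerBlock` and `MeanMixerBlock`, the helper functions, and the weight-free ReLU standing in for the channel MLP are all hypothetical simplifications chosen so the example runs without any deep learning library.

```python
from abc import ABC, abstractmethod
from typing import List

Tokens = List[List[float]]  # N tokens, each with C channels


def layer_norm(tokens: Tokens, eps: float = 1e-5) -> Tokens:
    """Normalize each token to zero mean and unit variance over its channels."""
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / (var + eps) ** 0.5 for v in tok])
    return out


def add(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise sum, used for the residual connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class MetaFormerBlock(ABC):
    """Abstract block: fixes the overall structure, leaves the token mixer open."""

    @abstractmethod
    def token_mixer(self, tokens: Tokens) -> Tokens:
        """Mix information between tokens; subclasses decide how."""

    def channel_mlp(self, tokens: Tokens) -> Tokens:
        # Assumption: a weight-free ReLU stands in for the learned two-layer MLP.
        return [[max(0.0, v) for v in tok] for tok in tokens]

    def forward(self, x: Tokens) -> Tokens:
        y = add(self.token_mixer(layer_norm(x)), x)  # Equation 2
        z = add(self.channel_mlp(layer_norm(y)), y)  # Equation 3
        return z


class MeanMixerBlock(MetaFormerBlock):
    """Toy instantiation: every token receives the mean of all tokens."""

    def token_mixer(self, tokens: Tokens) -> Tokens:
        n = len(tokens)
        mean = [sum(col) / n for col in zip(*tokens)]
        return [list(mean) for _ in tokens]


block = MeanMixerBlock()
print(block.forward([[1.0, 2.0], [3.0, 5.0]]))
```

Trying to instantiate `MetaFormerBlock` directly raises a `TypeError`, which mirrors the point that MetaFormer itself is not a concrete architecture. Swapping in a different `token_mixer` gives a different member of the family without touching `forward`.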

Pooling simply outputs the average of the K×K neighboring values (Equation 4). The subtraction of T at the end of Equation 4 compensates for the residual connection (+X in Equation 2) that is applied afterwards. As Equation 4 shows, Attention, whose computational complexity is O(N^2), has been replaced with Pooling, which has no learnable parameters.
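For reference, the pooling operation can be written as below. This is a reconstruction from the description above and my reading of the paper, so treat the index convention as an assumption: T is the input feature map, the leading colon ranges over channels, (i, j) is the spatial position, and K is the (odd) pooling size.

```latex
T'_{:,i,j} = \frac{1}{K \times K} \sum_{p,q=1}^{K}
  T_{:,\; i+p-\frac{K+1}{2},\; j+q-\frac{K+1}{2}} \;-\; T_{:,i,j} \tag{4}
```

For K = 3 the offsets p - (K+1)/2 and q - (K+1)/2 run over -1, 0, 1, so each output is the mean of the 3×3 neighborhood minus the center value. Adding the residual +X of Equation 2 afterwards cancels the subtracted term, leaving the plain average.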

This concludes the introduction of the architecture proposed in the paper, and I hope you can see that it is stupendously simple. Finally, for your reference, we include the PyTorch sample code and Figure 2, which visualizes how the shape changes from input to output.
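As a further illustration of the pooling token mixer described above, here is a minimal pure-Python sketch for a single channel of a feature map. It is written for this explanation and is not the paper's PyTorch listing. The function name `pool_token_mixer` is hypothetical, and the border handling (averaging only the values that lie inside the map) is an assumption that, as I understand it, corresponds to stride 1, padding K // 2, and `count_include_pad=False` in PyTorch's `nn.AvgPool2d`.

```python
from typing import List

Grid = List[List[float]]  # one channel of an H x W feature map


def pool_token_mixer(t: Grid, k: int = 3) -> Grid:
    """Average over a K x K neighborhood (stride 1), then subtract the input.

    Assumes k is odd. At the borders, only positions inside the map are
    averaged, so padded positions are not counted.
    """
    h, w = len(t), len(t[0])
    r = k // 2  # how far the window extends on each side of the center
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [
                t[p][q]
                for p in range(max(0, i - r), min(h, i + r + 1))
                for q in range(max(0, j - r), min(w, j + r + 1))
            ]
            # Subtracting t[i][j] compensates for the residual added later.
            row.append(sum(window) / len(window) - t[i][j])
        out.append(row)
    return out


feature_map = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]
for row in pool_token_mixer(feature_map):
    print(row)
```

Note that the function has no weights at all: the only setting is the window size K. A constant feature map is mapped to all zeros, because the local average equals the value being subtracted.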

Experiment: Validating the effectiveness of MetaFormer

Image classification

In the experiments, ImageNet-1K (1K classes, with 1.3M training images and 50K validation images) was used with four data augmentations: MixUp, CutMix, CutOut, and RandAugment. The models were trained for 300 epochs with weight decay rate = 0.05, batch size = 4096, and learning rate lr = 0.001 * batch size / 1024. A cosine schedule with warmup epochs = 5 was used to decay the learning rate. Label smoothing was also used; for more information, please refer to (The truth behind label smoothing!).
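With batch size 4096, the scaling rule above gives a peak learning rate of 0.001 * 4096 / 1024 = 0.004. The sketch below works through that arithmetic and a cosine decay with 5 warmup epochs over 300 epochs. The function names are hypothetical, and two details are assumptions not stated in this article: the warmup is taken to be linear from zero, and the schedule decays to a minimum learning rate of 0.

```python
import math


def peak_lr(batch_size: int, base_lr: float = 1e-3) -> float:
    """Linear scaling rule: lr = base_lr * batch_size / 1024."""
    return base_lr * batch_size / 1024


def lr_at_epoch(
    epoch: float,
    total_epochs: int = 300,
    warmup_epochs: int = 5,
    batch_size: int = 4096,
    min_lr: float = 0.0,  # assumption: decay all the way to zero
) -> float:
    """Learning rate at a given epoch: warmup, then cosine decay."""
    peak = peak_lr(batch_size)
    if epoch < warmup_epochs:
        # Assumption: linear warmup from 0 up to the peak learning rate.
        return peak * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))


for e in (0, 5, 150, 300):
    print(e, lr_at_epoch(e))
```

The learning rate rises during the first 5 epochs, reaches its peak at epoch 5, and then follows half a cosine wave down to the minimum at epoch 300.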