The Embarrassingly Simple Vision Transformer

Transformer 04/01/2022

3 main points
✔️ The heart of ViT is a meta-structure called MetaFormer
✔️ Proposed PoolFormer which uses a Pooling layer without parameters
✔️ PoolFormer achieves higher accuracy than comparison methods with fewer parameters

MetaFormer is Actually What You Need for Vision
written by Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan
(Submitted on 22 Nov 2021 (v1), last revised 29 Nov 2021 (this version, v2))
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

One of the hot topics of 2021 was the Vision Transformer (ViT), which kept setting new state-of-the-art (SOTA) results on a wide range of tasks (see "Finally, ViT came to the field of video recognition"). Since then, there has been a lot of analysis of the question "Why is ViT doing so well?"

Look at the architecture on the right of the reference figure. In general, the Attention module of the Transformer structure is considered the important part, because it mixes information between tokens. Since the computational complexity of Attention, O(N^2) in the number of tokens N, is a bottleneck, a lot of research has been done to reduce it.
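To get a feel for the O(N^2) term, the following minimal sketch counts the token pairs that full self-attention has to score. The image sizes and the patch size of 16 are assumptions chosen for illustration, not values taken from this article, and the helper names are hypothetical.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens N for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(n_tokens: int) -> int:
    """Token-to-token interactions scored by full self-attention: N * N."""
    return n_tokens * n_tokens


# Assumed example resolutions; patch_size=16 is also an assumption.
for size in (224, 448):
    n = num_tokens(size, patch_size=16)
    print(f"{size}x{size} image: N = {n}, N^2 = {attention_pairs(n)}")
```

Doubling the side length of the image quadruples N and multiplies the number of attention pairs by 16, which is why the quadratic term becomes the bottleneck at higher resolutions.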

However, in March 2021, researchers at Google revealed that it is possible to achieve SOTA-like accuracy using only MLPs, without Attention (see "Does Transformer work without Attention?"). After that, research on MLP-based, Transformer-like models became more and more popular. This further deepened the mystery of what is intrinsically important about ViT.

This article describes a study that sought to answer the question "Is the architecture of token information mixing the key to ViT's success?"

The result is shocking: the particular way token information is mixed hardly matters, as long as information can be shared between tokens. The effectiveness of PoolFormer, which uses an embarrassingly simple Pooling layer (the authors themselves describe it as embarrassingly simple), is shown through several experiments.

In the following, the proposed method is introduced in section 2, and the key experiments are presented in section 3.

Proposed method

MetaFormer

The paper proposes a new concept called MetaFormer, shown in Figure (a).

MetaFormer is an architectural concept, not a concrete architecture. In other words, every model that follows the meta-structure shown in (a) is called a MetaFormer: for example, the Transformers using Attention that have been studied so far, the MLP-like models using MLPs, and the PoolFormer using the Pooling layer proposed in this research.

MetaFormer is very simple. First, Patch Embedding is performed on input I to obtain X (Equation 1).

Next, X is normalized and fed into the Token Mixer, which mixes information between the patches of X. A residual connection is then applied to obtain Y (Equation 2).

Furthermore, Y is normalized and passed through an MLP with an activation function, and the input and output are again connected by a residual connection (Equation 3).
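For reference, the three steps described above can be written out as follows. This is a reconstruction in the paper's notation: the names Z, W_1, W_2 and \sigma do not appear in the text above. Here \sigma denotes the activation function, and W_1 and W_2 are the learnable weights of the two-layer MLP.

```latex
\begin{align}
X &= \mathrm{InputEmb}(I) \tag{1} \\
Y &= \mathrm{TokenMixer}\bigl(\mathrm{Norm}(X)\bigr) + X \tag{2} \\
Z &= \sigma\bigl(\mathrm{Norm}(Y)\, W_1\bigr)\, W_2 + Y \tag{3}
\end{align}
```

Only the TokenMixer term changes from one MetaFormer to another; everything else in the block stays the same.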

PoolFormer

MetaFormer is analogous to the notion of an abstract class in the Python language; PoolFormer implements the Token Mixer of Equation 2 with a Pooling layer and is just one instantiation of MetaFormer.
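The abstract-class analogy can be made concrete with a small sketch in plain Python. It is not the authors' implementation: the class names, the two toy mixers, and the parameter-free per-token normalization are all hypothetical choices made for illustration, and only the token-mixing half of the block (Equation 2) is shown.

```python
from abc import ABC, abstractmethod
from typing import List

Tokens = List[List[float]]  # N tokens, each holding C channel values


class TokenMixer(ABC):
    """The abstract slot in a MetaFormer block: any module that mixes tokens."""

    @abstractmethod
    def mix(self, tokens: Tokens) -> Tokens:
        """Return new tokens with the same shape as the input."""


class IdentityMixer(TokenMixer):
    """Hypothetical mixer that performs no mixing at all."""

    def mix(self, tokens: Tokens) -> Tokens:
        return [list(t) for t in tokens]


class GlobalAverageMixer(TokenMixer):
    """Hypothetical mixer: every token becomes the mean of all tokens."""

    def mix(self, tokens: Tokens) -> Tokens:
        n = len(tokens)
        mean = [sum(col) / n for col in zip(*tokens)]
        return [list(mean) for _ in tokens]


def normalize(tokens: Tokens, eps: float = 1e-5) -> Tokens:
    """Normalize each token over its channels (assumed: no learnable scale or shift)."""
    out = []
    for t in tokens:
        mu = sum(t) / len(t)
        var = sum((v - mu) ** 2 for v in t) / len(t)
        out.append([(v - mu) / (var + eps) ** 0.5 for v in t])
    return out


def token_mixing_step(tokens: Tokens, mixer: TokenMixer) -> Tokens:
    """Y = TokenMixer(Norm(X)) + X, the first half of a MetaFormer block."""
    mixed = mixer.mix(normalize(tokens))
    return [[m + x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mixed, tokens)]


x = [[1.0, 2.0, 3.0], [6.0, 4.0, 2.0]]
for mixer in (IdentityMixer(), GlobalAverageMixer()):
    y = token_mixing_step(x, mixer)
    print(type(mixer).__name__, [[round(v, 3) for v in row] for row in y])
```

The function `token_mixing_step` never needs to know which mixer it receives, which is the point of the analogy: swapping Attention, an MLP, or Pooling changes only the concrete subclass.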

Pooling simply outputs the average of the K*K neighboring values. The subtraction of the input T at the end of Equation 4 compensates for the residual connection (+X in Equation 2) that is applied afterwards. As can be seen from Equation 4, Attention, whose computational complexity is O(N^2), has been replaced with Pooling, which has no learnable parameters.
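The pooling step described above can be sketched in a few lines of plain Python for a single-channel feature map. This is an illustrative sketch rather than the paper's PyTorch code: the function name is hypothetical, K = 3 is an assumed default, and averaging only over the positions that fall inside the map at the borders is an assumption of this sketch.

```python
from typing import List

Grid = List[List[float]]  # one channel of an H x W feature map


def pool_token_mixer(grid: Grid, k: int = 3) -> Grid:
    """out[i][j] = mean of the K x K window centred at (i, j), minus grid[i][j]."""
    if k % 2 == 0:
        raise ValueError("k must be odd so that the window is centred")
    h, w = len(grid), len(grid[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # Collect only neighbours that lie inside the map (border assumption).
            window = [grid[p][q]
                      for p in range(max(0, i - r), min(h, i + r + 1))
                      for q in range(max(0, j - r), min(w, j + r + 1))]
            # Subtract the input so the later residual "+X" does not count it twice.
            row.append(sum(window) / len(window) - grid[i][j])
        out.append(row)
    return out


feature_map = [[0.0, 0.0, 0.0],
               [0.0, 9.0, 0.0],
               [0.0, 0.0, 0.0]]
for row in pool_token_mixer(feature_map):
    print(row)
```

Adding the input back, as the residual connection does, leaves exactly the local average. The function contains no weights at all, and each output value touches at most K*K inputs, regardless of how many tokens the map contains.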

This is the end of the introduction of the architecture proposed in the paper, and I hope you have also realized that it is stupendously simple. Finally, for your reference, the paper provides PyTorch sample code and Figure 2, which visualizes how the tensor shape changes from input to output.

Experiment: Validating the effectiveness of MetaFormer

Image classification

The experiments use ImageNet-1K (1K-class classification with 1.3M training images and 50K validation images) with four different data augmentations: MixUp, CutMix, CutOut, and RandAugment. The models are trained for 300 epochs using the AdamW optimizer with weight decay rate = 0.05, batch size = 4096, and learning rate lr = 0.001 * batch size / 1024. A cosine schedule with warmup epochs = 5 is used to decay the learning rate. Label smoothing is also used; for more information about it, please refer to (The truth behind label smoothing!).

After this long introduction to the experimental setup, the experimental results themselves are clear, as shown in Table 2. The models are grouped by Token Mixer, and the evaluation metrics are Params (M), MACs (G), and Top-1 accuracy (%). RSB-ResNet refers to ResNet trained for 300 epochs with the improved training procedure of "ResNet Strikes Back".
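As a worked example of the learning-rate rule above, the sketch below computes the peak learning rate for batch size 4096 and a per-epoch cosine schedule with 5 warmup epochs. The linear shape of the warmup, the minimum learning rate of 0, and the per-epoch (rather than per-step) granularity are assumptions of this sketch; the function names are hypothetical.

```python
import math


def peak_lr(batch_size: int, base_lr: float = 1e-3, base_batch: int = 1024) -> float:
    """Linear scaling rule from the text: lr = 0.001 * batch size / 1024."""
    return base_lr * batch_size / base_batch


def lr_at_epoch(epoch: int, total_epochs: int = 300, warmup_epochs: int = 5,
                batch_size: int = 4096, min_lr: float = 0.0) -> float:
    """Learning rate for a 0-indexed epoch under warmup plus cosine decay."""
    peak = peak_lr(batch_size)
    if epoch < warmup_epochs:
        # Assumed linear warmup that reaches the peak at the last warmup epoch.
        return peak * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))


print("peak lr:", peak_lr(4096))
for epoch in (0, 4, 5, 150, 299):
    print(epoch, lr_at_epoch(epoch))
```

With batch size 4096 the rule gives a peak learning rate of 0.004, four times the base value of 0.001 defined for batch size 1024.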

The last block of Table 2 shows that high accuracy can be achieved with a small number of parameters. For example, even small PoolFormers with 21M and 31M parameters achieve Top-1 accuracies of 80.3% and 81.4%, respectively, exceeding many Attention-based and MLP-based MetaFormers.

Furthermore, from the visualization in Figure 3, it is clear that PoolFormer achieves higher accuracy than the comparison methods while requiring fewer MACs and parameters.

Object detection and instance segmentation

Next, PoolFormer was tested on the COCO object detection and instance segmentation tasks. The COCO benchmark has 118K training images (train2017) and 5K evaluation images (val2017). PoolFormer was used as the backbone, with RetinaNet and Mask R-CNN as the heads.

Table 3 and Table 4 show the results of each. On all of the evaluation metrics, PoolFormer shows "higher accuracy with fewer parameters" than the baseline ResNet.

Semantic segmentation

PoolFormer was also evaluated on semantic segmentation, using the ADE20K dataset with Semantic FPN as the segmentation head. In all conditions, PoolFormer achieves "higher accuracy with fewer parameters" than ResNet and the other comparison methods.

Ablation studies

Finally, ablation experiments were performed on the Pooling layer, the normalization method, and the activation function.

One important result is that 74.3% accuracy can still be achieved even if the Pooling layer is replaced with an Identity mapping. In addition, because accuracy did not change significantly when the other components were changed, we can infer that the meta-structure of MetaFormer is more important than the details of PoolFormer.

At the end of the experiments, the authors show that new architectures obtained by combining the Pooling layer with Attention or MLP may lead to improved accuracy, which they leave as a direction for future study.

Summary

What do you think? While a lot of research is being done to analyze ViT, the new concept of MetaFormer and the simplicity of PoolFormer have had a great impact on the community. 2021 was a very exciting year for ViT, but will we reach a conclusion in 2022? We're looking forward to it.