[MLP-Mixer] The Day MLP Surpasses CNN And Transformer

Image Recognition 10/06/2021

3 main points
✔️ Achieves image classification performance comparable to SoTA with a simple architecture using only multilayer perceptrons (MLPs)
✔️ Iteratively mixes location-specific features and spatial information
✔️ Achieves high accuracy while also reducing the computational cost

MLP-Mixer: An all-MLP Architecture for Vision
written by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy
(Submitted on 4 May 2021 (v1), last revised 17 May 2021 (this version, v2))
Comments: Accepted by arXiv
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

introduction

In the field of computer vision, Convolutional Neural Networks (CNNs) have long been the de facto standard, and recently attention-based networks such as Vision Transformer (ViT) have also attracted attention. This paper, however, shows that neither convolution nor attention is necessary. MLP-Mixer is built from two types of layers: an MLP applied independently to each image patch (to mix the features at each location), and an MLP applied across image patches (to mix spatial information). MLP-Mixer achieves image classification performance comparable to SoTA when it is trained on sufficiently large datasets or with modern regularization methods.

architecture

First of all, let me explain the overall picture of Mixer shown above. The input image is divided into non-overlapping patches of 16x16 pixels. Next, the same linear embedding is applied to each patch. Then, the Mixer Layer is applied to the embedded patches repeatedly. Finally, the image is classified by applying Global Average Pooling followed by a linear classifier, as in typical CNNs.
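
The patch splitting and per-patch linear embedding described above can be sketched as follows. This is only a minimal NumPy illustration, not the authors' implementation: the 224x224 input size, the hidden width of 512, the random weights, and the names `patchify`, `w_embed`, and `b_embed` are all assumptions made for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C_in) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, c) -> (gh, gw, P, P, c) -> (num_patches, P * P * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))  # assumed input resolution
patch_size, hidden_dim = 16, 512            # illustrative values only

patches = patchify(image, patch_size)       # shape (196, 768)

# The same linear embedding is shared by every patch (random stand-in weights).
w_embed = rng.standard_normal((patches.shape[1], hidden_dim)) * 0.02
b_embed = np.zeros(hidden_dim)
tokens = patches @ w_embed + b_embed        # shape (196, 512): patches x channels

print(patches.shape, tokens.shape)
```

The resulting `tokens` array is the "patches x channels" table that the Mixer Layers operate on.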

Next, we explain the contents of the Mixer Layer shown in the figure above. Mixer uses two types of MLPs: token-mixing MLPs and channel-mixing MLPs. Think of the input as a table with one row per patch (token) and one column per channel. The token-mixing MLP mixes features between different spatial locations (tokens); it is applied to each channel independently and takes each column of the table as input. The channel-mixing MLP, on the other hand, mixes features between different channels; it is applied to each token independently and takes each row of the table as input.

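In the extreme case, the channel-mixing MLP can be regarded as a 1×1 convolution, while the token-mixing MLP can be regarded as a single-channel depth-wise convolution with a full receptive field and parameters shared across channels. The converse does not hold, however: typical CNNs are not special cases of Mixer, and MLP-Mixer has a much simpler architecture than such CNNs.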

Next, let us write the token-mixing MLP and the channel-mixing MLP as mathematical expressions. The structure of the MLP block is shown below. It consists of a fully connected layer + GELU + a fully connected layer, which is much simpler than a CNN or Transformer block. Each MLP is wrapped in a skip-connection as in ResNet, and each MLP is preceded by Layer Normalization.

token-mixing MLP

U: output of the token-mixing MLP block
X: features obtained by dividing the image into patches and embedding them
W1: first fully connected layer of the MLP block
W2: second fully connected layer of the MLP block
σ: GELU function of the MLP block
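
With the symbols listed above, the token-mixing step can be written as the following equation. It is reconstructed here from the description in the text (skip-connection, Layer Normalization, then fully connected layer + GELU + fully connected layer). The column index i and the channel count C are notation assumed for this sketch.

```latex
U_{*,i} = X_{*,i} + W_2 \, \sigma\bigl( W_1 \, \mathrm{LayerNorm}(X)_{*,i} \bigr),
\qquad i = 1, \dots, C
```

Here the subscript (*, i) picks out the i-th column of the patches x channels table, so the same W1 and W2 are reused for every channel.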

channel-mixing MLP

Y: output of the channel-mixing MLP block
U: output of the token-mixing MLP block
W3: first fully connected layer of the MLP block
W4: second fully connected layer of the MLP block
σ: GELU function of the MLP block
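
With the symbols listed above, the channel-mixing step can be written as the following equation, again reconstructed from the description in the text. The row index j and the number of patches S are notation assumed for this sketch.

```latex
Y_{j,*} = U_{j,*} + W_4 \, \sigma\bigl( W_3 \, \mathrm{LayerNorm}(U)_{j,*} \bigr),
\qquad j = 1, \dots, S
```

Here the subscript (j, *) picks out the j-th row of the table, so the same W3 and W4 are reused for every token.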

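Note that the GELU activation function used here is also used in well-known natural language processing models such as GPT and BERT. GELU multiplies each input x by Φ(x), the cumulative distribution function of the standard normal distribution. It can be read as the expected value of a Dropout-like gate that multiplies an activation by 0 with a probability that depends on its value, although the function itself is deterministic.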

This is the architecture of Mixer. You can see that everything is built from very basic operations such as matrix multiplication and transposition.
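
To make the "matrix product and transpose" point concrete, here is a minimal NumPy sketch of a single Mixer Layer. It is an illustration under stated assumptions, not the authors' code: biases and the learnable scale and shift of Layer Normalization are omitted, GELU uses the common tanh approximation, the weights are random, and the sizes S, C, D_S, and D_C are illustrative. The weights are stored transposed (applied on the right to row vectors), so `w1` has shape (S, D_S) rather than (D_S, S).

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalize over the last axis (channels); learnable scale/shift omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w_in, w_out):
    # Fully connected -> GELU -> fully connected on the last axis (no biases).
    return gelu(x @ w_in) @ w_out

def mixer_layer(x, w1, w2, w3, w4):
    """x: (S, C) table of tokens. w1: (S, D_S), w2: (D_S, S), w3: (C, D_C), w4: (D_C, C)."""
    # Token mixing: transpose so tokens lie on the last axis, mix, transpose back.
    u = x + mlp(layer_norm(x).T, w1, w2).T
    # Channel mixing: applied to each row (token) independently.
    return u + mlp(layer_norm(u), w3, w4)

rng = np.random.default_rng(0)
S, C, D_S, D_C = 196, 64, 32, 128           # illustrative sizes only
x = rng.standard_normal((S, C))
w1 = rng.standard_normal((S, D_S)) * 0.02
w2 = rng.standard_normal((D_S, S)) * 0.02
w3 = rng.standard_normal((C, D_C)) * 0.02
w4 = rng.standard_normal((D_C, C)) * 0.02

y = mixer_layer(x, w1, w2, w3, w4)
print(y.shape)  # (196, 64)
```

Because the output has the same shape as the input, layers of this form can be stacked any number of times before the final pooling and classification step.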

experiment

In our experiments, we pre-train the model on a large dataset and then fine-tune it on a smaller dataset (the downstream task), following the conventional approach. We are interested in the following three quantities.

1. accuracy in downstream tasks
2. the computational cost of pre-training
3. throughput during inference

Note that the goal of this paper is not to achieve SoTA, but to show that a simple MLP-based model can be competitive with today's best CNNs and Transformers.

downstream task

・ImageNet
・CIFAR-10/100
・Oxford-IIIT Pets
・Oxford Flowers-102
・Visual Task Adaptation Benchmark

pre-training data

・ImageNet (ILSVRC2012)
・ImageNet-21k
・JFT-300M

Mixer model details

result

When pre-trained on ImageNet-21k, Mixer achieves roughly the same level of accuracy as Vision Transformer (ViT) and BiT while keeping the computational cost low.

Moreover, as shown in the figure above, the classification accuracy of MLP-Mixer improves as the pre-training dataset grows, and the improvement is larger than for the other models. When the dataset size is increased to 300M images, Mixer reaches higher accuracy than BiT while also running faster.

The above figure also shows the trade-off between accuracy and computational cost. Although Mixer's accuracy alone is slightly lower than SoTA, it sits on the Pareto frontier together with the other SoTA models once both accuracy and computational cost are taken into account.

summary

Although Mixer is a very simple architecture built from only two types of MLP layers, it achieves image classification performance comparable to SoTA models while keeping the computational cost low. If research building on this paper continues, MLP-based models may eventually surpass CNNs and Transformers in both classification accuracy and computational cost.