
Finally, BERT For Image Recognition? About MAE!


3 main points
✔️ Aims to build a BERT-like model for CV using Vision Transformer (ViT)
✔️ Proposes MAE, which masks 75% of the image patches, feeds only the visible patches to the Encoder, and uses a Transformer as the Decoder
✔️ Representations pre-trained on ImageNet-1K without labels achieve 87.8% accuracy for the first time

Masked Autoencoders Are Scalable Vision Learners
written by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
(Submitted on 11 Nov 2021 (v1), last revised 19 Dec 2021 (this version, v3))
Comments: Published on arxiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)



The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In the field of image recognition (CV), deep learning has been a great success (see "ViT, Vision Transformers' Research for the Future"), and labeled images have played an important role in it. In the field of Natural Language Processing (NLP), on the other hand, self-supervised learning has achieved tremendous success. Famous examples are GPT-3, an autoregressive model with 175 billion parameters (see "How to extract the true value of GPT-3: Prompt programming"), and BERT, a masked autoencoding model.

What is the origin of this difference between CV and NLP? In this article, we analyze the differences between language and images and introduce a paper that proposes a Masked AutoEncoder (MAE) to bridge the gap.

The Denoising Autoencoder, which is one form of masked autoencoding in CV, has been studied for a while, but it has not become as widespread in CV as BERT has in NLP. The possible causes are as follows.

1. Architectural differences: until recently, CV has been dominated by convolutional networks (CNNs), which extract information from each region of the image, making it difficult to incorporate masking mechanisms. However, this problem has been solved by Vision Transformer (ViT).

2. Differences in information density: natural language was created by humans and has a high level of abstraction and a high information density per unit. Therefore, in NLP, predicting even a few masked words is a challenging task. Images, in contrast, have sparse information density, so a masked patch can often be recovered from its neighbors. The paper therefore proposes masking a very high ratio of the image. This is expected to make the prediction task harder and to encourage the model to pay attention to a wider range of information.

3. Differences in the role of the Decoder: in NLP the Decoder can be a simple MLP that predicts the word from the representation space, but in CV it has to reconstruct the image at the pixel level. The Decoder in this paper therefore uses a Transformer.

Based on the above three points, the authors propose the Masked AutoEncoder (MAE) for image representation learning. If Vision Transformer (ViT) is a direct adaptation of the Transformer, MAE is a direct adaptation of BERT, and we expect CV to follow the NLP revolution brought about by BERT. Finally, Figure 2 shows the results of MAE.

Masked AutoEncoder (MAE)

MAE is a structure that reconstructs the rest of the input from a part of the input. It is an AutoEncoder (Figure 1) that projects the input into the latent space by an Encoder and from the latent space into the input space by a Decoder. We will explain each of them in this section.

The image is first divided into non-overlapping patches, and a high percentage (75%) of the patches is then randomly masked. Only the unmasked patches are input into the MAE encoder (ViT). Specifically, each unmasked patch is embedded by a linear projection, and a positional embedding is added.
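The patch splitting and random masking described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `patchify` and `random_masking`, the 224 x 224 input size, and the 16 x 16 patch size are assumptions chosen for the example.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch_size * patch_size * c)

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches; return them with their indices and the mask."""
    rng = np.random.default_rng() if rng is None else rng
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)  # True means "masked"
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

# Hypothetical input: one random 224 x 224 RGB image.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image)
visible, keep_idx, mask = random_masking(patches, 0.75, rng)
print(patches.shape)  # (196, 768)
print(visible.shape)  # (49, 768)
```

Only `visible` (49 of the 196 patches) would be passed on to the encoder; `keep_idx` records where those patches came from so that the decoder can restore the original order.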

The MAE decoder takes as input (1) the encoded patches and (2) mask tokens. The mask token is a shared, learned vector that represents a patch to be predicted. The decoder can be any model since it is used only for pre-training, but MAE uses a lightweight Transformer that requires about 1/10 of the computation of the Encoder. In addition, the Decoder outputs a 256-dimensional vector for each patch, which is reshaped to 16 x 16 and trained against the correct patch with a mean squared error (MSE) loss.
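A minimal NumPy sketch of how the decoder input could be assembled and how the MSE loss could be computed is shown below. The helper names `build_decoder_input` and `masked_mse` are hypothetical, the mask token is a plain array here rather than a learned parameter, and restricting the loss to masked patches is an assumption of this sketch (the text above only specifies an MSE loss).

```python
import numpy as np

def build_decoder_input(encoded_visible, keep_idx, num_patches, mask_token):
    """Put encoded visible patches back at their original positions and
    fill every other position with the shared mask token."""
    full = np.tile(mask_token, (num_patches, 1)).astype(float)
    full[keep_idx] = encoded_visible
    return full  # positional embeddings would be added to this sequence

def masked_mse(pred, target, mask):
    """MSE averaged over masked patches only (assumption of this sketch)."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()

# Toy example: 4 patches, 2 of them visible, embedding width 2.
mask_token = np.zeros(2)
encoded = np.array([[1.0, 1.0], [2.0, 2.0]])
keep_idx = np.array([1, 3])
decoder_in = build_decoder_input(encoded, keep_idx, 4, mask_token)
print(decoder_in)
# [[0. 0.]
#  [1. 1.]
#  [0. 0.]
#  [2. 2.]]

mask = np.array([True, False, True, False])
loss = masked_mse(np.zeros((4, 2)), np.ones((4, 2)), mask)
print(loss)  # 1.0
```

Because the mask tokens are introduced only at the decoder, the encoder never processes them, which is what keeps the encoder cheap.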


ViT-Large/16 is used as the backbone of MAE. In the experiments, the accuracy of the supervised ViT-L/16 baseline was improved from the 76.5% reported in previous studies to 82.5% by training it with stronger regularization.

Ablation experiments on ImageNet

The results of the MAE ablation experiments are shown in Table 1.

(a) shows that the depth of the decoder has only a small effect on the fine-tuning (ft) results, while linear probing (lin, which trains only the last layer) gives the best results with 8 blocks. Overall, ft gives higher accuracy than lin. (b) varies the width of the decoder: ft again gives higher accuracy than lin, and a width of 512 works well.

Interestingly, (c) shows that not feeding the masked patches to the encoder is not only less computationally expensive but also more accurate. Figure 5 shows the effect of the masking ratio: a ratio of 40% or more works well for ft, and 75% is the best for lin.
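To make the saving in (c) concrete, the small calculation below counts how many patches the encoder actually sees at a given masking ratio. The function names are hypothetical, and the cost figures are a rough approximation (cost linear in the number of tokens for per-token layers and quadratic for self-attention), not measurements from the paper.

```python
def visible_tokens(image_size=224, patch_size=16, mask_ratio=0.75):
    """Return (total patches, patches seen by the encoder)."""
    num_patches = (image_size // patch_size) ** 2
    n_visible = int(num_patches * (1 - mask_ratio))
    return num_patches, n_visible

def relative_encoder_cost(mask_ratio):
    """Rough encoder cost relative to processing every patch."""
    keep = 1 - mask_ratio
    return {"per_token_layers": keep, "self_attention": keep ** 2}

print(visible_tokens())             # (196, 49)
print(relative_encoder_cost(0.75))  # {'per_token_layers': 0.25, 'self_attention': 0.0625}
```

Under these assumptions, a 75% masking ratio leaves the encoder with a quarter of the tokens, which is consistent with the lower computational cost reported in (c).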

(d) investigates the reconstruction target and shows that predicting discrete tokens, as in dVAE, is also accurate. However, reconstructing normalized pixels gives equivalent accuracy, so the simpler pixel space is adopted.

(e) examines data augmentation and shows that simple random resized cropping is good enough for accuracy.

(f) shows that sampling the mask randomly gives good results. Figure 6 shows a visualization of the different types of masks.
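As an illustration of what "different types of masks" can mean, the sketch below builds a random mask and a regular grid mask over a 14 x 14 patch grid, both hiding 75% of the patches. The paper's comparison also includes block-wise masking, which is omitted here; the function names and the exact grid layout are assumptions made for the example.

```python
import numpy as np

def random_mask(grid=14, mask_ratio=0.75, rng=None):
    """Mask patches uniformly at random (True means "masked")."""
    rng = np.random.default_rng() if rng is None else rng
    n = grid * grid
    n_keep = int(n * (1 - mask_ratio))
    mask = np.ones(n, dtype=bool)
    mask[rng.permutation(n)[:n_keep]] = False
    return mask.reshape(grid, grid)

def grid_mask(grid=14):
    """Keep one patch (top-left) out of every 2 x 2 cell, masking the other three."""
    mask = np.ones((grid, grid), dtype=bool)
    mask[::2, ::2] = False
    return mask

r = random_mask(rng=np.random.default_rng(0))
g = grid_mask()
print(r.sum(), g.sum())  # 147 147
```

Both masks hide 147 of 196 patches; they differ only in where the visible patches sit, which is the variable that (f) studies.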


Table 2 examines the computation time and shows that training is about 4 times faster, with the same accuracy, when the depth of the Decoder is 1. Finally, Figure 7 shows that the accuracy still increases with the number of pre-training epochs.

Comparison experiment on ImageNet

Table 3 compares MAE with other methods and shows that MAE is the most accurate.

Figure 8 shows how the pre-training data of ViT affects accuracy. The most accurate representation is the one trained with JFT300M, and the closest to it is the MAE trained with ImageNet-1K. This shows the effectiveness of MAE, which is trained with less data and without using labels.

Partial Fine-tuning Experiments on ImageNet

In Figure 9, we examine the number of fine-tuning layers, and we see that fine-tuning up to four layers contributes significantly to the accuracy, indicating that the representation of the top four layers is highly correlated with the task.
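Partial fine-tuning can be pictured as freezing the lower Transformer blocks and updating only the top few. The helper below is a hypothetical, framework-free sketch that returns one trainable flag per block, assuming the 24 blocks of ViT-L.

```python
def trainable_blocks(num_blocks=24, num_finetuned=4):
    """Return one flag per block: True if the block is updated during fine-tuning.
    Only the last `num_finetuned` blocks are trainable; the rest stay frozen."""
    if not 0 <= num_finetuned <= num_blocks:
        raise ValueError("num_finetuned must be between 0 and num_blocks")
    first_trainable = num_blocks - num_finetuned
    return [i >= first_trainable for i in range(num_blocks)]

flags = trainable_blocks(24, 4)
print(flags.count(True), flags.count(False))  # 4 20
print(flags[-4:])                             # [True, True, True, True]
```

In a real training framework these flags would decide which parameters receive gradients. Setting `num_finetuned=0` trains no blocks at all (only a classification head would be trained), while `num_finetuned=24` corresponds to full fine-tuning.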

Transfer Learning Experiments on Other Data Sets

Since the goal of representation learning, such as MAE, is to adapt to downstream tasks, we conducted experiments on non-ImageNet classification tasks (Table 6), COCO object detection (Table 4), and the ADE20K segmentation task (Table 5).

These tables show that MAE has reached SOTA.


This article introduced the Masked AutoEncoder (MAE), which uses ViT to implement a BERT-like model for image recognition. It has two features: a high ratio (75%) of the patches is masked before the Encoder, which increases the task difficulty, and the Decoder uses a Transformer to make pixel-level predictions. By also incorporating techniques developed for ViT and later models, representations pre-trained on ImageNet-1K without labels reached 87.8% accuracy, outperforming the supervised ViT.

We believe that this research has made a significant contribution by showing that it is possible to learn good representations without using labels, and in particular by showing the potential of generative models in image representation learning.
