iBOT: Aiming for the BERT of Computer Vision Through Self-Distillation
3 main points
✔️ Show that the image tokenizer is important for ViT
✔️ Learn the image tokenizer jointly with ViT via self-distillation, achieving end-to-end Masked Image Modeling
✔️ Achieve SOTA on ImageNet-1K and compete with MAE for the status of "BERT of image recognition"
iBOT: Image BERT Pre-Training with Online Tokenizer
written by Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong
(Submitted on 15 Nov 2021 (v1), last revised 9 Dec 2021 (this version, v2))
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Following the great success of the Transformer (BERT) in natural language processing, research on introducing Transformers into image recognition, including the Vision Transformer, has accelerated. In particular, MAE, which is inspired by Masked Language Modeling (MLM), a key technique in BERT, was published recently and achieved high accuracy on image tasks (see "Finally, BERT in Image Recognition?! About MAE").
In MLM, the tokenizer, which projects words into a meaningful latent space, is a very important component. Similarly, in Masked Image Modeling (MIM), we need to ask what kind of tokenizer can project image patches into a meaningful latent space. In particular, transforming continuous and redundant image pixels into tokens with high-level semantics is harder than modeling words, which are already meaningful units.
This article is about iBOT, which was released four days after MAE and tackles some of the difficulties of the MIM tokenizer.
Whereas the prior work BEiT used a pre-trained discrete VAE (dVAE) as its tokenizer, iBOT proposes an online tokenizer based on a self-distillation framework to achieve end-to-end MIM. Details of the model come later; Figure 1 shows the headline results first. iBOT's ImageNet Top-1 accuracy is higher than that of DINO, a method known for its surprisingly clean attention maps.
About Masked Image Modeling
MIM samples a random mask $m \in \{0,1\}^N$ with masking ratio $r$ for the image token sequence $x = \{x_i\}_{i=1}^N$, where $N$ is the number of tokens. Each masked token (with $m_i = 1$) is replaced by a mask token $e_{[\mathrm{MASK}]}$ to obtain the corrupted sequence $\hat{x}$. The objective of MIM is to recover the original image tokens from the masked sequence, which is defined in BEiT as in equation (1):

$$\mathcal{L}_{\mathrm{MIM}} = -\sum_{i=1}^{N} m_i \cdot P_{\phi}(x_i)^{\top} \log P_{\theta}(\hat{x}_i) \tag{1}$$
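As a rough sketch of this masking step (not the authors' implementation; the token shapes, the zero mask embedding, and the sampling details below are illustrative assumptions):

```python
import numpy as np

def mask_tokens(tokens, r, mask_token, seed=None):
    """Replace a random fraction r of the N patch tokens with a shared
    mask embedding, returning the corrupted sequence and the 0/1 mask m."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    m = np.zeros(n, dtype=bool)
    m[rng.choice(n, size=int(r * n), replace=False)] = True
    corrupted = tokens.copy()
    corrupted[m] = mask_token  # x̂_i = e_[MASK] where m_i = 1
    return corrupted, m
```

The mask `m` is exactly the weight that selects which positions contribute to the MIM loss in equation (1).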
Here $P(\cdot)$ is the model that transforms the input into a $K$-dimensional probability distribution, and $\phi$ and $\theta$ are the parameters of the discrete VAE and of the model we want to train, respectively. The discrete VAE maps image patches into one of $K$ categories.
Different choices of $P_{\phi}$ encode different prior knowledge. For example, BEiT uses a pre-trained discrete VAE, while iBOT uses a self-distilled $P_{\theta'}$: the output of the model itself is used as the teacher signal to train the model. Self-distillation is performed simply by preparing two models with the same network but different parameters. Specifically, two data augmentations of the input image $x$ yield the views $u$ and $v$. The self-distillation loss is defined with the respective predictions $P_{\theta'}(v)$ and $P_{\theta}(u)$ as in equation (2):

$$\mathcal{L} = -P_{\theta'}(v)^{\top} \log P_{\theta}(u) \tag{2}$$
The two views can predict each other because they are obtained from the same input image $x$. Here, the student and teacher networks share the same architecture but have different parameters $\theta$ and $\theta'$, and the teacher parameters $\theta'$ are an exponential moving average (EMA) of the student parameters $\theta$.
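The EMA teacher update and the distillation loss above can be sketched as follows (a minimal NumPy illustration; the temperature values, momentum, and function names are assumptions for illustration, not the paper's exact settings):

```python
import numpy as np

def log_softmax(z, tau):
    """Numerically stable log-softmax with temperature tau."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distill_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student distributions, eq. (2):
    the teacher output acts as a soft target; in training, gradients
    would flow only through the student."""
    p_t = np.exp(log_softmax(teacher_logits, tau_t))   # P_{θ'}(v)
    log_p_s = log_softmax(student_logits, tau_s)       # log P_{θ}(u)
    return -(p_t * log_p_s).sum(axis=-1).mean()

def ema_update(teacher_params, student_params, momentum=0.996):
    """θ' ← λ·θ' + (1 − λ)·θ: the teacher is never updated by gradients."""
    return [momentum * t + (1 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

Matching predictions give a low loss, diverging ones a high loss, which is what drives the two views toward a shared token semantics.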
About iBOT's architecture
Figure 3 shows an overall view of iBOT. In iBOT, self-distillation plays the role of the tokenizer; in other words, iBOT learns MIM through self-distillation. Specifically, the views u and v, obtained from image x by data augmentation, are fed to the teacher and student networks. Each network consists of a backbone and a projection head, but the teacher network is not updated by gradient descent: its parameters are an exponential moving average (EMA) of the student's parameters. The teacher's output thus serves as a continuous feature distribution over the image.
iBOT has two objective functions. The first is the self-distillation loss of equation (2), which predicts the classification ([CLS]) token across the two views. The second is shown in equation (3): using the output of the teacher network on the uncorrupted view as the target, the student network recovers the masked patch tokens:

$$\mathcal{L}_{\mathrm{MIM}} = -\sum_{i=1}^{N} m_i \cdot P_{\theta'}^{\mathrm{patch}}(u_i)^{\top} \log P_{\theta}^{\mathrm{patch}}(\hat{u}_i) \tag{3}$$
Later experiments show that accuracy is better when parameters are shared between the [CLS] token head and the patch recovery head. It is also shown that iBOT is more accurate when the token distribution after softmax is used as the teacher signal instead of a one-hot token id.
Experimental results on ImageNet-1K
Five metrics are used to evaluate the quality of the representations learned by iBOT.
Table 1 shows the results of k-NN and linear probing. k-NN reports the accuracy when the feature vectors from the frozen backbone are used for k-nearest-neighbor classification; linear probing reports the accuracy when the backbone is frozen and a one-layer linear classifier is trained on top. When the backbone is ViT-S/16 or ViT-B/16, both k-NN and linear probing outperform DINO (+~1.3%). In addition, as the last row shows, linear probing reaches 81.6% after pre-training on ImageNet-22K data.
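The k-NN evaluation protocol can be sketched as follows (a simplified illustration using cosine-similarity voting on frozen features; the uniform vote weighting and the value of k are simplifying assumptions, not the exact protocol of the paper):

```python
import numpy as np

def knn_predict(train_feats, train_labels, test_feats, k=20):
    """Classify test samples by majority vote among the k most
    cosine-similar training features from a frozen backbone."""
    # Normalize to unit length so the dot product equals cosine similarity.
    tr = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    te = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = te @ tr.T                       # (n_test, n_train) similarities
    idx = np.argsort(-sims, axis=1)[:, :k] # k nearest neighbors per sample
    preds = []
    for row in idx:
        votes = np.bincount(train_labels[row])
        preds.append(votes.argmax())
    return np.array(preds)
```

Because the backbone is frozen, this protocol measures the quality of the learned features directly, with no trainable classifier at all.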
Table 2 shows the results of fine-tuning on ImageNet-1K, and Table 3 shows the results after pre-training on ImageNet-22K. From Table 2, iBOT obtains 82.3% accuracy with a ViT-S/16 backbone and 83.8% with ViT-B/16, which is higher than MAE's 83.6%. Table 3 also shows that pre-training on ImageNet-22K contributes to the improvement in accuracy.
Table 4 shows the results of semi-supervised learning, which measures label efficiency because only a fraction of the labels (1%, 10%) is used for fine-tuning. Table 4 shows that iBOT is more accurate than DINO (the previous SOTA) under all conditions.
Furthermore, Table 5 shows the results of unsupervised learning, using the standard metrics Accuracy (ACC), Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Fowlkes-Mallows Index (FMI). iBOT outperforms the previous SOTA (DINO) by 2.0% in ACC and 1.8% in NMI.
In summary, the results of iBOT on ImageNet-1K suggest that MIM can extract good, visually meaningful features.
Experimental results with Downstream Tasks
The goal of MIM is to learn representations with good accuracy in diverse tasks. Here, we present experimental results on object detection and instance segmentation on the COCO dataset and the semantic segmentation task on ADE20K in Table 6.
iBOT achieves better accuracy (+0.8%~3.8%) than all of the comparison methods, including MAE. Moreover, the fact that iBOT outperforms the supervised learning baseline suggests that MIM and other self-supervised learning methods have reached a practical level.
Finally, Table 7 shows the accuracy of transfer learning on diverse datasets; iBOT's strong results across so many datasets speak to its superiority.
About the nature of iBOT
What feature representations has the patch token learned through MIM? Answering this question tells us whether a meaningful tokenizer has been learned. Figure 4 shows an example visualization, but the paper contains a wealth of analytical experiments with some very interesting insights. If you are interested, please refer to Section 4.3 and the Appendix of the paper.
Figure 4 clusters the patch-token probability distributions on the ImageNet-1K validation data and visualizes some of the central patterns. The two patches on the left, a headlight and a dog's ear, belong to semantically close classes. The two on the right are patches with similar patterns, which suggests that iBOT was able to learn information about textures.
In this paper, we propose iBOT, a Masked Image Modeling (MIM) model for Vision Transformer, and focus on the importance of tokenizers that can capture the semantics of images. Unlike BEiT's discrete VAEs, iBOT proposes a framework for learning tokenizers through self-distillation and shows its effectiveness through a large number of experiments.
The thought that a simple combination of iBOT and MetaFormer (an "embarrassingly simple" Vision Transformer) could yield a powerful yet lightweight image recognition model makes me even more excited about future developments.