
[MobileViT] Lightweight ViT That Can Be Used With Mobile Phones


Image Recognition

3 main points
✔️ A lightweight model that combines the strengths of CNNs and ViT and is mobile-friendly
✔️ Doesn't require the complex data augmentation needed to train ViT
✔️ Can be used as a backbone for a variety of tasks, achieving SOTA in all experiments

MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
written by Sachin Mehta, Mohammad Rastegari
(Submitted on 5 Oct 2021)
Comments: Published on arxiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Since it was first proposed for machine translation, the Transformer has been applied to a wide range of natural language processing tasks, including BERT, and the Vision Transformer (ViT), its adaptation to image recognition, continues to produce SOTA results on various tasks.

However, ViT lacks the mechanism CNNs have for capturing local features, so it requires large datasets and more parameters than CNNs to surpass their performance, which has been a bottleneck for using ViT on mobile devices. On the other hand, ViT's self-attention mechanism can model global relationships, and it is attracting attention as an alternative architecture to CNNs.

MobileViT is a lightweight hybrid of CNNs, which excel at capturing local features, and ViT, which excels at processing global information. It not only achieves better performance than CNNs with fewer parameters, but also reaches high accuracy with only basic data augmentation. The paper also reports that, among models with similar parameter counts, MobileViT performs best as a backbone for tasks such as object detection and semantic segmentation.

Paper Outline

MobileViT differs from conventional ViT in two ways.

  • Accuracy with the same number of parameters as a CNN.
  • No need for extensive data augmentation.

Accuracy with the same number of parameters as a CNN.

The attention mechanism is known to attend to information in distant patches, but it lacks the convolutional layer's ability to aggregate local information, so it requires more parameters to reach the same level of performance as a CNN. For example, in semantic segmentation, the ViT-based DPT needs 345M parameters to reach the level of performance that the CNN-based DeepLabv3 achieves with only 59M. MobileViT, however, matches and even exceeds CNN accuracy at CNN-scale parameter counts.

No need for extensive data augmentation.

ViT tends to overfit and requires diverse data augmentation. It is also sensitive to L2 regularization and difficult to train. However, this paper shows that basic augmentation alone, such as random resized cropping and horizontal flipping, is sufficient to obtain good results.

Model Architecture

The architecture of MobileViT is shown below. MV2 denotes a MobileNetv2 block, and ↓2 indicates downsampling. Local features are captured by the MV2 blocks, while global relationships are captured by the newly proposed MobileViT block.

MobileViT Block

The MobileViT block performs three steps.

  1. Local representations
  2. Transformer as Convolutions
  3. Fusion

Local Representations

The input tensor X ∈ R^(H×W×C) is passed through an n×n standard convolutional layer and then a point-wise (1×1) convolutional layer, producing a higher-dimensional tensor XL ∈ R^(H×W×d), where d is greater than C. In the implementation code, a 3×3 convolutional layer with padding 1 is used as the n×n convolution.
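To make the shapes concrete, here is a minimal NumPy sketch of the point-wise channel expansion. The sizes and random weights are illustrative, not from the paper, and the preceding 3×3 convolution is omitted for brevity:

```python
import numpy as np

# Illustrative sizes only; d > C as in the paper.
H, W, C, d = 4, 4, 16, 32

X = np.random.rand(H, W, C).astype(np.float32)

# A point-wise (1x1) convolution is a per-pixel linear map over the
# channel dimension: (H, W, C) @ (C, d) -> (H, W, d).
W_pw = np.random.rand(C, d).astype(np.float32)
XL = X @ W_pw

print(XL.shape)  # (4, 4, 32)
```

Because the same weight matrix is applied at every spatial location, the operation only mixes channels and never spatial positions, which is exactly what a 1×1 convolution does.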

Transformer as Convolutions

The tensor XL, which now encodes local information, is split into non-overlapping patches and reshaped into XU ∈ R^(P×N×d), where N is the number of patches and P = wh is the number of pixels in a patch of width w and height h. The Transformer is then applied across patches, independently for each pixel position within a patch.

This is illustrated in the figure below. The arrows pointing from the red pixel to blue pixels in other patches correspond to the global relationships captured by the Transformer, while each blue pixel already encodes its local neighborhood thanks to the preceding local representations step.

The resulting XG ∈ R^(P×N×d) is then folded back into XF ∈ R^(H×W×d).


The folded tensor XF is reduced back to C channels by a point-wise convolution, giving a tensor in R^(H×W×C). This is concatenated with the original input X ∈ R^(H×W×C) along the channel dimension to form a tensor in R^(H×W×2C), which an n×n convolution then fuses back down to R^(H×W×C).
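The fusion step above can be sketched in NumPy as follows, with the point-wise convolution again written as a per-pixel matmul and the final n×n convolution omitted (illustrative sizes and random weights, not the paper's implementation):

```python
import numpy as np

# Illustrative sizes, not from the paper.
H, W, C, d = 4, 4, 16, 32

X  = np.random.rand(H, W, C).astype(np.float32)  # original block input
XF = np.random.rand(H, W, d).astype(np.float32)  # folded transformer output

# 1) Point-wise conv reduces XF from d back to C channels.
W_pw = np.random.rand(d, C).astype(np.float32)
Y = XF @ W_pw                                    # (H, W, C)

# 2) Concatenate with the original input along channels: (H, W, 2C).
Z = np.concatenate([X, Y], axis=-1)
print(Z.shape)  # (4, 4, 32)

# 3) An n x n convolution (omitted here) then fuses Z back to (H, W, C).
```

The concatenation acts like a residual path: the n×n fusion convolution can see both the original local features and the globally mixed ones.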

The authors regard the MobileViT block as a Transformer that performs the same function as a convolutional layer ("Transformers as Convolutions"). The reasoning is that a convolutional layer also unfolds features into parts, performs matrix operations, and folds the results back together, which is exactly the pattern the MobileViT block follows.


Comparison with CNN (Accuracy)

The models were trained from scratch on ImageNet-1k and compared as follows. MobileViT achieves better accuracy than lightweight CNN models with a similar number of parameters, and even surpasses heavier CNN models while using fewer parameters. It is also 2.1% more accurate than the much-discussed EfficientNet of 2019.

Comparison with Other ViTs (Accuracy)

Compared with various ViT variants, MobileViT is clearly efficient: it achieves the best performance despite having the fewest parameters, and it does so using only basic data augmentation, namely random resized cropping and horizontal flipping.
PiT, by contrast, relies on advanced augmentation methods (R4 and R17). The fact that MobileViT outperforms it with only basic augmentation shows how easy it is to use.

Using MobileViT as a Backbone

Object Detection

In object detection, too, MobileViT achieves the best accuracy despite having the fewest parameters.

The comparison uses SSDLite, a lighter variant of SSD whose head replaces standard convolutions with separable convolutions. The results of fine-tuning on the MS-COCO dataset, using various ImageNet-1k-pretrained models as the backbone, are shown below.

Semantic Segmentation

Semantic segmentation also yields the best results with the fewest parameters.

The experiments use DeepLabv3. An ImageNet-1k-pretrained model serves as the backbone and is fine-tuned on PASCAL VOC 2012, with mIOU (mean intersection over union) as the metric. The results below show that MobileViT achieves performance comparable to ResNet-101 despite being about one-tenth its size.

Multiscale Sampler for Training Efficiency

In conventional ViTs, images of different resolutions are handled mainly through fine-tuning: because DeiT must interpolate its positional embeddings whenever the image size changes, training is performed at a fixed 224×224 resolution.

MobileViT uses no positional embeddings, so it needs no such fine-tuning; instead, training is performed with a multi-scale sampler that trains at multiple resolutions.

The multiscale sampler changes the batch size according to the resolution so as to use the GPU efficiently. Given a sorted set of image resolutions S = {(H1, W1), ..., (Hn, Wn)}, where (Hn, Wn) is the largest, and a batch size b for that largest resolution, the batch size at the t-th iteration with resolution (Ht, Wt) is bt = Hn·Wn·b / (Ht·Wt).
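The idea is that each batch processes roughly the same number of pixels regardless of resolution, so smaller images get proportionally larger batches. A short sketch (the resolutions and base batch size below are illustrative, not the paper's training configuration):

```python
import math

def multiscale_batch_size(h_t, w_t, h_max, w_max, base_batch):
    """Batch size at resolution (h_t, w_t), scaled so each batch
    covers about as many pixels as a batch of base_batch images at
    the largest resolution (h_max, w_max)."""
    return max(1, math.floor(h_max * w_max * base_batch / (h_t * w_t)))

# Example: base batch 128 at 320x320; smaller images get bigger batches.
for h, w in [(160, 160), (224, 224), (320, 320)]:
    print((h, w), multiscale_batch_size(h, w, 320, 320, 128))
# (160, 160) -> 512, (224, 224) -> 261, (320, 320) -> 128
```

Keeping the pixel count per batch roughly constant keeps GPU memory usage and utilization stable across the sampled resolutions.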

Processing Speed on Mobile Devices

The experiments compare models converted to CoreML with CoreMLTools, reporting the average of 100 runs of the full-precision trained MobileViT on an iPhone 12.
MobileViT and the other ViTs have slower inference than MobileNetv2 for two reasons.

  • Unlike on GPUs, where dedicated CUDA kernels make Transformers scalable and efficient, no such optimized kernels exist for Transformers on mobile devices.
  • CNNs benefit from hardware-level optimizations such as fusing batch normalization with convolutional layers, whereas Transformers have no equivalent optimization.


ViT has been thought to require more parameters than CNNs because it cannot capture local features, but the proposed MobileViT achieves better performance with a comparable number of parameters. It also performs well with only basic data augmentation and serves as an excellent backbone for object detection and semantic segmentation, so future research may well build on MobileViT. However, it still suffers from slower inference than CNNs, and research on reducing the computational cost of Transformers in NLP may be applied to address this.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
