Catch up on the latest AI articles

Convolution Vs Transformer! To The Next Stage! Vision Transformer, A New Image Recognition Model

Convolution Vs Transformer! To The Next Stage! Vision Transformer, A New Image Recognition Model


3 main points. 
✔️ A new image classification model using the Transformer 
✔️ The input is characterized by segmenting the original input image and creating patches 
Records result equal to or better than state-of-the-art CNN models

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
written by 
Alexey DosovitskiyLucas BeyerAlexander KolesnikovDirk WeissenbornXiaohua ZhaiThomas UnterthinerMostafa DehghaniMatthias MindererGeorg HeigoldSylvain GellyJakob UszkoreitNeil Houlsby

(Submitted on 22 Oct 2020)
Comments: Accepted at arXiv
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
 Open In Colab 

first of all

Most of the existing image classification models have used convolutional neural networks (CNNs). However, the model presented this time does not use CNN at all, and is a model constructed using only a Transformer. The name of the model is Vision Transformer (ViT).

The transformer has now become a fact standard in the field of natural language processing since it was first announced in 2017, but its application to the field of image processing has been limited. We summarize how we have successfully adapted Transformer to the image classification task and what its accuracy is of interest.


Source: Attention Is All You Need

Transformer originally attracted attention for its own scalability. Normally, when a model is enlarged, learning may not proceed, but the Transformer has the feature that it can learn even when the model is enlarged. In addition, various studies have been conducted to improve the computational power of the Transformer.

However, the Transformer is good at sequence data, i.e. data with continuity, such as text. Moreover, Transformer calculates the similarity between elements of the input sequence, so the disadvantage of Transformer is that the number of elements to calculate becomes huge if you use it for each pixel of an image. (For example, an image with 224x224 pixels would require (224x224 )2 = 2,517,630,976 calculations.

To read more,

Please register with AI-SCHOLAR.

Sign up for free in 1 minute


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.

Contact Us