[Swin Transformer] A Transformer-based Image Recognition Model Worth Knowing Now!

Image Recognition 22/03/2024

3 main points
✔️ Describes the Swin Transformer, which is often used as a baseline in recent computer vision research
✔️ Unlike the Vision Transformer, which computes the relevance (Attention) between all patches, the Swin Transformer computes Attention only within a window of neighboring patches
✔️ Computes Attention at different patch sizes, so features at various scales can be obtained

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
written by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
(Submitted on 25 Mar 2021 (v1), last revised 17 Aug 2021 (this version, v2))
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Swin Transformer is a Transformer-based image recognition model published in 2021. Since it is often used as a baseline in recent computer vision research, we would like to reintroduce what kind of model it is.

Before the Swin Transformer, the representative Transformer-based image recognition model was the Vision Transformer (ViT). ViT was one of the first models to apply the Transformer, originally used in natural language processing, to image recognition. It splits an image into patches of 16 x 16 pixels and treats each patch as a word in a sentence.

The paper presented in this article points out the difference between text and images and proposes the Swin Transformer, which adapts ViT more to the image domain.

The paper highlights two differences between text and images:

Visual elements in images, unlike word tokens, vary widely in scale
Pixels in an image have higher resolution (more information) than words in a document

To accommodate these differences, the paper proposes the following:

Calculate Attention for different patch sizes
Calculate Attention with smaller patch sizes

The figure below illustrates the differences between ViT and Swin Transformer in these respects.

Computing Attention with a small patch size allows for finer features, but it is also computationally expensive.

Therefore, Swin Transformer introduces Shifted Window based Self-Attention. Several patches are combined into a single window, and Attention calculation is performed only in that window to reduce the amount of calculation.

In the next section, we will look at the overall picture of the Swin Transformer and then at some of the finer details, including Shifted Window based Self-Attention.

Swin Transformer

Big picture

Here is an overall view of the Swin Transformer.

First, a Patch Partition is performed on the input image.

Patch Partition: Treats each block of 4 x 4 pixels as a single patch. Since ViT uses 16 x 16 pixels as a single patch, the Swin Transformer can extract finer-grained features.

Next, Linear Embedding is performed.

Linear Embedding: Converts the raw values of a patch (4 x 4 pixels x 3 channels = 48 values) into a C-dimensional token with a linear layer, where C depends on the size of the model.
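The two steps above can be sketched in plain Python. This is only a toy illustration under stated assumptions: the image is a nested list, the helper names `patch_partition` and `linear_embedding` are hypothetical, the projection has no bias term, and the weight values and C = 6 are made up for the example.

```python
def patch_partition(image, patch=4):
    """Split an H x W x 3 image (nested lists) into non-overlapping
    patch x patch blocks and flatten each block into one vector."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "sketch assumes divisible sizes"
    grid = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            vec = [value
                   for y in range(top, top + patch)
                   for x in range(left, left + patch)
                   for value in image[y][x]]
            row.append(vec)
        grid.append(row)
    return grid


def linear_embedding(grid, weight):
    """Project every flattened patch to C dimensions.
    weight has shape (patch_dim, C); the bias is omitted for brevity."""
    out_dim = len(weight[0])
    return [[[sum(v * weight[i][j] for i, v in enumerate(vec))
              for j in range(out_dim)]
             for vec in row]
            for row in grid]


# Toy 8 x 8 RGB image whose pixel values are all 1.0.
image = [[[1.0, 1.0, 1.0] for _ in range(8)] for _ in range(8)]
patches = patch_partition(image)            # 2 x 2 grid of 48-dimensional vectors
weight = [[0.5] * 6 for _ in range(48)]     # hypothetical 48 -> C projection, C = 6
tokens = linear_embedding(patches, weight)  # 2 x 2 grid of 6-dimensional tokens

print(len(patches), len(patches[0]), len(patches[0][0]))  # 2 2 48
print(len(tokens), len(tokens[0]), len(tokens[0][0]))     # 2 2 6
```

An H x W image therefore becomes an (H/4) x (W/4) grid of C-dimensional tokens before the first Swin Transformer Block.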

Attention is calculated for the tokens obtained from each patch using the Swin Transformer Block, and feature extraction is performed.

Swin Transformer Block: The Multi-head Self-Attention (MSA) used in the regular Transformer Block is replaced with Shifted Window based Self-Attention (W-MSA and SW-MSA), which is described in more detail in the next section. Other than that, the structure is almost identical to that of the regular Transformer.

The Linear Embedding and the Swin Transformer Blocks that follow it are together called Stage 1. There are four stages in total (Stages 1 to 4), and each works with a different effective patch size, allowing feature extraction at various scales. The patch size differs between stages because Patch Merging aggregates neighboring patches at the start of Stages 2 to 4.

Patch Merging: At the start of each of Stages 2 to 4, neighboring (2 × 2) patches (tokens) are merged into a single token. Specifically, the features of the 2 × 2 tokens are concatenated, and the resulting 4C-dimensional vector is reduced to 2C dimensions by a linear layer. For example, in Stage 2, the (H/4) × (W/4) × C feature map becomes (H/8) × (W/8) × 2C.
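As a rough sketch of Patch Merging, the following plain-Python example concatenates each 2 × 2 group of tokens and applies a linear layer. The function name `patch_merging`, the concatenation order, and the weight values are assumptions made for illustration, and any normalization or bias an actual implementation may apply is left out.

```python
def patch_merging(tokens, weight):
    """Merge each 2 x 2 neighborhood of C-dimensional tokens into one token.

    The four tokens are concatenated into a 4C-dimensional vector and then
    projected with weight, which has shape (4C, 2C).
    """
    rows, cols = len(tokens), len(tokens[0])
    assert rows % 2 == 0 and cols % 2 == 0, "sketch assumes an even grid"
    out_dim = len(weight[0])
    merged = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            concat = (tokens[r][c] + tokens[r + 1][c]
                      + tokens[r][c + 1] + tokens[r + 1][c + 1])
            out_row.append([sum(v * weight[i][j] for i, v in enumerate(concat))
                            for j in range(out_dim)])
        merged.append(out_row)
    return merged


C = 3                                                        # toy channel count
tokens = [[[1.0] * C for _ in range(4)] for _ in range(4)]   # 4 x 4 grid of tokens
weight = [[0.25] * (2 * C) for _ in range(4 * C)]            # hypothetical 4C -> 2C weights
merged = patch_merging(tokens, weight)

print(len(merged), len(merged[0]), len(merged[0][0]))        # 2 2 6
```

The grid halves in each spatial direction while the channel dimension doubles, which is what produces the hierarchy of scales across the stages.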

In the next section, we will take a closer look at the Attention calculation in the Swin Transformer Block.

Shifted Window based Self-Attention

The difference between the Attention calculation for the regular Transformer and the Swin Transformer Block is explained in terms of computational complexity.

The regular Transformer calculates Attention between all tokens. When h and w are the numbers of patches along the vertical and horizontal directions of the image and C is the token dimension, the computational complexity given in the paper is:

Ω(MSA) = 4hwC² + 2(hw)²C

On the other hand, the Swin Transformer calculates Attention only within a window consisting of multiple patches. A window contains M x M patches, and M = 7 by default. The computational complexity given in the paper is:

Ω(W-MSA) = 4hwC² + 2M²hwC

In the regular Transformer, the computational complexity grows with the square of the number of patches (hw). In the Swin Transformer, the complexity also grows with the square of M, but since M is fixed (7 by default), this factor is a constant, and the complexity grows only linearly with the number of patches (hw). This is what makes Attention calculation with small patch sizes feasible in the Swin Transformer.
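To make the difference concrete, the sketch below evaluates the two complexity formulas from the paper, Ω(MSA) = 4hwC² + 2(hw)²C and Ω(W-MSA) = 4hwC² + 2M²hwC. The function names are hypothetical, and the example sizes (a 56 x 56 patch grid with C = 96, which corresponds to a 224 x 224 input with 4 x 4 patches) are chosen only for illustration.

```python
def msa_cost(h, w, C):
    """Cost of global self-attention over an h x w grid of C-dimensional tokens."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C


def wmsa_cost(h, w, C, M=7):
    """Cost of window-based self-attention with M x M patches per window."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C


h = w = 56   # assumed patch grid: 224 / 4 = 56 patches per side
C = 96       # assumed token dimension

global_cost = msa_cost(h, w, C)
window_cost = wmsa_cost(h, w, C)
print(global_cost, window_cost, round(global_cost / window_cost, 1))

# Doubling the image side quadruples hw: the windowed cost grows by exactly 4x,
# while the global cost grows by much more because of the (hw)^2 term.
print(wmsa_cost(2 * h, 2 * w, C) / window_cost)
print(msa_cost(2 * h, 2 * w, C) / global_cost)
```

With these sizes the global version costs roughly 14 times as much as the windowed version, and the gap widens as the image gets larger.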

Next, we explain how the image is divided into windows. Windows are arranged so that the image is evenly divided into non-overlapping blocks of M x M patches. Attention is calculated separately within each window, so two patches that are adjacent but belong to different windows never attend to each other. To address this window boundary problem, after the first Attention calculation (W-MSA: Window based Multi-Head Self-Attention), the windows are shifted and Attention is calculated again (SW-MSA: Shifted Window based Multi-Head Self-Attention).

The windows are shifted by (⌊M/2⌋, ⌊M/2⌋) from the original window division, measured in positions on the patch grid, as shown in the following figure.
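The effect of the shift can be checked with a small sketch that labels every patch with the window it belongs to. The helper names `window_ids` and `count_windows` are hypothetical, and the 8 x 8 grid with M = 4 is a toy setting chosen so the result is easy to inspect.

```python
def window_ids(h, w, M, shift=0):
    """Return, for every patch on an h x w grid, the (row, col) index of its window.

    shift=0 gives the regular partition used by W-MSA; shift=M // 2 gives the
    displaced partition used by SW-MSA. Assumes 0 <= shift < M.
    """
    offset = (M - shift) % M
    return [[((r + offset) // M, (c + offset) // M) for c in range(w)]
            for r in range(h)]


def count_windows(ids):
    """Number of distinct windows in a grid of window indices."""
    return len({wid for row in ids for wid in row})


h = w = 8   # toy patch grid
M = 4       # toy window size (the default described above is M = 7)
regular = window_ids(h, w, M)
shifted = window_ids(h, w, M, shift=M // 2)

# Patches (3, 3) and (4, 4) touch diagonally but sit in different regular windows.
print(regular[3][3] == regular[4][4])                   # False
# After the shift they fall into the same window, so Attention connects them.
print(shifted[3][3] == shifted[4][4])                   # True
print(count_windows(regular), count_windows(shifted))   # 4 9
```

Alternating the two partitions lets information cross the boundaries of the regular windows, at the price of more (and partly smaller) windows in the shifted layout.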

Efficient batch computation for shifted configuration

In SW-MSA, the windows along the image border are smaller than M x M, and the number of windows increases. If they were processed naively, the amount of computation would increase compared to W-MSA. Therefore, SW-MSA does not actually change the arrangement of windows; instead, it obtains an equivalent result using a method called cyclic shift.

As shown in the figure below, the entire feature map is shifted toward the upper left, and the parts that are pushed out are wrapped around into the vacated areas at the bottom and right (cyclic shift). After this, Attention can be computed with the same window layout as W-MSA. Since a window may now contain patches that were not adjacent in the original image, Attention between those patches is masked. Finally, the reverse operation (reverse cyclic shift) is applied to the output to return the patches to their original positions.
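The cyclic shift and the masking can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the helper names are hypothetical, the grid is a toy 8 x 8 with M = 4, and the mask is expressed as booleans (True means the pair of patches may attend to each other).

```python
def shifted_region_ids(h, w, M, shift):
    """Label each patch with the shifted window it belongs to (0 <= shift < M)."""
    offset = (M - shift) % M
    return [[((r + offset) // M, (c + offset) // M) for c in range(w)]
            for r in range(h)]


def cyclic_shift(grid, shift):
    """Roll a 2D grid up and to the left by shift positions with wrap-around.
    A negative shift rolls down and to the right, which undoes the operation."""
    rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rows]


def attention_masks(h, w, M, shift):
    """After the cyclic shift, cut the grid into regular M x M windows and allow
    Attention only between patches that share the same shifted window."""
    rolled = cyclic_shift(shifted_region_ids(h, w, M, shift), shift)
    masks = {}
    for top in range(0, h, M):
        for left in range(0, w, M):
            ids = [rolled[r][c]
                   for r in range(top, top + M)
                   for c in range(left, left + M)]
            masks[(top // M, left // M)] = [[a == b for b in ids] for a in ids]
    return masks


h = w = 8
M = 4
shift = M // 2
masks = attention_masks(h, w, M, shift)

print(len(masks))                               # 4 windows, the same count as W-MSA
print(all(all(row) for row in masks[(0, 0)]))   # True: top-left window needs no mask
print(all(all(row) for row in masks[(1, 1)]))   # False: bottom-right mixes regions

grid = [[r * w + c for c in range(w)] for r in range(h)]
restored = cyclic_shift(cyclic_shift(grid, shift), -shift)
print(restored == grid)                         # True: reverse cyclic shift
```

The number of windows stays at four instead of growing to nine, and only the windows that contain wrapped-around patches need a mask.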

Architecture Variants

The Swin Transformer is available in T, S, B, and L sizes, and the number of dimensions (dim), heads (head), and blocks at each stage differ as shown in the table below.
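For reference, the variants can be written down as a small configuration sketch. The channel counts C and the numbers of blocks per stage are the values reported in the paper, and the number of heads is derived from the paper's per-head dimension of 32; the dictionary layout and the helper name `stage_config` are assumptions made for this example.

```python
# Base channel count C and number of blocks in Stages 1 to 4, as reported in the paper.
VARIANTS = {
    "Swin-T": {"C": 96, "depths": (2, 2, 6, 2)},
    "Swin-S": {"C": 96, "depths": (2, 2, 18, 2)},
    "Swin-B": {"C": 128, "depths": (2, 2, 18, 2)},
    "Swin-L": {"C": 192, "depths": (2, 2, 18, 2)},
}
HEAD_DIM = 32  # dimension of each attention head


def stage_config(name):
    """Return (dim, heads, blocks) for each of the four stages of a variant.
    The dimension doubles at every stage because of Patch Merging."""
    variant = VARIANTS[name]
    dims = [variant["C"] * 2**i for i in range(4)]
    heads = [dim // HEAD_DIM for dim in dims]
    return list(zip(dims, heads, variant["depths"]))


for name in VARIANTS:
    print(name, stage_config(name))
```

For example, Swin-T uses dimensions 96, 192, 384, and 768 with 3, 6, 12, and 24 heads across its four stages.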

Experiment

Comparisons with other models have been performed on ImageNet-1K for the image recognition task, COCO for the object detection task, and ADE20K for the semantic segmentation task. The paper reports strong results on all three, including state-of-the-art results on COCO and ADE20K at the time of publication. (Detailed experimental results can be found in Tables 1 to 3 in Section 4 of the paper.)

An ablation study on the shifted windows has also been conducted, and it confirms that accuracy on the evaluated tasks is higher when SW-MSA is used together with W-MSA than when W-MSA alone is used.

Summary

This article introduced the Swin Transformer, which is used as a baseline in computer vision research.

Unlike ViT, which calculates Attention among all patches, the Swin Transformer calculates Attention within windows of neighboring patches and repeats Attention calculation and patch aggregation, which makes it possible to extract features at various scales. Another advantage over ViT is that, because Attention is not computed among all patches, the amount of computation is reduced and features can be extracted from smaller patches.

We hope this will help those who have recently started learning about AI to understand it.