U-Net And Transformer Combined! Introducing Swin Unet, A New Network For Medical Image Segmentation.


3 main points
✔️ CNNs have made breakthroughs in medical image analysis, but they cannot learn global information due to convolutional operations.
✔️ Since Transformer can learn global information, we propose a Transformer-based U-Net, Swin-Unet, in this paper.
✔️ Validated on a multi-organ segmentation task, we demonstrate that Swin-Unet outperforms CNN-based U-Net.

Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation
written by Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, Manning Wang
(Submitted on 12 May 2021)
Comments: Published on arxiv.

Subjects:  Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.


In the last few years, convolutional neural networks (CNNs) have made breakthroughs in medical image analysis. In segmentation tasks in particular, U-Net is widely used in deep learning for medical imaging.

However, it has been pointed out that CNNs cannot learn global features well due to the local nature of convolutional operations. In comparison, Transformer learns global features, which has led to its use in the medical field.

In this study, we proposed a Transformer-based U-Net, Swin-Unet, for multi-organ segmentation. The results demonstrate that Swin-Unet outperforms CNN and Transformer+CNN methods.


Recently, a network called Transformer has been reported in the natural language field and is rapidly gaining popularity due to its high performance. The core idea of Transformer is the Attention mechanism: in a translation task, it identifies which words in the input are associated with which words in the output and calculates how important each word is in the sentence. Since the output is determined by considering the importance of each word in the whole sentence, the Transformer is very good at learning global information. The architecture of the Transformer is shown below.
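To make the Attention idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. It is an illustration, not the implementation from the paper: the token count, the vector sizes, and the random projection matrices `w_q`, `w_k`, `w_v` are all assumptions chosen for the example, and it shows self-attention within one sequence rather than attention between an input and an output sentence.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token vectors."""
    q = tokens @ w_q                      # queries, shape (n, d_k)
    k = tokens @ w_k                      # keys,    shape (n, d_k)
    v = tokens @ w_v                      # values,  shape (n, d_k)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)       # similarity of every token pair, (n, n)
    weights = softmax(scores, axis=-1)    # importance of each token for each query
    return weights @ v, weights

# Hypothetical sizes: 5 tokens ("words"), 8 features each, projected to 4.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))

out, weights = self_attention(tokens, w_q, w_k, w_v)
print(out.shape)       # (5, 4)
print(weights.shape)   # (5, 5): every token attends to every other token
```

Each row of `weights` sums to 1 and covers the whole sequence, which is why the output for one token can depend on any other token, however far away.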

In contrast, the core idea of CNN in the image field is convolution. Convolution aggregates information from a small neighborhood of pixels in the image: a small filter slides over the image, and each output value is computed only from the pixels under the filter. In other words, the output is determined by aggregating and accumulating local information such as changes in shape and color. Therefore, CNN is good at aggregating local information but, conversely, struggles to aggregate information between distant parts of an image.
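The local nature of convolution can be checked with a small sketch. The code below is an assumed toy setup (an 8x8 image and a 3x3 averaging filter, neither taken from the paper): it changes one pixel in the far corner and shows that only the output value whose filter window covers that pixel changes.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D cross-correlation without padding, as used in CNN layers."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output value only sees the kh x kw window under the filter.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((8, 8))
kernel = np.ones((3, 3)) / 9.0           # simple averaging filter
base = conv2d_valid(image, kernel)

changed = image.copy()
changed[7, 7] = 1.0                      # modify one pixel in the far corner
after = conv2d_valid(changed, kernel)

print(after[0, 0] == base[0, 0])         # True: the top-left output is unaffected
print(np.count_nonzero(after - base))    # 1: only the window covering (7, 7) changed
```

Stacking many layers widens the region each output can see, but a single convolution never relates two distant pixels directly, which is the limitation the article refers to.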

Although the CNN and Transformer approaches were quite different, the Attention of Transformer was later applied to the field of imaging in a model called Vision Transformer (ViT). The basic idea behind ViT is to split an image into patches and treat each patch like a word in natural language. In other words, images are processed like sentences. As a result, ViT succeeded in learning global information in the image.
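The "image as a sentence" idea can be sketched as follows. This is a minimal illustration under assumed sizes (an 8x8 RGB image and 4x4 patches); the helper name `image_to_patch_tokens` is hypothetical, and a real ViT would additionally apply a learned linear projection and position embeddings to each token.

```python
import numpy as np

def image_to_patch_tokens(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch squares
    and flatten each square into one token vector."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)        # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * c)

image = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
tokens = image_to_patch_tokens(image, patch=4)
print(tokens.shape)   # (4, 48): 4 "words", each a flattened 4x4x3 patch
```

The resulting rows play the role of word vectors, so they can be passed to an Attention layer in the same way as the tokens of a sentence.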

U-Net has been successful in medical image segmentation tasks, and various improvements such as 3D U-Net, Res-UNet, and U-Net++ have been reported. However, the structure of U-Net is CNN-based, so it is inherently limited in aggregating global information. Therefore, in this study, we applied a ViT-style Transformer to U-Net and proposed a new network called Swin-Unet.

In Swin-Unet, the input images are fed to a Transformer-based encoder to learn spatially broad features. The proposed method is validated on multi-organ segmentation and cardiac segmentation, and the results show that the proposed method has excellent segmentation accuracy and robust generalizability.

related research

Early medical image segmentation methods were machine learning methods based on contour information. With the development of deep CNNs, U-Net was proposed, and many variants of U-Net (Res-UNet, Dense-UNet, U-Net++, UNet3+, 3D-Unet, V-Net, etc.) have since been proposed.

In addition to ViT, methods that use self-attention to complement CNNs have also been proposed, for example by adding Attention gates to the skip connections of the existing U-Net. Note that the design philosophy of such methods is different from that of Swin-Unet presented in this paper, because they remain convolutional networks. An example of such a convolutional network is shown below.

experimental procedure

The architecture of Swin-Unet is shown below.

The input image is first split into patches of 4x4 pixels, which are fed into the Swin Transformer blocks. In the encoder, downsampling is performed in the Patch Merging layer. In the decoder, the Patch Expanding layer does the opposite and upsamples the features. The upsampled features are merged with the features from the encoder through skip connections to compensate for the loss of spatial information due to downsampling.
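The shape changes in the encoder and decoder can be followed with a small NumPy sketch. It assumes the usual Swin-style formulation (Patch Merging concatenates each 2x2 group of tokens and projects 4C channels to 2C; Patch Expanding projects C channels to 2C and rearranges them into a 2x2 spatial block with C/2 channels). The feature sizes and the random weight matrices are placeholders for learned layers, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_merging(x, weight):
    """Downsample: (H, W, C) -> (H/2, W/2, 2C)."""
    x0 = x[0::2, 0::2, :]
    x1 = x[1::2, 0::2, :]
    x2 = x[0::2, 1::2, :]
    x3 = x[1::2, 1::2, :]
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)   # (H/2, W/2, 4C)
    return merged @ weight                               # weight: (4C, 2C)

def patch_expanding(x, weight):
    """Upsample: (H, W, C) -> (2H, 2W, C/2)."""
    h, w, c = x.shape
    x = x @ weight                        # weight: (C, 2C) -> (H, W, 2C)
    x = x.reshape(h, w, 2, 2, c // 2)     # split 2C into a 2 x 2 block of C/2
    x = x.transpose(0, 2, 1, 3, 4)        # (H, 2, W, 2, C/2)
    return x.reshape(2 * h, 2 * w, c // 2)

C = 8                                     # assumed channel count
feat = rng.normal(size=(16, 16, C))       # encoder features before downsampling
down = patch_merging(feat, rng.normal(size=(4 * C, 2 * C)))
up = patch_expanding(down, rng.normal(size=(2 * C, 4 * C)))

# Skip connection: concatenate upsampled decoder features with encoder features.
fused = np.concatenate([up, feat], axis=-1)

print(down.shape)    # (8, 8, 16)
print(up.shape)      # (16, 16, 8)
print(fused.shape)   # (16, 16, 16)
```

Because `up` has the same spatial size as `feat`, the encoder features can be attached directly, which is how the skip connection restores detail lost in downsampling.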

The Swin Transformer, the basic unit of Swin-Unet, is shown below.

The Swin Transformer does not use the traditional Multi-head Self Attention module as is. Instead, the Swin Transformer block consists of a LayerNorm (LN) layer and a window-based Multi-head Self Attention (W-MHA) layer.
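The following sketch illustrates what "window-based" means: Attention is computed separately inside each small window instead of over the whole feature map. It is deliberately simplified and makes several assumptions that are not from the article: a single head, identity projections for queries, keys, and values, a LayerNorm without learned parameters, and no shifted windows or position bias.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token vector over its channel dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def window_partition(x, window):
    """(H, W, C) -> (num_windows, window * window, C)."""
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window * window, c)

def window_attention(windows):
    """Single-head attention computed independently inside each window."""
    d = windows.shape[-1]
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ windows, weights

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 4))                        # assumed feature map
windows = window_partition(layer_norm(feat), window=4)   # 4 windows of 16 tokens
out, weights = window_attention(windows)

print(windows.shape)   # (4, 16, 4)
print(weights.shape)   # (4, 16, 16) instead of one 64 x 64 global matrix
```

Restricting Attention to windows keeps the cost proportional to the number of windows rather than to the square of the number of tokens in the whole image.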


We performed the segmentation task using the Synapse multi-organ segmentation dataset (henceforth Synapse), which contains 30 cases (note: since they are CT scans, there are about 3,800 images in total). The results are shown below.

Eight abdominal organs were used (aorta, gallbladder, spleen, left and right kidneys, liver, pancreas, and stomach), and the evaluation was performed using the average Dice coefficient (DSC) and the mean Hausdorff distance (HD). The models in bold are those achieving the best score; for example, Att-UNet performed best for the aorta. The proposed Swin-Unet achieves the highest scores for the left kidney, liver, spleen, and stomach and outperforms the existing methods in the overall average.
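For readers unfamiliar with the two metrics, here is a minimal sketch of how they can be computed for one organ on a pair of binary masks. The masks are made-up toy data, the function names are hypothetical, and the Hausdorff distance shown is the plain symmetric maximum measured in pixels, which may differ from the exact variant used in the paper.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|); 1.0 means perfect overlap."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0                        # both masks empty: treat as perfect
    return 2.0 * np.logical_and(pred, target).sum() / denom

def hausdorff_distance(pred, target):
    """Symmetric Hausdorff distance between two non-empty foreground sets."""
    a = np.argwhere(pred)                 # coordinates of predicted pixels
    b = np.argwhere(target)               # coordinates of ground-truth pixels
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Largest distance from any point in one set to its nearest point in the other.
    return max(d.min(axis=1).max(), d.min(axis=0).max())

target = np.zeros((6, 6), dtype=int)
target[1:4, 1:4] = 1                      # toy ground truth: 3x3 square
pred = np.zeros((6, 6), dtype=int)
pred[2:5, 1:4] = 1                        # toy prediction: same square, one row lower

print(round(float(dice_coefficient(pred, target)), 3))   # 0.667
print(float(hausdorff_distance(pred, target)))           # 1.0
```

A higher Dice coefficient is better, while a lower Hausdorff distance is better, so the two columns of a results table have to be read in opposite directions.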

An example of segmentation is shown above, from left to right: correct label, Swin-Unet, TransUNet, AttUnet, and others. The pancreas, shown in yellow, is a difficult organ to detect because it is flat; the bottom row shows that there is a large difference between methods in the segmentation of the pancreas.


In this paper, we proposed a Transformer-based U-Net for medical image segmentation. To learn global information with a Transformer, we used the Swin Transformer. The results show that Swin-Unet performs well in the multi-organ segmentation task.
