
Take UNet To The Next Level! Enhance UNet With Transformer


3 main points
✔️ Proposes TransUNet, a model that combines UNet and Transformer
✔️ Combining the locality of CNNs with the long-range dependency modeling of Transformers is key
✔️ Achieves segmentation accuracy beyond conventional methods on two medical image datasets

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation
written by Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, Yuyin Zhou
(Submitted on 8 Feb 2021)
Comments: Published on arxiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Medical image segmentation is an active research area because it is an important preprocessing step for many medical applications. Recently, deep learning models have achieved high segmentation accuracy.

One of the most successful models for medical image segmentation is UNet, a CNN model with a U-shaped architecture. However, UNet has a weakness: it struggles to capture long-range dependencies. The reason is that CNNs, which make up UNet, are good at capturing local features but limited in capturing global context.

The strength of the Transformer, by contrast, is precisely its ability to capture long-range dependencies. It is therefore expected that the Transformer can compensate for this weakness of UNet and improve segmentation accuracy.

This paper proposes TransUNet, a model that combines UNet and Transformer. By pairing the CNN, which excels at capturing local features, with the Transformer, which excels at capturing long-range dependencies, TransUNet achieves more accurate segmentation than conventional methods.

As a result, we have achieved segmentation accuracy that exceeds that of conventional methods on two medical image datasets. In addition, our experiments show that combining CNN and Transformer provides more accurate segmentation than using CNN and Transformer alone.

In this article, we present an overview of TransUNet and the results of our experiments with medical image datasets.


TransUNet Architecture

The figure above shows the architecture of TransUNet. In brief, it is a UNet whose encoder embeds a Transformer (ViT). The encoder and decoder of TransUNet are described below.

The TransUNet encoder first extracts features with a CNN to capture local features, then feeds them through a Transformer to capture long-range dependencies. TransUNet uses ResNet-50 as the CNN and ViT as the Transformer, both pretrained on ImageNet.
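The data flow of this hybrid encoder can be sketched in a few lines of NumPy. This is purely illustrative: the strided slicing and random projection below stand in for ResNet-50 features, and the single unprojected attention layer stands in for the stack of ViT blocks; none of the paper's learned weights are involved.

```python
import numpy as np

def self_attention(tokens):
    """Single-head self-attention over a (n_tokens, dim) matrix.

    This is the operation that lets every token attend to every other
    token, giving the Transformer stage its global receptive field.
    A real ViT block would first apply learned Q/K/V projections.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def hybrid_encoder(image, cnn_stride=16, dim=64):
    """Toy stand-in for the TransUNet encoder: a 'CNN' stage that
    downsamples the image into a feature map, followed by tokenization
    and global self-attention. Only the shapes mirror the pipeline."""
    h, w, c = image.shape
    # 'CNN' stage: spatial downsampling (stand-in for ResNet-50 features).
    fh, fw = h // cnn_stride, w // cnn_stride
    feat = image[::cnn_stride, ::cnn_stride, :]        # (fh, fw, c)
    # Project channels to the token dimension (fixed random projection).
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((c, dim))
    tokens = feat.reshape(fh * fw, c) @ proj           # (fh*fw, dim)
    # Transformer stage: every token attends to all others.
    tokens = self_attention(tokens)
    return tokens.reshape(fh, fw, dim)                 # spatial feature map

feat = hybrid_encoder(np.zeros((224, 224, 3)))
print(feat.shape)  # (14, 14, 64)
```

Note how the Transformer stage operates on a flattened sequence of patch tokens, so its attention spans the entire image, whereas the CNN stage only ever mixes nearby pixels.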

The TransUNet decoder upsamples the encoded features, as in UNet, and finally outputs the segmentation result. In addition, the CNN stages of the encoder are connected to the corresponding decoder layers by skip connections.
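One decoder step can be sketched as "upsample, then concatenate the same-resolution encoder features". The nearest-neighbour upsampling and the specific resolutions below are illustrative stand-ins for the paper's learned up-convolution blocks, not its actual configuration.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (h, w, c) feature map
    (a stand-in for the decoder's learned up-convolutions)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decoder_step(deep, skip):
    """One upsampling step: enlarge the deep features, then concatenate
    the same-resolution encoder features along the channel axis
    (the skip connection), before the next convolution block would run."""
    up = upsample2x(deep)
    assert up.shape[:2] == skip.shape[:2], "skip must match resolution"
    return np.concatenate([up, skip], axis=-1)

# Hypothetical shapes: Transformer output at 14x14, CNN skips at 28/56/112.
deep = np.zeros((14, 14, 64))
skips = [np.zeros((28, 28, 32)), np.zeros((56, 56, 16)), np.zeros((112, 112, 8))]
for skip in skips:
    deep = decoder_step(deep, skip)
print(deep.shape)  # resolution grows 14 -> 28 -> 56 -> 112
```

The skip connections are what reinject the high-resolution local detail from the CNN stages, which the low-resolution Transformer output alone cannot recover.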


Medical Image Datasets

In this paper, we perform segmentation experiments using the following two medical image datasets.

  1. Synapse multi-organ segmentation dataset
    • Data set of abdominal CT images
    • Segmentation of eight organs
  2. Automated cardiac diagnosis challenge (ACDC)
    • MRI dataset of the heart
    • Segmentation of three cardiac structures


The Dice coefficient (DSC, in %) and the Hausdorff distance (HD, in mm) are used to evaluate the models: a larger DSC and a smaller HD both indicate higher segmentation accuracy.
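The Dice coefficient is simple to compute from binary masks. Here is a minimal NumPy version for a single class (the paper reports it as a percentage, averaged over classes and cases); the small `eps` guarding against empty masks is an implementation choice, not taken from the paper.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-6):
    """Dice similarity coefficient between two binary masks:
    2 * |A ∩ B| / (|A| + |B|). Ranges from 0 (no overlap) to 1."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

pred   = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(round(dice_coefficient(pred, target), 3))  # 0.667
```

The Hausdorff distance, by contrast, measures the worst-case boundary disagreement in millimetres; SciPy provides `scipy.spatial.distance.directed_hausdorff` for computing it from boundary point sets.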

The segmentation accuracy on the Synapse multi-organ segmentation dataset is as follows.

TransUNet (DSC: 77.48 %, HD: 31.69 mm) achieves better segmentation accuracy than the conventional methods (V-Net, DARR, U-Net, AttnUNet). The fact that TransUNet also outperforms the Transformer-only model (a ViT encoder with a cascaded upsampler (CUP) decoder) shows that combining the CNN and the Transformer is important.

The segmentation accuracy on the ACDC dataset was as follows.

On the ACDC dataset as well, TransUNet achieves the highest segmentation accuracy (DSC: 89.71 %) compared to the conventional methods (R50-U-Net, R50-AttnUNet) and the Transformer-only model.

Segmentation Visualization

The figure above shows actual segmentation results on the Synapse multi-organ segmentation dataset, and it can be seen that TransUNet segments more accurately than the other models.

For example, compare the segmentation images in the second row: UNet incorrectly segments the spleen (light blue) as the left kidney (red), and AttnUNet incorrectly segments the spleen (light blue) as the liver (purple). TransUNet, on the other hand, segments the spleen (light blue) correctly.


In this article, we introduced TransUNet, a model that combines UNet and Transformer for medical image segmentation. By uniting the advantages of CNNs and Transformers, it achieves segmentation accuracy beyond that of conventional models.

TransUNet is a hybrid CNN+Transformer model, but CNN-free segmentation models have also been developed. It will be interesting to see whether hybrid or CNN-free models come to dominate the segmentation task in the future.
