
ACC-UNet: Fully Convolutional UNet For The 2020s


Neural Network

3 main points
✔️ Propose a new fully convolutional UNet by introducing new convolutional blocks and skip connections to the conventional UNet
✔️ The proposed UNet can exploit both the inductive bias of CNN and the global feature extraction capability of Transformer

✔️ Achieve SOTA accuracy in UNet on 5 different tasks

ACC-UNet: A Completely Convolutional UNet model for the 2020s
written by Nabil Ibtehaz, Daisuke Kihara
(Submitted on 25 Aug 2023)
Comments: Published on arxiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.


This decade has been characterized by the introduction of the vision transformer, a fundamental paradigm shift in computer vision in general. A similar trend can be seen in medical imaging, where one of the most influential architectures, UNet, is being redesigned with transformers.

Recently, the effectiveness of convolutional models in vision has been re-examined through groundbreaking works such as ConvNext. Inspired by these developments, we aim to improve pure convolutional UNet models so that they perform as well as transformer-based models such as Swin-Unet and UCTransNet.

In this paper, we examine several advantages of transformer-based UNet models, chiefly their ability to extract global features and their cross-level skip connections. Emulating these through convolutional operations, we propose ACC-UNet, a fully convolutional UNet that combines the strengths of both approaches: the inherent inductive bias of convolution and the global feature extraction capability of transformers.

ACC-UNet was evaluated on five different medical image segmentation benchmarks and consistently outperformed convolutional networks, transformers, and their hybrids. Notably, ACC-UNet outperforms the state-of-the-art models Swin-Unet and UCTransNet by 2.64 ± 2.54% and 0.45 ± 1.61% in Dice score, respectively, while using only a fraction of their parameters (59.26% and 24.24%, respectively).

Proposed Method

Figure 1: Overview of the proposed method

ACC-UNet Overview

An overview of the overall architecture is shown in Figure 1-A. The proposed method replaces the conventional U-Net convolution block with a HANC block that introduces a self-attention-like mechanism. The conventional simple skip connections are replaced by MLFC blocks that take into account the feature maps from the different encoder levels. The following subsections describe the HANC and MLFC blocks in more detail.

Hierarchical Aggregation of Neighborhood Contexts (HANC)

First, consider how to introduce long-range dependencies, along with improved expressivity, into the convolution block. To keep computational complexity low, only pointwise and depthwise convolutions are used.

We propose including an inverted bottleneck in the convolution block to increase representational power. This is accomplished by increasing the number of channels from c_in to c_inv = c_in × inv_f using a pointwise convolution. Since these additional channels increase model complexity, the subsequent spatial mixing is done with a cheap 3×3 depthwise convolution, as shown in Figure 1-B.
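As a rough PyTorch sketch (not the authors' code; the expansion factor inv_f = 4 and the exact layer ordering are assumptions for illustration), the inverted bottleneck can look like this:

```python
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    """Pointwise expansion followed by a cheap 3x3 depthwise convolution.

    A sketch: inv_f (the expansion factor) is a hyperparameter; the value
    used in the paper may differ from the default assumed here.
    """
    def __init__(self, c_in, inv_f=4):
        super().__init__()
        c_inv = c_in * inv_f
        # 1x1 (pointwise) convolution expands channels: c_in -> c_in * inv_f
        self.expand = nn.Conv2d(c_in, c_inv, kernel_size=1)
        # 3x3 depthwise convolution: one filter per channel (groups=c_inv)
        self.depthwise = nn.Conv2d(c_inv, c_inv, kernel_size=3,
                                   padding=1, groups=c_inv)

    def forward(self, x):
        return self.depthwise(self.expand(x))

x = torch.randn(1, 16, 32, 32)
y = InvertedBottleneck(16, inv_f=4)(x)
print(y.shape)  # torch.Size([1, 64, 32, 32])
```

The depthwise convolution costs only c_inv filters of size 3×3 instead of a dense c_inv × c_inv mixing, which is what keeps the expanded representation affordable.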

To improve the ability to extract global features, self-attention is mimicked within the convolution block. Self-attention amounts to comparing a pixel with the other pixels in its neighborhood. This comparison can be simplified by comparing the pixel with the mean and maximum of its neighbors: appending the mean and maximum of the neighboring pixels' features provides an approximate notion of neighborhood comparison, and a subsequent pointwise convolution can then take these into account and capture the contrasting views. Since hierarchical analysis is beneficial for images, this aggregation is computed hierarchically at multiple levels, e.g., over patches of size 2^(k−1) × 2^(k−1).

The proposed HANC operation thus expands the feature map x1 ∈ R^(c_inv × n × m) to x2 ∈ R^(c_inv(2k−1) × n × m) (Figure 1-B), where || denotes concatenation along the channel dimension.
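A minimal PyTorch sketch of this hierarchical aggregation (an illustration, not the authors' implementation; the non-overlapping pooling windows and nearest-neighbor upsampling are assumptions) could be:

```python
import torch
import torch.nn.functional as F

def hanc(x, k=3):
    """Hierarchical Aggregation of Neighborhood Contexts (sketch).

    For each level l = 2..k, append the mean and the max over
    2^(l-1) x 2^(l-1) patches, upsampled back to the input resolution,
    so c channels become c * (2k - 1): the original map plus (k - 1)
    mean maps and (k - 1) max maps.
    """
    feats = [x]
    for l in range(2, k + 1):
        p = 2 ** (l - 1)                       # patch size at this level
        mean = F.avg_pool2d(x, p)              # neighborhood average
        mx = F.max_pool2d(x, p)                # neighborhood maximum
        # nearest-neighbor upsampling restores the spatial size n x m
        feats.append(F.interpolate(mean, size=x.shape[2:], mode="nearest"))
        feats.append(F.interpolate(mx, size=x.shape[2:], mode="nearest"))
    return torch.cat(feats, dim=1)             # concatenate along channels

x = torch.randn(1, 8, 32, 32)
print(hanc(x, k=3).shape)  # torch.Size([1, 40, 32, 32]), i.e. 8 * (2*3 - 1)
```

With k = 1 the function degenerates to the identity on channels, recovering an ordinary convolution block.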

Then, as in transformers, a shortcut connection is included in the convolution block to improve gradient propagation: another pointwise convolution reduces the number of channels back to c_in, and the result is added to the input feature map. Thus x2 ∈ R^(c_inv(2k−1) × n × m) becomes x3 ∈ R^(c_in × n × m) (Figure 1-B).

Finally, the number of channels is changed to c_out for the block's output, again using a pointwise convolution (Figure 1-B).

Multi-Level Feature Compilation (MLFC)

Next, we investigate another advantage of transformer-based UNets: multi-level feature combination.

Transformer-based skip connections effectively fuse features across encoder levels and allow the feature maps to be appropriately filtered for each decoder stage. This is accomplished by concatenating tokens from different levels.

This paper follows that approach: the convolutional feature maps obtained from the different encoder levels are resized to a common size and concatenated. The feature maps from the different semantic levels are then merged and summarized with a pointwise convolution. The result is combined with the corresponding encoder feature map, and the information is integrated through another convolution.

For features x1, x2, x3, and x4 from the four encoder levels, each feature map is enriched with multi-level information in this way (Figure 1-D). Here resize_i(x_j) denotes resizing x_j to the spatial size of x_i, and c_tot = c1 + c2 + c3 + c4 is the total number of channels after concatenation. This operation is performed separately at each level.
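A simplified PyTorch sketch of the MLFC idea (an illustration, not the paper's exact block; the nearest-neighbor resizing and the two-step pointwise fusion are assumptions, and the channel counts below are made up for the example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLFC(nn.Module):
    """Multi-Level Feature Compilation (simplified sketch).

    Feature maps from all encoder levels are resized to a common size and
    concatenated (c_tot = c1 + c2 + c3 + c4 channels), summarized with a
    pointwise convolution, then fused with each level's own map by another
    pointwise convolution.
    """
    def __init__(self, channels):
        super().__init__()
        c_tot = sum(channels)
        # one pointwise conv per level: summarize all levels, then fuse
        self.summarize = nn.ModuleList(
            nn.Conv2d(c_tot, c, kernel_size=1) for c in channels)
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * c, c, kernel_size=1) for c in channels)

    def forward(self, xs):
        outs = []
        for i, x in enumerate(xs):
            size = x.shape[2:]
            # resize_i(x_j): bring every level to level i's resolution
            resized = [F.interpolate(xj, size=size, mode="nearest") for xj in xs]
            multi = self.summarize[i](torch.cat(resized, dim=1))
            # combine with the level's own features and integrate
            outs.append(self.fuse[i](torch.cat([x, multi], dim=1)))
        return outs

xs = [torch.randn(1, c, s, s) for c, s in [(32, 64), (64, 32), (128, 16), (256, 8)]]
ys = MLFC([32, 64, 128, 256])(xs)
print([tuple(y.shape) for y in ys])
```

Each output retains its level's original channel count and resolution, so the enriched maps can drop straight into the corresponding skip connections.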



Experiments

To evaluate ACC-UNet, we experimented with five public datasets across different tasks and modalities: ISIC-2018 (dermatology, 2594 images), BUSI (breast ultrasound, 437 benign and 210 malignant images), CVC-ClinicDB (colonoscopy, 612 images), COVID (pneumonia lesion segmentation, 100 images), and GlaS (gland segmentation, 85 training images and 80 test images).

All images and masks were resized to 224 × 224. For the GlaS dataset, the original test split was used as test data; for the other datasets, 20% of the images were randomly selected as test data, and the remaining 60% and 20% were used for training and validation, respectively. Each experiment was repeated three times with different random shuffles.
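The split procedure for the non-GlaS datasets can be sketched as follows (an illustration only; the actual seeds and rounding used in the paper are unknown):

```python
import random

def split_indices(n, seed):
    """60/20/20 train/val/test split by random shuffling, repeated with
    different seeds for each of the three experimental runs."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = n // 5                     # 20% test
    n_val = n // 5                      # 20% validation
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]        # remaining ~60% for training
    return train, val, test

# e.g. CVC-ClinicDB has 612 images
train, val, test = split_indices(612, seed=0)
print(len(train), len(val), len(test))  # 368 122 122
```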

Comparison with conventional SOTA methods

Table 1: Results of the comparison with conventional SOTA methods

The proposed method was compared with UNet, MultiResUNet, Swin-Unet, UCTransNet, and SMESwin-Unet. Table 1 shows the Dice scores obtained on the test sets.

For a relatively large dataset (ISIC-2018), the transformer-based Swin-Unet achieved the second-best results, while for a smaller dataset (GlaS), the lightweight convolutional model MultiResUNet took second place. For the other datasets, the hybrid model UCTransNet was the second-best method. SMESwin-Unet lagged behind in all cases, despite having a large number of parameters.

ACC-UNet, on the other hand, combined the design principles of a transformer with the inductive bias of a convolutional neural network to achieve the best performance in all the different categories.

Across the five datasets, the Dice scores improved by 0.13%, 0.10%, 0.63%, 0.90%, and 0.27%, respectively. Thus, ACC-UNet is not only accurate but also makes effective use of a relatively small number of parameters. In terms of FLOPs, the proposed method is comparable to convolutional UNets; transformer-based UNets show smaller FLOPs because they perform extensive downsampling during patch partitioning.

Qualitative evaluation on five data sets

ACC-UNet not only achieved higher Dice scores but also produced clearly better qualitative results.

Figure 2 shows a qualitative comparison of ACC-UNet with the other models. Each row of the figure contains one example from each dataset, with ACC-UNet's predicted segmentation and the ground-truth mask displayed in the two right-hand columns. In the second example, from CVC-ClinicDB, the model distinguished the finger from the polyp almost perfectly.

Then, in the third example, from BUSI, the proposed method's prediction captured the obvious nodule region on the left while excluding a tumor region that was falsely detected by all the other models. Similarly, in the fourth sample, from the COVID dataset, the proposed method modeled the consolidation gap in the left lung visibly better, yielding a Dice score 2.9% higher than the second-best method.

In the last example, from the GlaS dataset, the proposed method not only accurately predicted the gland in the lower-right corner but also separately identified the glands in the upper left that were mostly missed or merged by the other models.

Figure 2: Qualitative evaluation on five data sets.


Summary

In this work, we examined the advantages of the transformer's design paradigms and investigated the suitability of similar ideas in a convolutional UNet. The results show that the proposed ACC-UNet merges the inductive bias of CNNs with the transformer's long-range and multi-level feature aggregation.

Experiments show that this integration can indeed improve UNet models. One limitation of the proposed method is the latency introduced by the concatenation operations, which could be addressed by alternative designs. In addition, other innovations brought by transformers, such as layer normalization, GELU activation, and the AdamW optimizer, remain to be incorporated; such efforts are expected to further improve the effectiveness of the proposed method.
