U-Net: Convolutional Networks For Biomedical Image Segmentation

3 main points
✔️ Successful training of deep networks is commonly assumed to require thousands of annotated training samples; U-Net instead relies on heavy data augmentation to use the few available samples efficiently.
✔️ The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization.
✔️ It has been shown to perform better than the best previous method (a sliding-window convolutional network) on the ISBI task of segmenting neuronal structures in electron microscopy stacks.

U-Net: Convolutional Networks for Biomedical Image Segmentation
written by Olaf Ronneberger, Philipp Fischer, Thomas Brox
(Submitted on 18 May 2015)
Comments: conditionally accepted at MICCAI 2015

Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.


This paper presents a network and training strategy that rely on heavy data augmentation to use the few available annotated samples efficiently. The proposed architecture consists of a contracting path that captures the context of the image and a symmetric expanding path that enables precise localization. The network can be trained from only a few images and outperformed the previous best method in segmenting neuronal structures in electron microscopy stacks. The same network, trained on transmitted light microscopy images, also won the ISBI Cell Tracking Challenge 2015 in these categories. It is also fast: according to the paper, segmenting a 512x512 image takes less than a second on a GPU.


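This paper starts from the progress of deep convolutional networks and the challenges they face in biomedical image processing. Convolutional networks, whose success was long limited by the size of the available training sets and of the networks themselves, now excel in visual recognition tasks thanks to large training data and deeper architectures. Biomedical tasks, however, require a class label for every pixel, and thousands of training images are usually out of reach. The authors therefore build on the "fully convolutional network" and modify it so that it works with very few training images and yields more precise segmentations. The idea is to supplement the usual contracting network with successive layers in which pooling operators are replaced by up-sampling operators, which increases the resolution of the output. High-resolution features from the contracting path are combined with the up-sampled output so that the following convolutions can localize precisely. The resulting network is roughly symmetric and U-shaped, and it allows accurate pixel-by-pixel segmentation while retaining a large amount of context information.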

The paper also describes an overlap-tile strategy: to predict the pixels in the border region of an image, the missing context is extrapolated by mirroring the input image. Because very little training data is available, the authors apply elastic deformations to the training images as data augmentation, which allows the network to learn invariance to such deformations. This matters in biomedical segmentation, where deformation is the most common variation in tissue. A further challenge is the separation of touching objects of the same class. For this purpose, the authors propose a weighted loss in which the separating background labels between touching cells receive a large weight in the loss function.
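The mirroring step can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the function name `extract_mirrored_tile` is hypothetical, and the tile sizes (572x572 input, 388x388 output, hence a 92-pixel border) are the example values from the paper's architecture figure.

```python
import numpy as np

def extract_mirrored_tile(image, top, left, out_size=388, border=92):
    """Return the input tile needed to predict an out_size x out_size region.

    The image is extended by mirroring, so tiles that touch the image
    boundary still receive `border` pixels of context on every side.
    (Hypothetical helper; sizes are example values.)
    """
    # "reflect" mirrors the image at its edge without repeating the edge pixel.
    padded = np.pad(image, border, mode="reflect")
    # In padded coordinates the output region starts at (top + border,
    # left + border), so the tile including context starts at (top, left).
    size = out_size + 2 * border
    return padded[top:top + size, left:left + size]

image = np.arange(512 * 512, dtype=np.float32).reshape(512, 512)
tile = extract_mirrored_tile(image, top=0, left=0)
print(tile.shape)
```

The central 388x388 block of the returned tile is the original image region, and everything outside it is either real neighboring context or mirrored content.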

Network architecture

This section describes the network architecture for segmentation. The network consists of a contracting path (left side) and an expanding path (right side). The contracting path repeatedly applies unpadded 3x3 convolutions followed by 2x2 max pooling. The expanding path up-samples the feature map, combines it with the correspondingly cropped feature map from the contracting path, and applies further convolutions to produce the final segmentation map. The network has a total of 23 convolutional layers, with the last layer using a 1x1 convolution to map each feature vector to the desired number of classes. To achieve seamless tiling of the output segmentation map, it is important to choose the input tile size so that every 2x2 max-pooling operation is applied to a layer with an even x- and y-size.
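The constraint on the tile size can be checked with a few lines of arithmetic. The sketch below assumes the layout described in the paper (four pooling steps, two unpadded 3x3 convolutions per level); the function name and its parameters are hypothetical.

```python
def unet_output_size(input_size, depth=4, convs_per_level=2, kernel=3):
    """Side length of the output map for a square input tile.

    Assumes unpadded convolutions (each removes kernel - 1 pixels),
    2x2 max pooling on the way down and 2x up-sampling on the way up.
    """
    shrink = convs_per_level * (kernel - 1)
    size = input_size
    # Contracting path: convolutions, then 2x2 max pooling.
    for _ in range(depth):
        size -= shrink
        if size <= 0 or size % 2 != 0:
            raise ValueError(
                f"layer of size {size} cannot be pooled 2x2 without remainder"
            )
        size //= 2
    # Convolutions at the lowest resolution.
    size -= shrink
    # Expanding path: up-sampling, then convolutions.
    for _ in range(depth):
        size = size * 2 - shrink
    if size <= 0:
        raise ValueError("input tile is too small for this depth")
    return size

print(unet_output_size(572))  # 388
```

A 572x572 tile yields a 388x388 output, whereas a 570x570 tile is rejected because the second pooling step would see an odd-sized layer.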


The network is trained on the input images and their corresponding segmentation maps using the stochastic gradient descent implementation of Caffe. Because the convolutions are unpadded, the output image is smaller than the input by a constant border width. To minimize overhead and make maximum use of GPU memory, the authors favor large input tiles over a large batch size and reduce the batch to a single image. Accordingly, they use a high momentum (0.99) so that a large number of previously seen training samples determine each update.
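To make the role of the high momentum concrete, here is a minimal sketch of a momentum SGD step of the usual form (velocity = momentum * velocity - lr * gradient). The learning rate and the helper names are assumptions for illustration and are not values from the paper.

```python
import numpy as np

def sgd_momentum_step(weights, velocity, grad, lr=0.01, momentum=0.99):
    """One SGD update with momentum; returns (new_weights, new_velocity)."""
    velocity = momentum * velocity - lr * grad
    return weights + velocity, velocity

def effective_horizon(momentum):
    """Rough number of past gradients that still influence the update."""
    return 1.0 / (1.0 - momentum)

w = np.array([1.0])
v = np.zeros_like(w)
w, v = sgd_momentum_step(w, v, grad=np.array([1.0]))
print(w, v, effective_horizon(0.99))
```

With a momentum of 0.99 the update is an exponentially decaying average over roughly the last 100 gradients, which compensates for the noise of single-image batches.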

The energy function is computed by a pixel-wise softmax over the final feature map combined with the cross-entropy loss function. The softmax is defined as

$$p_k(x) = \frac{\exp(a_k(x))}{\sum_{k'=1}^{K} \exp(a_{k'}(x))}$$

where a_k(x) denotes the activation in feature channel k at the pixel position x ∈ Ω with Ω ⊂ Z², K is the number of classes, and p_k(x) is the approximated maximum function. That is, p_k(x) ≈ 1 for the k with the maximum activation a_k(x) and p_k(x) ≈ 0 for all other k. The cross-entropy then penalizes the deviation of p_{l(x)}(x) from 1 at each position using

$$E = \sum_{x \in \Omega} w(x) \log\left(p_{l(x)}(x)\right)$$

where l : Ω → {1, ..., K} is the true label of each pixel and w : Ω → R is a weight map introduced to increase the importance of some pixels in training. (The paper writes E without a leading minus sign; as a loss to be minimized, the sum is negated.)
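The pixel-wise softmax and the weighted cross-entropy can be sketched in NumPy as follows. This is an illustration rather than the authors' Caffe code: labels are assumed to be 0-based (0 to K-1), the function names are hypothetical, and the loss uses the usual sign convention (negative weighted log-probability, so lower is better).

```python
import numpy as np

def pixelwise_softmax(activations):
    """activations: array of shape (K, H, W); returns probabilities p_k(x)."""
    # Subtracting the per-pixel maximum does not change the result
    # but avoids overflow in exp.
    shifted = activations - activations.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

def weighted_cross_entropy(activations, labels, weights):
    """labels: (H, W) ints in [0, K); weights: (H, W) weight map w(x)."""
    probs = pixelwise_softmax(activations)
    rows, cols = np.indices(labels.shape)
    # Probability assigned to the true class at every pixel.
    p_true = probs[labels, rows, cols]
    return -np.sum(weights * np.log(p_true))

activations = np.zeros((2, 2, 2))        # K = 2 classes, 2x2 image
labels = np.array([[0, 1], [1, 0]])
weights = np.ones((2, 2))
loss = weighted_cross_entropy(activations, labels, weights)
print(loss)
```

With identical activations for both classes, every pixel has probability 0.5 for its true class, so the loss equals 4 · ln 2.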

The weight map for each ground-truth segmentation is precomputed to compensate for the different frequencies of pixels from a given class in the training data set and to force the network to learn small separation boundaries introduced between touching cells (see Figure 3c).

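Separation borders are computed using morphological operations. The weight map is then computed as

$$w(x) = w_c(x) + w_0 \cdot \exp\left(-\frac{(d_1(x) + d_2(x))^2}{2\sigma^2}\right)$$

where w_c : Ω → R is the weight map that balances the class frequencies, d_1 : Ω → R is the distance to the border of the nearest cell, and d_2 : Ω → R is the distance to the border of the second nearest cell. The paper sets w_0 = 10 and σ ≈ 5 pixels.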
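A minimal NumPy sketch of such a weight map is given below. It assumes the form used in the paper, a class-balancing term w_c plus a border term w0 · exp(-(d1 + d2)² / (2σ²)), where d1 and d2 are the distances to the nearest and second nearest cell, with w0 = 10 and σ = 5. Two things are assumptions of this sketch: distances are computed by brute force (only practical for small images), and the border term is added on background pixels only.

```python
import numpy as np

def border_weight_map(instances, class_weights, w0=10.0, sigma=5.0):
    """instances: (H, W) ints, 0 = background, 1..n = individual cells.
    class_weights: (H, W) class-balancing weights w_c(x)."""
    weights = class_weights.astype(float).copy()
    cell_ids = [i for i in np.unique(instances) if i != 0]
    if len(cell_ids) < 2:
        return weights  # fewer than two cells: no separation border
    rows, cols = np.indices(instances.shape)
    pixels = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    dists = []
    for cell in cell_ids:
        cell_pixels = np.argwhere(instances == cell).astype(float)
        diff = pixels[:, None, :] - cell_pixels[None, :, :]
        # Distance from every pixel to the closest pixel of this cell.
        dists.append(np.sqrt((diff ** 2).sum(axis=2)).min(axis=1))
    dists = np.sort(np.stack(dists, axis=0), axis=0)
    d1 = dists[0].reshape(instances.shape)  # nearest cell
    d2 = dists[1].reshape(instances.shape)  # second nearest cell
    border_term = w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))
    background = instances == 0
    weights[background] += border_term[background]
    return weights

instances = np.zeros((5, 9), dtype=int)
instances[:, :4] = 1   # first cell
instances[:, 5:] = 2   # second cell, one background column in between
w = border_weight_map(instances, np.ones((5, 9)))
print(w[2, 4])
```

The single background column between the two cells has d1 = d2 = 1, so its weight is 1 + 10 · exp(-4 / 50), far above the weight of the surrounding pixels.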

In addition to the weight map, which balances class frequencies and emphasizes pixels close to cell borders, the paper stresses that deep networks with many convolutional layers need a good weight initialization. Otherwise, parts of the network may give excessive activations while other parts never contribute. Ideally, the initial weights are set so that each feature map in the network has approximately unit variance. For the proposed architecture (alternating convolution and ReLU layers), this is achieved by drawing the initial weights from a Gaussian distribution with standard deviation √(2/N), where N is the number of incoming nodes of one neuron. For example, for a 3x3 convolution with 64 feature channels in the previous layer, N = 9 · 64 = 576.
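The initialization rule is easy to write down in NumPy. In the sketch below, N is taken as kernel height times kernel width times the number of input channels; the function name and the weight layout (out, in, kh, kw) are assumptions for illustration.

```python
import numpy as np

def init_conv_weights(out_channels, in_channels, kernel=3, rng=None):
    """Draw convolution weights from a zero-mean Gaussian with std sqrt(2 / N)."""
    rng = np.random.default_rng() if rng is None else rng
    fan_in = kernel * kernel * in_channels  # N: incoming nodes of one neuron
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(out_channels, in_channels, kernel, kernel))

weights = init_conv_weights(64, 64, rng=np.random.default_rng(0))
print(weights.shape, weights.std(), np.sqrt(2.0 / 576))
```

For a 3x3 convolution with 64 input channels, N = 576 and the target standard deviation is about 0.059; the empirical standard deviation of the sampled weights should be close to that value.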

Data augmentation

When only a few training samples are available, data augmentation is essential to teach the network the desired invariance and robustness properties. For microscopy images, the main requirements are shift and rotation invariance and robustness to deformations and gray value variations. In particular, random elastic deformation of the training samples appears to be the key concept for training a segmentation network with very few annotated images. Smooth deformations are generated from random displacement vectors on a coarse 3x3 grid, sampled from a Gaussian distribution with a standard deviation of 10 pixels and interpolated to per-pixel displacements with bicubic interpolation. Drop-out layers at the end of the contracting path perform further implicit data augmentation.
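The following NumPy sketch shows the idea of a random elastic deformation driven by a coarse grid of displacement vectors. It is a simplification under stated assumptions: the coarse field is up-sampled bilinearly and the image is resampled with nearest-neighbor lookup, whereas the paper uses bicubic interpolation; the function names are hypothetical.

```python
import numpy as np

def upsample_bilinear(coarse, height, width):
    """Bilinearly interpolate a small 2-D grid to (height, width)."""
    gh, gw = coarse.shape
    ys = np.linspace(0, gh - 1, height)
    xs = np.linspace(0, gw - 1, width)
    y0 = np.clip(np.floor(ys).astype(int), 0, gh - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, gw - 2)
    fy = (ys - y0)[:, None]
    fx = (xs - x0)[None, :]
    top = coarse[y0][:, x0] * (1 - fx) + coarse[y0][:, x0 + 1] * fx
    bottom = coarse[y0 + 1][:, x0] * (1 - fx) + coarse[y0 + 1][:, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy

def elastic_deform(image, label, sigma=10.0, grid=3, rng=None):
    """Apply the same random smooth deformation to an image and its label map."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    # Random displacement vectors on a coarse grid, in pixels.
    dy = upsample_bilinear(rng.normal(0.0, sigma, (grid, grid)), h, w)
    dx = upsample_bilinear(rng.normal(0.0, sigma, (grid, grid)), h, w)
    rows, cols = np.indices((h, w))
    # Nearest-neighbor lookup keeps label values intact (no blending).
    src_r = np.clip(np.rint(rows + dy).astype(int), 0, h - 1)
    src_c = np.clip(np.rint(cols + dx).astype(int), 0, w - 1)
    return image[src_r, src_c], label[src_r, src_c]

rng = np.random.default_rng(0)
image = rng.random((64, 64))
label = (image > 0.5).astype(int)
warped_image, warped_label = elastic_deform(image, label, rng=rng)
print(warped_image.shape, warped_label.shape)
```

Applying one displacement field to both the image and its segmentation map keeps the two aligned, which is what makes the deformed pair usable as a new training sample.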


The authors demonstrate u-net on three different segmentation tasks. The first task is the segmentation of neuronal structures in electron microscopy recordings, using the data set of the ISBI 2012 EM Segmentation Challenge. Without any further pre- or post-processing, u-net achieved a warping error of 0.0003529 and a Rand error of 0.0382, which is better than the previous best method, the sliding-window convolutional network.

The second task is cell segmentation in light microscopy images from the ISBI cell tracking challenge. On the PhC-U373 data set, u-net achieved an average IOU of 92%, significantly better than the competing algorithms.

u-net was also successful in segmenting HeLa cells recorded by differential interference contrast (DIC) microscopy, achieving an average IOU of 77.5% on the DIC-HeLa data set, significantly better than the second-best algorithm (46%).
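For reference, the IOU ("intersection over union") of a predicted mask and a ground-truth mask can be computed as in the sketch below. The function is a generic illustration for a single pair of binary masks; the challenge's exact evaluation protocol is not reproduced here.

```python
import numpy as np

def intersection_over_union(pred, truth):
    """IOU of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement here
    return np.logical_and(pred, truth).sum() / union

pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [0, 0]])
print(intersection_over_union(pred, truth))  # 0.5
```

In this toy example one pixel is shared and two pixels are covered by at least one of the masks, giving an IOU of 0.5.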


The u-net architecture achieves very good performance on very different biomedical segmentation applications. Thanks to data augmentation with elastic deformations, it needs only a few annotated images, and training takes only about 10 hours on an NVidia Titan GPU (6 GB). The authors provide the full Caffe-based implementation and the trained networks, and state that u-net can be applied easily to many more tasks.

U-Net has a wide range of potential applications in biomedical image analysis and segmentation. Its convenience and high flexibility could lead to revolutionary advances in medical image processing.