Transformer Achieves High Accuracy In Dense Prediction Tasks!


3 main points
✔️ Use of vision transformers as the backbone for dense prediction tasks.
✔️ Significant improvements on dense prediction tasks over CNN backbone.
✔️ New SOTA on several datasets: NYUv2, KITTI, ADE20K, and Pascal Context.

Vision Transformers for Dense Prediction
written by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun
(Submitted on 24 Mar 2021)
Comments: Accepted to arXiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)



Convolutional Neural Networks (CNNs) form the backbone of almost all dense prediction architectures. Most of these architectures can be broadly divided into two parts: an encoder and a decoder. The encoder is usually a strong model that has been pretrained on a large dataset such as ImageNet. CNN encoders progressively downsample the feature maps, which allows features to be extracted at different scales. This downsampling reduces the memory cost of the computations in the decoder, but it also discards information that the decoder cannot recover.

In this paper, we introduce the Dense Prediction Transformer (DPT). It is based on the idea of using the Vision Transformer (ViT) as the encoder for dense prediction tasks. We found that dense prediction benefits from the one-time initial downsampling and the global receptive field of ViT. Using our method, we were able to set a new state of the art in the two dense prediction tasks we tested: monocular depth estimation (a relative improvement of up to 28%) and semantic segmentation.

Proposed Model: DPT

ViT is very similar to the original transformer model; the only difference is the initial image-processing layer. The model stacks several blocks, each composed of a multi-headed self-attention layer followed by feed-forward layers. The image is divided into non-overlapping patches, which are flattened and embedded using a learned linear layer. For example, with a patch size of p = 16, a 384×384 image is split into 24 × 24 = 576 patches, each of which covers 16 × 16 pixels and is flattened into a vector of 16 × 16 × 3 = 768 values for an RGB image. An alternative image-processing layer uses a ResNet-50 architecture to extract feature maps, which are then used as the input features for the transformer. (For more information, please see this article.)
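The patch-splitting step can be sketched in a few lines of NumPy. This is only an illustration of the arithmetic, not the authors' implementation: the function name patchify, the random image, and the random matrix W_embed (a stand-in for the learned linear layer) are assumptions made for this example.

```python
import numpy as np

def patchify(image: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping p x p patches, each flattened."""
    h, w, c = image.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by p"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (Np, p*p*C)
    patches = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape((h // p) * (w // p), p * p * c)

rng = np.random.default_rng(0)
image = rng.random((384, 384, 3))      # hypothetical RGB input
tokens = patchify(image, p=16)         # one row per patch

# Random stand-in for the learned linear embedding (assumption, not trained weights).
D = 768
W_embed = rng.standard_normal((tokens.shape[1], D)) * 0.02
embedded = tokens @ W_embed            # one D-dimensional token per patch

print(tokens.shape, embedded.shape)
```

For a 384×384 RGB image and p = 16, tokens has 576 rows (one per patch) and 768 columns (16 × 16 × 3 pixel values).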

Since self-attention by itself does not take the order of the tokens into account, positional information needs to be added to the input sequence through positional embeddings. In addition, a special token called the readout token (colored red in the above diagram) is added to the sequence; its representation is used for classification. The input tokens are transformed using L transformer layers. Three ViT variants are used: ViT-Base, ViT-Large, and ViT-Hybrid. ViT-Base and ViT-Large have 12 and 24 transformer layers respectively and embed the flattened patches into 768 and 1024 dimensions respectively. ViT-Hybrid uses a ResNet-50 to compute the image embeddings and extracts features at 1/16th of the resolution of the input image.
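A toy example can show why the positional embeddings are needed. The sketch below uses a single attention head with identity projections, tiny dimensions, and random values in place of learned parameters; all of these are simplifying assumptions, and the readout token is placed at index 0 purely by convention for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
Np, D = 16, 8          # toy sizes; ViT-Base at 384x384 would use 576 and 768

def self_attention(x):
    """Single-head self-attention with identity projections (illustration only)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x

patch_tokens = rng.standard_normal((Np, D))
perm = rng.permutation(Np)

# Without positions, shuffling the input rows only shuffles the output rows.
order_blind = np.allclose(self_attention(patch_tokens[perm]),
                          self_attention(patch_tokens)[perm])

# Readout token at index 0, then one position embedding per token.
readout = np.zeros((1, D))                      # learned in practice
pos_embed = rng.standard_normal((Np + 1, D))    # learned in practice
sequence = np.concatenate([readout, patch_tokens], axis=0) + pos_embed

print(order_blind, sequence.shape)
```

Once the position embeddings are added, each token carries its own location, so the same shuffle no longer leaves the result unchanged up to reordering.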

Convolutional Decoder

The decoder assembles output tokens from four selected transformer layers at four different resolutions: layers {5, 12, 18, 24} for ViT-Large and layers {3, 6, 9, 12} for ViT-Base. For ViT-Hybrid, it uses the features from the first and second ResNet blocks of the embedding network together with transformer layers {9, 12}. Using a simple three-stage reassemble operation, we recover image-like representations from these output tokens.

Here, s denotes the size ratio between the input image and the recovered representation (the recovered representation has spatial size H/s × W/s), and D' is the output feature dimension.

As shown in the diagram in the middle, the Np + 1 tokens are first mapped to Np tokens so that they can be arranged into an image-like form. This can be done in three different ways: by ignoring the readout token (Read_ignore), by adding the readout token to all other tokens (Read_add), or by concatenating the readout token to all other tokens and then projecting back to the original feature dimension D using an MLP (linear layer + GELU) (Read_proj).
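The three read variants can be written down compactly. The following NumPy sketch assumes the readout token sits at index 0 of a (Np + 1, D) token array and uses random weights and the tanh approximation of GELU; the function names are chosen for this example and are not taken from the authors' code.

```python
import math
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

def read_ignore(tokens):
    """Drop the readout token (assumed to be at index 0)."""
    return tokens[1:]

def read_add(tokens):
    """Add the readout token to every patch token."""
    return tokens[1:] + tokens[0]

def read_proj(tokens, weight, bias):
    """Concatenate the readout token to every patch token, then project 2D -> D."""
    readout = np.broadcast_to(tokens[0], tokens[1:].shape)
    cat = np.concatenate([tokens[1:], readout], axis=1)   # (Np, 2D)
    return gelu(cat @ weight + bias)                      # (Np, D)

rng = np.random.default_rng(0)
Np, D = 576, 768
tokens = rng.standard_normal((Np + 1, D))
weight = rng.standard_normal((2 * D, D)) * 0.02   # random stand-in for learned weights
bias = np.zeros(D)

out_ignore = read_ignore(tokens)
out_add = read_add(tokens)
out_proj = read_proj(tokens, weight, bias)
print(out_ignore.shape, out_add.shape, out_proj.shape)
```

All three variants return Np tokens of dimension D, which is what the following step needs in order to place them on the (H/p) × (W/p) patch grid.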

The Np tokens are then arranged according to their patch positions into an image-like representation of shape (H/p) × (W/p) × D, which is resampled to shape (H/s) × (W/s) × D'. A 1 × 1 convolution first projects the representation to D' = 256 dimensions. Then, when s ≥ p, a strided 3 × 3 convolution is used for spatial downsampling, and when s < p, a strided 3 × 3 transpose convolution is used for spatial upsampling.
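The direction of the resampling follows from the shapes alone: the token grid is (H/p) × (W/p) and the target grid is (H/s) × (W/s), so the map has to grow when s < p and shrink when s > p. The sketch below works this out for the scales s = 4, 8, 16, 32; these scales, the function name resample_plan, and the treatment of s = p as a stride-1 case are assumptions made for illustration.

```python
def resample_plan(height, width, p=16, scales=(4, 8, 16, 32)):
    """List, for each scale s, the target grid size and the kind of 3x3 layer needed."""
    plan = []
    for s in scales:
        if s > p:
            op = f"3x3 convolution, stride {s // p} (downsample)"
        elif s < p:
            op = f"3x3 transpose convolution, stride {p // s} (upsample)"
        else:
            op = "stride 1 (no spatial change)"
        plan.append((s, height // s, width // s, op))
    return plan

for s, h_out, w_out, op in resample_plan(384, 384):
    print(f"s={s:2d}: {384 // 16}x{384 // 16} -> {h_out}x{w_out} via {op}")
```

For a 384×384 input with p = 16, the 24×24 token grid becomes 96×96, 48×48, 24×24, and 12×12 at the four scales.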

As shown in the rightmost diagram, the decoder uses RefineNet-based feature fusion blocks. The representation is progressively upsampled at each step so that the final representation has half the resolution of the input image. The DPT architecture can handle variable input sizes as long as the input dimensions are divisible by p. For every image, the position embeddings can be linearly interpolated to the appropriate size.
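The interpolation of the position embeddings can be sketched as a bilinear resize of the embedding grid. This is a minimal NumPy version under several assumptions: the embeddings of the patch tokens are stored as a 24 × 24 × D grid (trained at 384×384 with p = 16), the corners of the old and new grids are aligned, and the embedding of the readout token is handled separately and left unchanged. The function name resize_bilinear is hypothetical.

```python
import numpy as np

def resize_bilinear(grid, new_h, new_w):
    """Bilinearly resize an (h, w, D) grid to (new_h, new_w, D), aligning corners."""
    h, w, _ = grid.shape
    ys = np.linspace(0, h - 1, new_h)          # sample rows in old grid coordinates
    xs = np.linspace(0, w - 1, new_w)          # sample columns in old grid coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]              # vertical blend weights
    wx = (xs - x0)[None, :, None]              # horizontal blend weights
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

rng = np.random.default_rng(0)
p, D = 16, 768
pos_grid = rng.standard_normal((24, 24, D))    # random stand-in for learned embeddings

H, W = 480, 640                                # new input size, divisible by p
new_grid = resize_bilinear(pos_grid, H // p, W // p)
new_pos = new_grid.reshape(-1, D)              # one embedding per patch token
print(new_pos.shape)
```

For a 480×640 input this yields 30 × 40 = 1200 position embeddings, one for each patch token of the larger image.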


We test DPT on two major dense prediction tasks: monocular depth estimation and semantic segmentation. We chose monocular depth estimation (MDE) because transformers benefit from large amounts of training data, and a large meta-dataset can be easily constructed from the available MDE datasets.

We trained our DPT models on MIX 6, a meta-dataset with over 1.5 million images that we compiled. The above table shows the zero-shot transfer results on several MDE datasets that were not seen during training. Our models outperform all the other SOTA models, including the MiDaS model. To ensure that our performance gains are not only due to the larger dataset, we also trained MiDaS on the MIX 6 dataset, but it was still unable to beat the DPT models.

We also fine-tuned the DPT models on the smaller datasets NYUv2 (left) and KITTI (right), and DPT-Hybrid set a new state of the art on both datasets.

The above table shows the results on the ADE20K semantic segmentation benchmark. The DPT models were trained for 240 epochs. DPT-Hybrid outperforms all other models, while DPT-Large performs slightly worse, probably due to the relatively small size of the dataset. For more details on the experiments, please refer to the original paper.


It is quite clear from the experiments that the DPT models are effective in dense prediction tasks. DPT improved the state of the art on several benchmarks and, like other transformer-based architectures, its performance improves with large-scale datasets. Future work should assess the performance of the DPT models on other dense prediction tasks such as instance segmentation (using the COCO benchmark) and should aim to bring the effectiveness of DPT to domains with scarcer data.

Thapa Samrat
I am a second-year international student from Nepal, currently studying at the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning, so I write articles about them in my spare time.