A New Resizer That Improves Performance On Image Tasks!

Image Recognition 28/04/2021

3 main points
✔️ A new adaptive image rescaling method using CNNs.
✔️ Works well with all types of architecture and consistently improves performance.
✔️ Makes image resizing at arbitrary scaling factors possible.

Learning to Resize Images for Computer Vision Tasks
written by Hossein Talebi, Peyman Milanfar
(Submitted on 17 Mar 2021)
Comments: NA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Introduction

The recent achievements in computer vision are the result of two important factors: CNNs and large datasets like ImageNet. Besides these two, advances in training methods and data augmentation have also helped improve the performance of CNNs. One aspect that has been unfairly neglected is the image size. It is typical to downsample the image to a lower resolution (224x224) during training and evaluation using nearest neighbor, bilinear, or bicubic resizing. This is done for three major reasons: 1) memory limitations, 2) mini-batch learning requires images of equal size, and 3) training and inference speed. The information lost through this downsampling can have a significant impact on the accuracy of CNNs.

Recent works have shown some progress by training enhancement modules that are optimized for better perceptual quality. However, the ultimate goal of a recognition model is accuracy, and we believe that optimizing a module to make the intermediate image "look good" to humans does not necessarily serve that goal.

Therefore, in this paper, we introduce a new adaptive image resizer that is trained simultaneously with classification models. The resizer module works well with various classification models like Inception, DenseNet, ResNet, and EfficientNets, at arbitrary scaling factors.

The Resizer Model

Our image resizer model is simple and can be used with all types of architectures. It can both upscale and downscale images, which enables us to effectively search for the optimal batch size and image resolution for a particular architecture.

The above diagram shows the architecture. The two most distinct features are 1) the bilinear resizing and 2) the skip connections, which make it easy to combine the resized CNN features with the bilinearly resized image. In the above diagram, the bilinear resizer acts as a feed-forward bottleneck (downscaling). Nevertheless, it can also be used to upscale the image. The bilinear resizer can be replaced with any other differentiable resizing technique, such as bicubic or Lanczos.
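The combination of a bilinear resize and a skip connection can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: it works on a single-channel image stored as a list of rows, it uses the align-corners sampling convention (one of several in common use), and `learned_residual` is a hypothetical stand-in for whatever the CNN branch of the resizer would output.

```python
def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D grayscale image (list of rows) with bilinear interpolation.

    Assumes the align-corners convention: the corner pixels of the input
    and of the output coincide.
    """
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Map the output row index to a fractional input row coordinate.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            # Same mapping for the column coordinate.
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = (1 - wx) * img[y0][x0] + wx * img[y0][x1]
            bottom = (1 - wx) * img[y1][x0] + wx * img[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


def resize_with_skip(img, out_h, out_w, learned_residual):
    """Add a residual (hypothetical CNN output) to the bilinearly resized image."""
    base = bilinear_resize(img, out_h, out_w)
    return [[b + r for b, r in zip(base_row, res_row)]
            for base_row, res_row in zip(base, learned_residual)]
```

With an all-zero residual, `resize_with_skip` reduces to plain bilinear resizing, which is why a skip connection of this kind gives the learned branch a sensible starting point.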

The model contains r identical residual blocks, with r in {1, 2, 3}. The intermediate convolutional layers all have n = 16 kernels of size 3 × 3. Only the first and the last layers use 7 × 7 kernels. We also make use of batch normalization layers and LeakyReLU activations (with a negative slope coefficient of 0.2), as shown in the figure above.
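To make the two numbers above concrete, here is a small sketch of the LeakyReLU activation with slope 0.2 and of the weight count of a single convolutional layer. The helper names are hypothetical, and the channel counts in the usage note (16 to 16 for an intermediate layer, 3 RGB channels to 16 for the first layer) are assumptions, since the text only gives the kernel sizes and n. Bias and batch-normalization parameters are deliberately left out.

```python
def leaky_relu(x, negative_slope=0.2):
    """LeakyReLU: identity for x >= 0, a small linear slope for x < 0."""
    return x if x >= 0 else negative_slope * x


def conv_weight_count(kernel_size, in_channels, out_channels):
    """Number of kernel weights in a square 2-D convolution (no bias terms)."""
    return kernel_size * kernel_size * in_channels * out_channels
```

Under these assumptions, `conv_weight_count(3, 16, 16)` gives 2304 weights for an intermediate layer and `conv_weight_count(7, 3, 16)` gives 2352 for the first layer, which shows why the whole resizer stays in the range of thousands of parameters.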

The above table shows the number of parameters in the model, in thousands. Even the largest resizer (93.37 thousand parameters) is much smaller than the baseline ResNet-50 model with its 23 million parameters (about 0.4% of its size), so the resizer adds little computational load. The model is trained using the cross-entropy loss with logits produced using a sigmoid layer. We also use label smoothing of 0.1 to reduce the overconfidence of the model.
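Label smoothing with a factor of 0.1 can be illustrated with the short sketch below. It assumes the common variant in which the smoothing mass is spread uniformly over all classes, including the true one; the text does not say which variant is used, and the function name is hypothetical.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return a smoothed target distribution for one example.

    Assumed variant: (1 - epsilon) * one_hot + epsilon / num_classes.
    """
    off_value = epsilon / num_classes
    return [
        (1.0 - epsilon) + off_value if k == true_class else off_value
        for k in range(num_classes)
    ]
```

For example, with 4 classes the true class receives a target of 0.925 and every other class 0.025, so the model is never pushed toward predicting a probability of exactly 1.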

In addition to classification models, we also train an Image Quality Assessment (IQA) model on the AVA dataset to evaluate the quality of resized images from our model. Each image in the AVA dataset comes with a histogram of human-annotated scores ranging from 1 to 10, so the last layer is a softmax layer with 10 outputs, one per score. This IQA model is trained using the Earth Mover's Distance (EMD) regression loss.

In this loss, CDF is the cumulative distribution function and d is set to 2. p_k and q_k are the prediction and the label for the k-th class (K = 10 for the AVA dataset). This loss compels the model to learn the distribution of human ratings.
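A minimal sketch of such a loss is given below. It assumes the normalized form commonly used for rating histograms, EMD(p, q) = ((1/K) * sum over k of |CDF_p(k) - CDF_q(k)|^d)^(1/d); the exact expression is not reproduced in this text, so treat the normalization as an assumption. The function name is hypothetical.

```python
from itertools import accumulate


def emd_loss(p, q, d=2):
    """Earth Mover's Distance between two discrete distributions over K ordered classes.

    p: predicted probabilities, q: ground-truth probabilities (both sum to 1).
    Assumed form: ((1/K) * sum_k |CDF_p(k) - CDF_q(k)|**d) ** (1/d).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of classes")
    cdf_p = list(accumulate(p))  # running sums give the cumulative distribution
    cdf_q = list(accumulate(q))
    num_classes = len(p)
    total = sum(abs(a - b) ** d for a, b in zip(cdf_p, cdf_q))
    return (total / num_classes) ** (1.0 / d)
```

Because the loss compares cumulative distributions, predicting a score of 9 when the label is 10 is penalized less than predicting a score of 1, which a plain cross-entropy over the 10 classes would not capture.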

Experiments

First, we train the baseline models using the bilinear and bicubic resizing methods. The weights of these trained CNN models are used to initialize the classification and IQA models that make use of our learned resizers. We train the proposed resizer at different image sizes from 224x224 to 448x448, and the resizer's input size is always kept greater than or equal to its output size. The batch sizes are adjusted for each configuration because of memory limitations.

The above table shows the classification results on the ImageNet dataset. The bold numbers highlight the performances in the 224x224 category. We found that increasing the input resolution benefits the performance of DenseNet-121, ResNet-50, and MobileNet-v2, but not Inception-v2.

Here is a visual sample of the resizing done by various models. The resizer models tend to boost the high-frequency details. Except for MobileNet-v2, the images produced by the models are quite sharp.

The above table shows the results on the AVA dataset. The performance is measured by the correlation between the mean ground truth score and the mean predicted score, using the Pearson linear correlation coefficient (PLCC) and the Spearman rank correlation coefficient (SRCC). As in the classification task, the resizer consistently improves the performance of the baseline models.
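The two metrics can be computed as in the sketch below, which uses only the standard library. The function names are hypothetical, and tied values are handled by average ranks, which is the usual convention for SRCC.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks of the values; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC): PLCC computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

PLCC rewards predictions that are linearly related to the ground truth, while SRCC only checks that the ordering of the images is preserved, so a model can score a perfect SRCC with a PLCC below 1.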

In order to test how well the resizer models generalize, we tried substituting the CNN models with other CNN models while keeping the resizer constant. With fine-tuning for about 4 epochs, the resizer model effectively adapts to the target model.

Ablations: The resizer model has two hyperparameters: the number of residual blocks (r) and the number of filters (n). We conducted a series of experiments to determine the best values for them.

We found that n = 16 and r = 1 were the best choice overall and used these values for all our experiments.

Summary

The adaptive image resizers significantly improve the performance of the image classification task regardless of the architecture used. However, there is still significant room for improvement. The resizer model brings with it the trouble of tuning two additional hyperparameters (r, n), and a resizer trained with one architecture needs to be fine-tuned before it is used with another. Future works could look for a universal adaptive resizer model (one-fits-all) that also works well with other tasks like image segmentation, object detection, and visio-textual tasks.

Thapa Samrat: I am a second year international student from Nepal who is currently studying at the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning. So I write articles about them in my spare time.