Cutblur: What Is The Optimal Data Augmentation For Super-resolution Tasks?

Super Resolution 01/04/2021

3 main points
✔️ A comprehensive study of Data Augmentation methods in super-resolution tasks
✔️ Cutblur makes models learn where and how they should be super-resolved
✔️ Perceptual quality improvement on super-resolution and denoising tasks in various benchmarks

Rethinking Data Augmentation for Image Super-resolution: A Comprehensive Analysis and a New Strategy
written by Jaejun Yoo, Namhyuk Ahn, Kyung-Ah Sohn
(Submitted on 1 Apr 2020 (v1), last revised 23 Apr 2020 (this version, v2))
Comments: Accepted to CVPR 2020.
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)


Introduction

Since the rise of deep neural networks (DNNs), most state-of-the-art (SoTA) image super-resolution methods have been data-driven neural networks (NNs), as in other computer vision fields. A common problem with NNs is that training cost grows as networks become deeper. For typical image classification tasks, various Data Augmentation (DA) techniques have been developed alongside elaborate network architectures to improve performance, but few studies have examined DA for super-resolution tasks. It is empirically known that DA tailored to an individual task suppresses overfitting and contributes greatly to model generalization. Against this background, exploring DA for super-resolution tasks is worthwhile as well.

In this article, we describe "Rethinking Data Augmentation for Image Super-resolution: A Comprehensive Analysis and a New Strategy," presented at CVPR 2020. As shown in the figure below, this paper comprehensively examines the various DA methods surrounding the super-resolution task. The paper not only reapplies existing methods from classification and other tasks but also proposes new DA methods suited to super-resolution: Cutblur and CutMixup. Of the two, Cutblur has attracted much attention because it is a smart DA designed specifically for super-resolution, while CutMixup adapts ideas from classification-oriented methods (CutMix and Mixup). The paper presents abundant results and analyses of these methods and includes important findings for task-specific DA.

In this article, we will explain this paper in the following four sections.

Analysis of conventional DA methods for super-resolution tasks
How Cutblur Works
What learning does Cutblur encourage in the model?
Comprehensive experimental results

Analysis of conventional DA methods for super-resolution tasks

Data-level augmentation

First, let's look at how existing data-level augmentations perform on super-resolution. DIV2K and RealSR are the synthetic and real datasets used for the super-resolution task, PSNR (peak signal-to-noise ratio) is an evaluation metric that measures how closely the restored image approximates the ground truth, and EDSR is the baseline single-image super-resolution model. Looking at the results for each DA method, we can see that every augmentation improves the score over the baseline. In particular, the following two points in this table should be noted.

CutMixup is a new variant of CutMix that produces a milder change at the patch boundary, which yields higher accuracy than CutMix.
Mix-type methods (Mixup, CutMixup, RGB permute) are more accurate than Cut-type methods (Cutout, CutMix).

These two points suggest that, in the super-resolution task, it is important to limit structural change to the image. Cut-type methods remove part of the image, which changes its structure substantially, while Mix-type methods preserve the structural information. CutMix, which excels in classification tasks, does not work well for super-resolution, suggesting that Mixup or a simple RGB permutation is more suitable for super-resolution tasks. In fact, when the Cutout rectangle is enlarged so that more structural information is dropped, the accuracy deteriorates, as shown in the figure below, which reinforces this claim. (The red line is the setting that removes the most.)
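The difference between Cut-type and Mix-type augmentations can be made concrete with a minimal NumPy sketch. This is an illustration under stated assumptions, not the paper's implementation: the helper names (`box_mask`, `mixup`, `cutmix`, `cutmixup`), the fixed box position, and the fixed mixing ratio `lam` are all hypothetical, and real training code would sample them at random and apply the same operation to the LR input and the HR target.

```python
import numpy as np

def box_mask(h, w, top, left, bh, bw):
    """Binary mask of shape (h, w, 1): 1 inside the box, 0 elsewhere."""
    m = np.zeros((h, w, 1), dtype=np.float32)
    m[top:top + bh, left:left + bw] = 1.0
    return m

def mixup(x1, x2, lam):
    """Blend two images pixel-wise with ratio lam (hypothetical fixed value)."""
    return lam * x1 + (1.0 - lam) * x2

def cutmix(x1, x2, mask):
    """Replace the masked region of x1 with x2: a hard content boundary."""
    return (1.0 - mask) * x1 + mask * x2

def cutmixup(x1, x2, mask, lam):
    """Replace the masked region of x1 with a Mixup blend of x1 and x2."""
    return (1.0 - mask) * x1 + mask * mixup(x1, x2, lam)

# Toy images: x1 is all zeros, x2 is all ones, so pixel values show the mix directly.
x1 = np.zeros((8, 8, 3), dtype=np.float32)
x2 = np.ones((8, 8, 3), dtype=np.float32)
mask = box_mask(8, 8, top=2, left=2, bh=4, bw=4)

cm = cutmix(x1, x2, mask)
cmu = cutmixup(x1, x2, mask, lam=0.7)

# Intensity jump across the box boundary (inside pixel minus outside pixel).
print("CutMix jump:  ", cm[3, 3, 0] - cm[0, 0, 0])
print("CutMixup jump:", cmu[3, 3, 0] - cmu[0, 0, 0])
```

With these toy inputs the jump at the box boundary is 1 for CutMix and about 1 - lam for CutMixup, which is one way to read the "less boundary change" property mentioned in the list above.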

Feature-level augmentation

In general, DA refers to the data-level augmentation described above, but in a broader sense it also includes manipulation of the feature space. The most typical example is Dropout, which randomly ignores particular neurons. Other examples are Manifold Mixup, which mixes the outputs of intermediate layers, and ShakeShake and ShakeDrop, which perturb the forward and backward computations with random numbers. These feature manipulations uniformly reduce the score on the super-resolution task: as shown in the figure below, +MM (Manifold Mixup) and +SD (ShakeDrop) fall below the baseline for both the RCAN and EDSR models. The reason is almost the same as above and is thought to be the large disturbance of structural information at the feature level after convolution.
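As a rough illustration of what "mixing the outputs of intermediate layers" means, here is a minimal NumPy sketch of a Manifold Mixup style forward pass. Everything in it is an assumption made for illustration: the two-layer toy network, its random weights `W1` and `W2`, the function names, and the fixed ratio `lam` are hypothetical and are not taken from the paper or from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-layer network with random weights (hypothetical, for illustration only).
W1 = rng.standard_normal((16, 8))
W2 = rng.standard_normal((8, 16))

def hidden(x):
    """First layer: linear map followed by ReLU, giving intermediate features."""
    return np.maximum(x @ W1, 0.0)

def head(h):
    """Second layer: maps intermediate features to the output."""
    return h @ W2

def manifold_mixup_forward(x1, x2, y1, y2, lam):
    """Mix the hidden features of two samples, then finish the forward pass.

    The targets are mixed with the same ratio lam.
    """
    h_mix = lam * hidden(x1) + (1.0 - lam) * hidden(x2)
    y_mix = lam * y1 + (1.0 - lam) * y2
    return head(h_mix), y_mix

x1, x2 = rng.standard_normal(16), rng.standard_normal(16)
y1, y2 = rng.standard_normal(16), rng.standard_normal(16)
out, y_mix = manifold_mixup_forward(x1, x2, y1, y2, lam=0.6)
print(out.shape, y_mix.shape)
```

The mixed feature `h_mix` does not correspond to any single input image, which is the kind of feature-level disturbance of structural information that the paragraph above describes.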

But why do the two kinds of task behave so differently? Put simply, a classification model aims at the ultimate abstraction of the image, while a super-resolution model aims at its concretization (restoration). This difference in level suggests that super-resolution needs to maintain structural information (both local and global) in the image space. The literature uses the terms high-level and low-level vision to refer to this difference.

How Cutblur Works

What we have learned from the analysis of conventional DA is that a DA method for super-resolution should retain structural information. Cutblur follows naturally from this finding. Cutblur cuts a region from a low-resolution image and pastes it into the corresponding region of the high-resolution image (and vice versa) by the operation shown in the above figure. Simply put, it makes a specific part of the image low-resolution. In CutMix-style notation, this can be expressed as follows.

$\hat{x}_{HR \to LR} = M \odot x_{HR} + (1 - M) \odot x^s_{LR}$

$\hat{x}_{LR \to HR} = M \odot x^s_{LR} + (1 - M) \odot x_{HR}$

where $x_{LR}$ is a low-resolution image, $x_{HR}$ is a high-resolution image, and $M$ is a binary mask that specifies the region to replace. Because the two images have different resolutions, the low-resolution image must first be aligned with the high-resolution image so that both cover the same spatial grid. Specifically, $x_{LR}$ is upsampled by the scale factor $s$ to the size of $x_{HR}$, which gives $x^s_{LR}$. The features of Cutblur are summarized below. You can see that this DA is tailored to super-resolution tasks.

It does not cause abrupt boundary changes in the image due to content changes, as CutMix does.
It does not produce unrealistic images, as vanilla Mixup does.
It does not lose structural information, as the Cut-type methods do.
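The two Cutblur equations above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the authors' code: the function names `upsample_nearest` and `cutblur` are hypothetical, nearest-neighbor pixel repetition stands in for whatever interpolation is used in practice, and the box position is passed in explicitly instead of being sampled at random.

```python
import numpy as np

def upsample_nearest(x_lr, s):
    """Upsample (H, W, C) to (s*H, s*W, C) by repeating pixels.

    Stand-in for the interpolation used in practice (assumption of this sketch).
    """
    return np.repeat(np.repeat(x_lr, s, axis=0), s, axis=1)

def cutblur(x_hr, x_lr, s, top, left, bh, bw):
    """Return (x_hat_HR_to_LR, x_hat_LR_to_HR) following the two equations above."""
    x_lr_s = upsample_nearest(x_lr, s)          # x^s_LR, same size as x_HR
    if x_lr_s.shape != x_hr.shape:
        raise ValueError("x_lr upsampled by s must match x_hr in shape")
    m = np.zeros(x_hr.shape[:2] + (1,), dtype=x_hr.dtype)  # binary mask M
    m[top:top + bh, left:left + bw] = 1
    hr_to_lr = m * x_hr + (1 - m) * x_lr_s      # HR patch inside the upsampled LR image
    lr_to_hr = m * x_lr_s + (1 - m) * x_hr      # LR patch inside the HR image
    return hr_to_lr, lr_to_hr

rng = np.random.default_rng(0)
x_hr = rng.random((8, 8, 3)).astype(np.float32)
x_lr = x_hr[::2, ::2]                           # crude 2x downsampling, for the demo only
hr_to_lr, lr_to_hr = cutblur(x_hr, x_lr, s=2, top=2, left=2, bh=4, bw=4)
print(hr_to_lr.shape, lr_to_hr.shape)
```

Both outputs have the size of $x_{HR}$ and show the same scene content everywhere; only the resolution differs between the inside and the outside of the box, which is why no content boundary or missing structure is introduced.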

What learning does Cutblur encourage in the model?

The paper shows that Cutblur lets the model learn where and how to super-resolve. In other words, the model learns which regions to keep as they are (the original high-resolution parts) and which regions to concentrate on (the low-resolution parts).

Keep high-resolution areas as they are.

The figure above shows that when a high-resolution (HR) image is input, the baseline single-image super-resolution model (EDSR w/o Cutblur) over-sharpens the edges to the point of being unpleasant to look at. The model trained with Cutblur, however, does not emphasize the edges unnecessarily. (In the blue and yellow residual images, the residuals form sharp contours w/o Cutblur, whereas w/ Cutblur there are almost no residuals and the map is almost pure blue, i.e., the resolution is not increased excessively.)

Concentrate on low-resolution areas.

Cutblur also improves the restoration of low-resolution regions contained in the image. In the above figure, Cutblur images are input to each model, and we can see that the model trained w/ Cutblur has almost no residuals in the HR region and suppresses sharp residuals in the LR region (the difference is hard to see at first glance, but if you look closely, you can see the red and white rectangles in the right part of the LR region). One criticism is that such mixed-resolution inputs are rare in real images, but the paper argues that they occur in a variety of real-world images, citing out-of-focus images as an example. In summary, conventional training teaches the model only how to super-resolve uniformly over the whole image, whereas training with Cutblur images teaches it both where and how to super-resolve.
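The residual images discussed above can also be summarized numerically. The sketch below assumes that a residual map is the per-pixel absolute difference between a model output and the HR ground truth; the article does not define it, so this definition, the function names, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def residual_map(output, target):
    """Per-pixel absolute error, averaged over channels: shape (H, W)."""
    diff = output.astype(np.float64) - target.astype(np.float64)
    return np.abs(diff).mean(axis=-1)

def region_means(res, lr_region):
    """Mean residual inside the LR region and in the remaining HR region."""
    return res[lr_region].mean(), res[~lr_region].mean()

# Toy example: the output differs from the target only inside the LR region.
target = np.zeros((8, 8, 3))
output = target.copy()
lr_region = np.zeros((8, 8), dtype=bool)
lr_region[2:6, 2:6] = True
output[lr_region] += 0.2

res = residual_map(output, target)
lr_mean, hr_mean = region_means(res, lr_region)
print(lr_mean, hr_mean)
```

Comparing the two means for models trained with and without Cutblur would give a simple check of the qualitative claim: a small value in the HR region means the model leaves already-sharp areas alone.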

Comprehensive experimental results

Finally, let's look at the experimental results. The paper also includes experiments that vary the model size and the dataset size, but in this article we focus on the following three experiments.

SR performance on various benchmarks
Verification on Cutblur-like real images.
Verification (denoising) by low-level image restoration task

SR performance on various benchmarks

The above table shows the results for the synthetic dataset (DIV2K) and the real-image dataset (RealSR), where CARN is a small-scale model and RCAN and EDSR are large-scale models. For small models such as CARN, SR performance is low to begin with and the model has no capacity to spare for learning from Cutblur, which leads to underfitting and a lower score for the proposed method. However, we can see that even for such a small model Cutblur improves the score on the real-image dataset. The following figure shows the qualitative results for Urban100. Because it is a synthetic dataset, CARN can only do so much, but for the other models the overly sharp residuals are suppressed.

Verification on Cutblur-like real images.

Here are examples of real-world images in which high resolution and low resolution are mixed. Specifically, the foreground and background have different resolutions. On the left is an image taken from the web, and on the right is an image captured with an iPhone 11 Pro. The effect is especially noticeable in the iPhone 11 Pro example on the right: w/o Cutblur, the area around the letters in the red frame is blurred, which gives a strange impression, but w/ Cutblur it is not, and the residual image is clearly improved. As for the bird on the left, we can confirm that the super-resolution result w/ Cutblur is more reasonable, especially around the eyes and face. This suggests that Cutblur is effective because Cutblur-like cases exist in real images.

Verification (denoising) by low-level image restoration task

The last task is to remove Gaussian noise from an image, where Train $\sigma$ is the noise level applied when training the model. At test time, noise with $\sigma = 30$ is applied. Among the metrics, LPIPS is the only one for which lower values are better. In addition, SSIM and LPIPS are closer to human perception than PSNR. Except for the PSNR in the bottom row, all of them show improvement. You can also see in the figure below that, for the test input (upper right), the baseline (lower left) removes noise excessively and blurs the image to improve PSNR, while the proposed method (lower right) removes noise more reasonably.
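To make the noise level and the PSNR metric concrete, here is a minimal NumPy sketch. It assumes that $\sigma$ is expressed on a 0 to 255 intensity scale and that PSNR uses a peak value of 255; the article states neither, and the function names are hypothetical.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma, then clip to [0, 255]."""
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 255.0)

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
clean = np.full((256, 256), 128.0)          # flat mid-gray test image
noisy = add_gaussian_noise(clean, sigma=30, rng=rng)
print(round(psnr(clean, noisy), 2))
```

For this flat image the result should be close to $20 \log_{10}(255/30) \approx 18.6$ dB. Because PSNR depends only on the mean squared error, an over-smoothed output can score well on it while looking blurry, which is consistent with the remark above that SSIM and LPIPS are closer to human perception.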

Summary

In this article, we explained the optimal DA method for the super-resolution task, focusing on Cutblur, with which the model can learn not only how but also where to super-resolve. We have omitted some of the extensive comparisons in the original paper. The paper is a very logical and experimentally thorough investigation of task-specific DA and is a must-read if you are considering task-specific DA, even if you are not interested in super-resolution. With more and more flashy, complex, and large-scale network models appearing every day, we expect that the demand for such comprehensive DA research will quietly continue to grow.


tam_mkmk