On The Challenge Of Four Different Image Generation Tasks And The Diffusion Model Palette

Diffusion Model 16/12/2021

3 main points
✔️ One Diffusion Model for all four tasks
✔️ Palette achieves SOTA on all tasks
✔️ Palette's generalizability allows it to successfully multitask image transformation

Palette: Image-to-Image Diffusion Models
written by Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, Mohammad Norouzi
(Submitted on 10 Nov 2021)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In recent years, generative models have become able to produce human-level text (How to Get the Real Value out of GPT-3: Prompt Programming), highly realistic images, and human-like speech and music. GANs have attracted much of this attention and hold many state-of-the-art (SOTA) results. More recently, however, the Diffusion Model has also achieved several SOTA results and shown its potential (Did you beat BigGAN in image generation? About Diffusion Models).

In this article, we present a paper claiming that the Diffusion Model was able to achieve SOTA in four more tasks.

Let's take a look at the results in Figure 1, where the first row is the input image, the second row is the output of the Diffusion Model, and the third row is the reference image used for training. The paper deals with four tasks: Colorization, Inpainting, Uncropping, and JPEG decompression (restoring a JPEG-compressed image). The outputs are polished enough that no unnatural regions stand out.

Figure 2 shows an example of generating a panorama view. It takes a 256x256 pixel center image and expands it to twice the length on each side.

The distinguishing feature of this paper is that it shows different tasks can be performed by a single Diffusion Model. In other words, the paper tries to show that many tasks in generative modeling can be framed as image-to-image tasks, and that Diffusion Models can achieve SOTA on them.

Palette

Previous studies have shown that conditional Diffusion Models can generate high-resolution images as well as conditional GANs can (pre-trained GAN model to super-resolution technique). Palette is a conditional Diffusion Model of the form p(y | x): it is conditioned on an input image x and trained to generate the reference image y.

For more details about Diffusion Model, please refer to Appendix A or the related article (Beating BigGAN in image generation? About Diffusion Models ).

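The paper briefly introduces the objective function (Equation 1). Given a reference image $y$, noise $\epsilon \sim \mathcal{N}(0, I)$ is added to obtain the noisy image $\tilde{y} = \sqrt{\gamma}\, y + \sqrt{1-\gamma}\, \epsilon$. A neural network $f_\theta$ is then trained to predict the noise, taking as inputs the image $x$, the noisy image $\tilde{y}$, and the noise level $\gamma$:

$$\mathbb{E}_{(x,y)}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\, \mathbb{E}_{\gamma} \left\| f_\theta\big(x,\ \sqrt{\gamma}\, y + \sqrt{1-\gamma}\, \epsilon,\ \gamma\big) - \epsilon \right\|_p^p$$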
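As a rough illustration of this objective, the sketch below builds the noisy image as sqrt(gamma) * y + sqrt(1 - gamma) * eps (the form used in the paper's Equation 1) and evaluates the L_p loss between a predicted noise and the true noise. Images are flattened lists of floats, and `toy_denoiser` is a hypothetical stand-in for the U-Net, not the paper's network; all function names are assumptions made for this example.

```python
import math
import random


def make_noisy_target(y, eps, gamma):
    """Return sqrt(gamma) * y + sqrt(1 - gamma) * eps, element-wise."""
    a, b = math.sqrt(gamma), math.sqrt(1.0 - gamma)
    return [a * yi + b * ei for yi, ei in zip(y, eps)]


def lp_loss(predicted_noise, eps, p=2):
    """Mean of |prediction - eps|^p over all pixels (the article reports p=2)."""
    return sum(abs(f - e) ** p for f, e in zip(predicted_noise, eps)) / len(eps)


def toy_denoiser(x, y_tilde, gamma):
    """Hypothetical stand-in for f_theta: it inverts the noising formula,
    using the conditioning image x as its guess for y (requires gamma < 1)."""
    a, b = math.sqrt(gamma), math.sqrt(1.0 - gamma)
    return [(yt - a * xi) / b for xi, yt in zip(x, y_tilde)]


rng = random.Random(0)
y = [0.2, -0.5, 0.9, 0.0]               # reference image, flattened
x = list(y)                             # toy case: the input equals the reference
eps = [rng.gauss(0.0, 1.0) for _ in y]  # standard Gaussian noise
gamma = 0.7                             # noise level

y_tilde = make_noisy_target(y, eps, gamma)
loss = lp_loss(toy_denoiser(x, y_tilde, gamma), eps, p=2)
print(loss)  # close to zero, because the toy denoiser knows y exactly
```

In a real training loop the denoiser would be a learned network and x would be the degraded input (for example a grayscale image), so the loss would not vanish.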

Some previous studies suggested that p=1 (the L1 norm) works better, but Palette uses p=2 (the L2 norm) because experiments confirmed that the generated samples are more diverse with p=2. The network architecture is based on the standard U-Net with some adjustments.

The experiments use four quantitative evaluation metrics for image-to-image translation. In addition to Inception Score (IS) and Fréchet Inception Distance (FID), which are commonly used for generative models, the paper reports Classification Accuracy (CA), the classification accuracy of a pre-trained ResNet-50 on the generated images, and Perceptual Distance (PD), the Euclidean distance in the Inception-v1 representation space. In addition, human raters are shown a reference image and a generated image and asked which of the two was taken by a camera. The fool rate is the percentage of trials in which the raters choose the generated image.
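Two of these metrics are simple enough to sketch directly. The snippet below is a minimal illustration, assuming the feature vectors have already been extracted by some network (the extraction itself is not shown) and that each human trial is recorded as a boolean; the function names are hypothetical.

```python
import math


def perceptual_distance(features_a, features_b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(features_a, features_b)))


def fool_rate(picked_generated):
    """Percentage of trials in which the rater chose the generated image.

    picked_generated: list of booleans, True when the rater was fooled.
    """
    if not picked_generated:
        raise ValueError("at least one trial is required")
    return 100.0 * sum(picked_generated) / len(picked_generated)


print(perceptual_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(fool_rate([True, False, True, False]))        # 50.0
```

A fool rate near 50% would mean raters cannot tell generated images from real ones better than chance.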

Experiments

The paper tests Palette's generalization capabilities on four challenging image-to-image translation tasks: Colorization, which transforms a black-and-white image into a plausible color image; Inpainting, which fills masked areas with realistic content; Uncropping, which expands the input image in one or more directions; and JPEG decompression, which restores a JPEG-compressed image. Although the tasks differ, Palette does not tune hyperparameters, change the architecture, or adjust the loss function for each task. The input and output are both 256x256 RGB images.

Colorization

Previous studies used the LAB or YCbCr color space for colorization, whereas Palette uses the RGB color space. The results of this study suggest that RGB is as effective as YCbCr.
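To make the difference between the color spaces concrete, here is a small sketch of an RGB to YCbCr conversion. It assumes the full-range BT.601 coefficients used by JPEG/JFIF, which is an assumption for illustration; the article does not say which YCbCr variant the earlier methods used. In YCbCr the luminance (Y) is separated from the two chroma channels, so a colorization model only has to predict Cb and Cr, while an RGB model such as Palette predicts all three channels.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (BT.601, as in JFIF)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


# Gray pixels carry no chroma: Cb and Cr stay at the neutral value 128.
print(rgb_to_ycbcr(255, 255, 255))
print(rgb_to_ycbcr(0, 0, 0))
```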

Figure 3 shows the generated images alongside those of the comparison methods. The baseline used in this study appears to do better than the previous methods. Looking at the evaluation metrics in Table 1, Palette's scores are close to those of the reference images, which indicates that the proposed method is effective for colorization.

Inpainting

Palette uses free-form masks, as in previous studies. Instead of passing a binary mask to the model, the masked region is filled with Gaussian noise, which is compatible with the denoising Diffusion Model. In addition, training is sped up by computing the loss only on the masked region.
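The two ideas in this paragraph, filling the masked region with Gaussian noise and restricting the loss to that region, can be sketched as follows. This is a simplified illustration on flattened images; the function names are hypothetical, and a squared-error loss is used as a stand-in for the training loss.

```python
import random


def fill_mask_with_noise(image, mask, rng):
    """Replace masked pixels (mask value 1) with standard Gaussian noise."""
    return [rng.gauss(0.0, 1.0) if m else px for px, m in zip(image, mask)]


def masked_l2_loss(prediction, target, mask):
    """Mean squared error computed over the masked pixels only."""
    n_masked = sum(mask)
    if n_masked == 0:
        return 0.0
    total = sum(m * (p - t) ** 2 for p, t, m in zip(prediction, target, mask))
    return total / n_masked


rng = random.Random(0)
image = [0.1, 0.2, 0.3, 0.4]
mask = [0, 1, 1, 0]  # 1 marks the region to inpaint

model_input = fill_mask_with_noise(image, mask, rng)
print(model_input[0], model_input[3])  # unmasked pixels pass through unchanged
print(masked_l2_loss([1.0, 2.0, 3.0], [1.0, 0.0, 0.0], [0, 1, 0]))  # 4.0
```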

Figure 4 and Table 2 show the generated images and the quantitative results, respectively. Palette performs better on both the ImageNet and Places2 datasets.

Uncropping

Palette can extend an image in a single direction (up, down, left, or right) or in all four directions at once. In both cases, 50% of the image is masked, and the masked area is filled with Gaussian noise, as in Inpainting.
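A minimal sketch of how such 50% masks could be constructed is shown below. The exact mask geometry is an assumption: for a single direction, half of the image on that side is masked; for all directions, a centered rectangle whose sides are scaled by 1/sqrt(2) is kept, so that roughly half of the area is masked. The function name and direction strings are hypothetical.

```python
import math


def uncrop_mask(height, width, direction):
    """Return a 2D list where 1 marks pixels to generate and 0 marks kept pixels."""
    if direction == "all":
        keep_h = round(height / math.sqrt(2))
        keep_w = round(width / math.sqrt(2))
        top = (height - keep_h) // 2
        left = (width - keep_w) // 2
        return [
            [0 if top <= r < top + keep_h and left <= c < left + keep_w else 1
             for c in range(width)]
            for r in range(height)
        ]
    rules = {
        "left": lambda r, c: c < width // 2,
        "right": lambda r, c: c >= width - width // 2,
        "up": lambda r, c: r < height // 2,
        "down": lambda r, c: r >= height - height // 2,
    }
    if direction not in rules:
        raise ValueError(f"unknown direction: {direction}")
    is_masked = rules[direction]
    return [[1 if is_masked(r, c) else 0 for c in range(width)]
            for r in range(height)]


mask = uncrop_mask(256, 256, "all")
masked_fraction = sum(map(sum, mask)) / (256 * 256)
print(round(masked_fraction, 3))  # approximately 0.5
```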

Figure 5 and Table 3 show the comparison with previous studies. Palette outperforms the baselines on both the ImageNet and Places2 datasets. In particular, the high fool rate indicates that Palette is able to generate realistic images.

JPEG decompression

As in previous studies, Palette is trained on images compressed with a variety of Quality Factors (QFs). However, while previous studies limited themselves to QF >= 10, this study trains with QFs as low as 5, which makes the task more difficult.

Figure 6 and Table 4 show the generated images and the quantitative results. Palette performs significantly better than Regression. Also, the smaller the QF (that is, the more difficult the task), the larger the gap between Palette and Regression.

Sample Diversity

This section investigates the diversity of the generated images. A previous study (SR3) reported that the L1 norm (p=1) gives better results in the Diffusion Model's objective function (Equation 1), but no detailed analysis had been performed. The diversity of the generated images in the three tasks is evaluated with the SSIM index, computed between multiple outputs for the same input; the larger the SSIM, the lower the diversity.
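The idea of measuring diversity through SSIM between samples can be sketched as follows. This is a deliberately simplified version: it computes a single global SSIM value over flattened images instead of the usual windowed (or multi-scale) SSIM, and it uses the conventional constants 0.01 and 0.03, so the numbers will not match those reported in the paper. The function names are hypothetical.

```python
from itertools import combinations


def global_ssim(a, b, dynamic_range=1.0):
    """Simplified SSIM computed once over the whole image (no sliding window)."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((va - mu_a) * (vb - mu_b) for va, vb in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def mean_pairwise_ssim(samples):
    """Average SSIM over all pairs of samples generated for the same input."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("at least two samples are required")
    return sum(global_ssim(a, b) for a, b in pairs) / len(pairs)


identical = [[0.1, 0.5, 0.9, 0.3]] * 3
varied = [[0.1, 0.5, 0.9, 0.3], [0.9, 0.2, 0.1, 0.7], [0.4, 0.4, 0.6, 0.8]]
print(mean_pairwise_ssim(identical))  # about 1: no diversity
print(mean_pairwise_ssim(varied))     # lower: more diverse samples
```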

Figure 8 shows that L2 yields lower SSIM and therefore higher diversity, while Figure 7 shows that Palette can generate diverse images for the same input.

Multi-Task Learning

Although multi-task learning has been studied in many fields, it has not been studied much in the field of imaging. This section compares Palette (Multi-task), trained on multiple tasks at the same time, with Palette (Task-specific), trained on only one task at a time.

Summary

This paper shows that the Diffusion Model outperforms GANs on a variety of image-to-image translation tasks, achieving SOTA on four challenging tasks. It reaffirms the potential of the Diffusion Model shown in a previous study (Did you beat BigGAN in image generation? About Diffusion Models), this time on a more diverse set of tasks. In particular, the ability to solve a task without incorporating task-specific information contributes to the generalizability of the Diffusion Model. The paper also adapts the concept of multitasking to image-to-image translation for the first time, and we look forward to further research in this area.