
[LDDGAN] A Diffusion Model With the Fastest Inference Speed


Diffusion Model

3 main points
✔️ Diffusion models outperform GANs in image quality, diversity, and training stability, but their very slow inference makes them difficult to use in real-time applications
✔️ Prior work such as DiffusionGAN and WDDGAN significantly improved inference speed, but these models are still slower than GANs and suffer from degraded image quality
✔️ LDDGAN utilizes adversarial learning of GANs in low-dimensional latent space to maintain high image quality and diversity, and is the fastest of the diffusion models

 Latent Denoising Diffusion GAN: Faster sampling, Higher image quality
written by Luan Thanh Trinh, Tomoki Hamagami
(Submitted on 17 Jun 2024)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)


code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Diffusion models have been touted as a powerful method for generating high-quality, diverse images, often outperforming GANs. However, their slow inference speed makes them difficult to apply in real time. To address this issue, DiffusionGAN uses a conditional GAN to drastically reduce the number of denoising steps and improve inference speed. Its improved version, Wavelet Diffusion, further speeds up the process by transforming the data into wavelet space. However, these models are still slower than GANs, and their image quality is also degraded.

To fill these gaps, the paper covered in this article proposes Latent Denoising Diffusion GAN (LDDGAN). This model compresses images into a compact latent space using a pre-trained autoencoder, significantly improving inference speed and image quality. The authors also propose a Weighted Learning training strategy to increase diversity and image quality.

Experimental results on the CIFAR-10, CelebA-HQ, and LSUN Church datasets show that LDDGAN achieves the fastest execution speed among diffusion models, surpassing DiffusionGAN (DDGAN) and Wavelet Diffusion (WDDGAN). Compared to these previous studies, LDDGAN shows significant improvements in all evaluation metrics. In particular, the diversity of the generated images significantly outperforms GANs, while inference speed and image quality remain comparable to GANs.

Proposed Method

Figure 1: Overview of LDDGAN

Overview

An overview of LDDGAN is shown in Figure 1; it consists of the following four steps. (i) A pre-trained encoder transforms the input image into a low-dimensional latent variable. Unlike WDDGAN, which is limited to 4x compression, 8x and 16x compression are also possible. (ii) A diffusion process is performed in this latent space, allowing multimodal distributions as well as Gaussian ones. The number of sampling steps T is set to T ≤ 4, instead of the hundreds or thousands used in conventional diffusion models. (iii) The generator performs the reverse diffusion process by learning to predict the multimodal denoising distribution based on the discriminator's feedback. (iv) A pre-trained decoder reconstructs the image by mapping the latent variables back to pixel space.

The first two expected benefits of LDDGAN are that compressing the input images as much as possible significantly reduces the computational cost of training the diffusion model, and that inference becomes faster than in previous studies. In addition, the low-dimensional latent space is well suited to diffusion, a likelihood-based generative model, and also allows for improved output image quality and diversity. A minimal sketch of the four-step pipeline is shown below.
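The following is a minimal PyTorch-style sketch of these four steps. The module names, channel counts, and timestep conditioning are simplified placeholders for illustration only, not the architecture used in the paper.

```python
# Minimal sketch of the four-step LDDGAN pipeline (hypothetical toy modules).
import torch
import torch.nn as nn

T = 4  # number of denoising steps (T <= 4 in LDDGAN, vs. hundreds in DDPM)

class TinyEncoder(nn.Module):          # stands in for the pre-trained encoder (8x downsampling)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 4, 4, stride=2, padding=1),
        )
    def forward(self, x): return self.net(x)

class TinyDecoder(nn.Module):          # stands in for the pre-trained decoder
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(4, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )
    def forward(self, z): return self.net(z)

class TinyGenerator(nn.Module):        # predicts z_{t-1} from (z_t, t)
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4 + 1, 4, 3, padding=1)
    def forward(self, z_t, t):
        t_map = torch.full_like(z_t[:, :1], float(t) / T)   # crude timestep conditioning
        return self.net(torch.cat([z_t, t_map], dim=1))

encoder, decoder, generator = TinyEncoder(), TinyDecoder(), TinyGenerator()

# (i) compress the image into the low-dimensional latent space
x = torch.randn(1, 3, 256, 256)        # dummy input image
z_t = encoder(x)                       # 256x256x3 -> 32x32x4 (8x compression)

# (ii) forward diffusion in latent space with only T steps
betas = torch.linspace(0.1, 0.9, T)
for t in range(T):
    noise = torch.randn_like(z_t)
    z_t = (1 - betas[t]).sqrt() * z_t + betas[t].sqrt() * noise

# (iii) reverse diffusion: the generator denoises step by step
#       (during training, a discriminator judges each predicted z_{t-1})
for t in reversed(range(T)):
    z_t = generator(z_t, t)

# (iv) decode the denoised latent back to pixel space
x_hat = decoder(z_t)
print(x_hat.shape)                     # torch.Size([1, 3, 256, 256])
```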

Training the Autoencoder

The structure of the LDDGAN autoencoder is based on the VQGAN proposed by Esser et al. Its distinctive feature is that the quantization layer is incorporated within the decoder. Conventional methods typically add a Kullback-Leibler (KL) divergence penalty to the autoencoder's loss function. This encourages the learned latent space to approximate a normal distribution and is effective when the model relies heavily on Gaussian distributions. However, LDDGAN is not limited to normal distributions and also allows complex, multimodal distributions. Therefore, the authors drop this KL penalty and let the autoencoder use the latent space freely, prioritizing its ability to compress and reconstruct images.
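As a rough sketch (assuming a VQGAN-style setup; this is not the paper's exact formulation), the autoencoder is trained with only reconstruction and adversarial terms, and the KL regularization used in KL-regularized latent diffusion is omitted:

\mathcal{L}_{\mathrm{AE}} = \mathcal{L}_{\mathrm{rec}}(x, \hat{x}) + \lambda\,\mathcal{L}_{\mathrm{adv}}(\hat{x}) \qquad \text{(no } D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,\mathcal{N}(0, I)\big) \text{ term)}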

The results in Table 1 support this hypothesis. Letting the model freely explore an appropriate latent space significantly improves the results in most cases. Of particular note are the results on the CelebA-HQ dataset: the main model achieved better FID and Recall despite using an autoencoder with a worse reconstruction FID than the autoencoder trained with the KL penalty.

Table 1: Comparison of autoencoder latent spaces

Learning Loss and Weighted Learning

The adversarial loss for the LDDGAN generator and discriminator is given by the following equation.
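The equation itself appears as an image in the original article. For reference, a formulation along the lines of the DDGAN objective, written here in the latent space (the exact notation is an assumption, not taken from the paper), is:

\min_{\phi} \sum_{t \ge 1} \mathbb{E}_{q(z_t)}\Big[\mathbb{E}_{q(z_{t-1}\mid z_t)}\big[-\log D_{\phi}(z_{t-1}, z_t, t)\big] + \mathbb{E}_{p_{\theta}(z_{t-1}\mid z_t)}\big[-\log\big(1 - D_{\phi}(z_{t-1}, z_t, t)\big)\big]\Big]

with the generator trained to maximize \mathbb{E}_{q(z_t)}\,\mathbb{E}_{p_{\theta}(z_{t-1}\mid z_t)}\big[\log D_{\phi}(z_{t-1}, z_t, t)\big].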

When training with the adversarial loss alone, the model can generate images that look like real data, but convergence is slow because learning proceeds only indirectly through the discriminator. Therefore, to help the generator's training converge, a reconstruction loss, which measures the difference between the original and generated images, is also introduced, as shown in the equation below.
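The original equation is likewise an image; a typical form, assuming a simple distance between the clean latent and the generator's prediction (analogous to the reconstruction term in WDDGAN), would be:

\mathcal{L}_{\mathrm{rec}} = \big\| z_0 - \hat{z}_0 \big\|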

When dealing with multiple loss functions, conventional methods combine them linearly with fixed coefficients, meaning the weight of the reconstruction loss stays constant. However, the reconstruction loss constrains the model to reproduce the input data (only the noise differs), which may reduce the diversity of the generated samples. Therefore, LDDGAN proposes Weighted Learning, as shown in the following equation. Figure 2 shows an example of Weighted Learning.

Figure 2: Example of Weighted Learning

First, in the early stages of training, the weight of the reconstruction loss is set close to 1 to promote convergence. As training progresses, this weight is gradually reduced, and sample diversity increases because the adversarial loss is prioritized. Toward the end of training, the rate of decrease is moderated to prioritize overall stability. This approach is expected to yield faster convergence while maintaining image quality, diversity, and training stability, as sketched in the code below.
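The exact weighting formula appears as an equation and figure in the original article. Purely as an illustration (an assumed exponential decay toward a floor, not the paper's formula), a schedule with the described behavior could look like this:

```python
# Illustrative reconstruction-loss weight schedule: starts near 1, decays as
# training proceeds, and flattens out toward the end (assumed form, not the
# paper's exact equation).
import math

def recon_weight(epoch, total_epochs, w_min=0.1, k=4.0):
    progress = epoch / total_epochs
    return w_min + (1.0 - w_min) * math.exp(-k * progress)

def total_loss(adv_loss, rec_loss, epoch, total_epochs):
    w = recon_weight(epoch, total_epochs)
    return adv_loss + w * rec_loss   # adversarial term dominates late in training

# Weights at the start, middle, and end of a 500-epoch run
print([round(recon_weight(e, 500), 3) for e in (0, 250, 500)])  # [1.0, 0.222, 0.116]
```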

Table 2: Experiments to verify the effectiveness of reconstruction loss and Weighted Learning.

The experimental results in Table 2 confirm this hypothesis. Adding the reconstruction loss improved image quality (FID) on both datasets compared to relying solely on the adversarial loss, but diversity (Recall) decreased. In contrast, adopting Weighted Learning improved both image quality and diversity.

Experiment

Datasets and Evaluation Metrics

To test the effectiveness of LDDGAN, experiments were conducted on the low-resolution CIFAR-10 (32x32) dataset and the high-resolution CelebA-HQ and LSUN Church (256x256) datasets. The evaluation metrics were inference time, Fréchet Inception Distance (FID) for image quality, and Recall for diversity. For inference time, the process of generating a batch of 100 images was run for 300 trials, and the average time was measured.
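As a rough sketch of such a timing protocol (the details are an assumption, and `sample_batch` is a hypothetical sampling function, not from the paper's code):

```python
# Benchmark sketch: average wall-clock time to generate a batch of 100 images
# over 300 trials, synchronizing the GPU so the measurement is accurate.
import time
import torch

@torch.no_grad()
def benchmark(sample_batch, batch_size=100, trials=300, device="cuda"):
    times = []
    for _ in range(trials):
        if device == "cuda":
            torch.cuda.synchronize()            # make sure prior GPU work is done
        start = time.perf_counter()
        _ = sample_batch(batch_size)            # generate one batch of images
        if device == "cuda":
            torch.cuda.synchronize()            # wait for sampling to finish
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)              # average seconds per batch
```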

Comparison with Previous Studies

Table 3: Comparison on CIFAR-10
Table 4: Comparison on CelebA-HQ
Table 5: Comparison on LSUN Church

The results in Tables 3, 4, and 5 show that LDDGAN addresses the main weakness of diffusion models, achieving the fastest execution speed among diffusion models while maintaining high image quality and diversity.

Some diffusion models, such as Score SDE and DDPM, achieve better FID than LDDGAN. However, LDDGAN samples about 5000 times faster than Score SDE and 1000 times faster than DDPM, demonstrating its overwhelming superiority in speed.

Particularly noteworthy are the comparison results with the previous studies DDGAN and WDDGAN. The proposed method outperforms these methods in all evaluation metrics.

Furthermore, when compared to StyleGAN, which is considered the state of the art among GANs, LDDGAN significantly outperforms it in diversity while achieving comparable image quality and inference speed.

Figure 3: Qualitative Comparison

Figure 3 shows a qualitative comparison. LDDGAN clearly achieves better sample quality: on the CelebA-HQ dataset, both DDGAN and WDDGAN struggle to produce clear, complete human faces, often generating distorted features. Similarly, on the LSUN Church dataset, these models have difficulty accurately depicting the straight lines and horizontal details of buildings. In contrast, LDDGAN consistently produces realistic, crisp images.

Summary

This article introduced a new diffusion model, LDDGAN, which leverages adversarial learning of GANs in a low-dimensional latent space to maintain high image quality and diversity while being the fastest of all diffusion models.

Notably, the comparison with StyleGAN, considered the state-of-the-art GAN, confirmed that LDDGAN achieves image quality and inference speed comparable to StyleGAN while significantly outperforming it in diversity.

On the other hand, a possible drawback is that the autoencoder may limit the overall performance of the model. Looking ahead, improvements to the autoencoder and the investigation of generator architectures specialized for the latent space are expected to further improve the effectiveness of LDDGAN.

