Wavelet Diffusion: The Fastest Diffusion Model
3 main points
✔️ Diffusion models outperform GANs with high output image quality and high diversity, but are difficult to use in real-time due to very slow inference speed
✔️ DiffusionGAN, a prior study, combined the diffusion process with GAN mechanisms to significantly speed up inference
✔️ Building on DiffusionGAN, the proposed method achieves the fastest inference among diffusion models by using a wavelet transform to decompose the input into low- and high-frequency components, compressing it by a factor of 4 while maintaining high image quality
Wavelet Diffusion Models are fast and scalable Image Generators
written by Hao Phung, Quan Dao, Anh Tran
(Submitted on 29 Nov 2022 (v1), last revised 22 Mar 2023 (this version, v2))
Comments: Accepted to CVPR 2023
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Diffusion models have recently emerged and are developing rapidly, attracting the interest of many researchers. These models generate high-quality images from random noise inputs. In particular, they outperform GANs, the previous state-of-the-art generative models, on image generation tasks. Because diffusion models can flexibly handle a variety of conditional inputs, a wide range of applications are possible, including text-to-image generation, image-to-image translation, and image restoration. This makes them promising for AI-based digital art and other domains.
Diffusion models have great potential, but their very slow inference has hindered the kind of widespread adoption GANs enjoy. A basic diffusion model takes several minutes to reach the desired output quality. Although many studies have been proposed to reduce inference time, even the fastest algorithms take several seconds to produce a 32 x 32 image. DiffusionGAN dramatically improved inference time by combining a diffusion model with a GAN, but it is still not fast enough for large-scale or real-time applications.
To enable real-time applications, the paper covered in this article proposes a new diffusion model called Wavelet Diffusion. A discrete wavelet transform decomposes the input into low- and high-frequency components, compressing it by a factor of 4 and significantly reducing inference time. The authors also propose a wavelet-aware generator that exploits wavelet features more effectively in order to maintain output quality. Experimental results confirm that Wavelet Diffusion achieves the highest speed among diffusion models while maintaining high image quality.
For convenience, DiffusionGAN will be abbreviated as DDGAN in later sections.
Proposed Method
Wavelet-based diffusion scheme
In this paper, the input image is decomposed into four wavelet subbands, which are concatenated and fed into the diffusion process as a single input (Figure 1). The model therefore operates on the wavelet spectrum rather than on the original image space. As a result, it can exploit high-frequency information to add more detail to the generated images. At the same time, each wavelet subband is four times smaller than the original image, which greatly reduces the computational cost of the sampling process.
The method in this paper is based on the DDGAN model, and its input is the four wavelet subbands of the input image. Given an image x ∈ R^(3×H×W), it is decomposed into low- and high-frequency subbands, which are concatenated to form y ∈ R^(12×(H/2)×(W/2)). This input is projected onto the base channel dimension D by the first linear layer, so the width of the network is unchanged compared to DDGAN. Most of the network thus has its computation greatly reduced thanks to the 4× smaller spatial dimensions.
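As a rough illustration of this decomposition (a minimal sketch, not the authors' implementation; it assumes the PyWavelets package and a Haar wavelet, and ignores any normalization the paper may apply), the following splits an RGB image into its four subbands and stacks them along the channel axis:

```python
# Minimal sketch of the wavelet decomposition used as model input.
# Assumes PyWavelets (pywt) and a Haar wavelet; the paper's actual
# implementation may differ in wavelet choice and normalization.
import numpy as np
import pywt

x = np.random.rand(3, 256, 256).astype(np.float32)  # stand-in RGB image (C, H, W)

# dwt2 over the last two axes returns the low-frequency subband LL and the
# three high-frequency subbands (LH, HL, HH), each of size (3, H/2, W/2).
LL, (LH, HL, HH) = pywt.dwt2(x, "haar", axes=(-2, -1))

# Concatenate along the channel axis: (12, H/2, W/2), the diffusion model's input y.
y = np.concatenate([LL, LH, HL, HH], axis=0)
print(y.shape)  # (12, 128, 128)

# The original image is recovered exactly by the inverse transform.
x_rec = pywt.idwt2((LL, (LH, HL, HH)), "haar", axes=(-2, -1))
assert np.allclose(x, x_rec, atol=1e-5)
```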
Training loss functions
・Adversarial loss
As in DDGAN, the generator and discriminator are optimized with an adversarial loss:
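The equation is not reproduced here; as a hedged sketch of the DDGAN-style objective carried over to the wavelet subbands (notation assumed, not copied from the paper; expectations over data and noise are omitted), the time-conditioned discriminator D is trained to distinguish real pairs (y_{t-1}, y_t) from pairs produced by the generator, which in turn tries to fool it:

```latex
% Hedged sketch of a DDGAN-style adversarial objective on wavelet subbands.
% y_t: noisy subbands at step t; y'_{t-1}: subbands denoised by the generator.
\mathcal{L}_{D}^{adv} = -\log D(y_{t-1}, y_t, t) - \log\bigl(1 - D(y'_{t-1}, y_t, t)\bigr),
\qquad
\mathcal{L}_{G}^{adv} = -\log D(y'_{t-1}, y_t, t)
```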
・Reconstruction loss and overall loss function
In addition to the adversarial loss above, a reconstruction term is added to prevent the loss of frequency information and to preserve the consistency of the wavelet subbands. It is formulated as the L1 loss between the generated output and its ground truth.
The overall objective of the generator is then the following linear combination of the adversarial and reconstruction losses:
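Written out from the description above (notation assumed here), the combined objective has the form:

```latex
% Overall generator objective: adversarial term plus a weighted L1 reconstruction term.
\mathcal{L}_{G} = \mathcal{L}_{G}^{adv} + \lambda \, \mathcal{L}_{rec},
\qquad
\mathcal{L}_{rec} = \lVert y'_{0} - y_{0} \rVert_{1}
```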
where λ is a weighting hyperparameter. After the defined number of sampling steps, the estimated denoised subbands y'_0 are obtained, and the final image is recovered with the inverse wavelet transform: x'_0 = IWT(y'_0).
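As a minimal illustration (hypothetical tensor names; λ = 1 is an assumption, not the paper's setting), the combined objective might be computed as follows in PyTorch:

```python
# Minimal sketch of the combined generator objective described above.
# `d_fake` is the discriminator's raw output (logit) for the generated subbands,
# `y0_pred` / `y0_true` are the predicted and ground-truth clean subbands y'_0 / y_0.
import torch.nn.functional as F

def generator_loss(d_fake, y0_pred, y0_true, lam=1.0):
    adv = F.softplus(-d_fake).mean()   # non-saturating adversarial term, i.e. -log D(...)
    rec = F.l1_loss(y0_pred, y0_true)  # L1 reconstruction term on the wavelet subbands
    return adv + lam * rec             # λ weights the reconstruction term
```

The final recovery x'_0 = IWT(y'_0) corresponds to the pywt.idwt2 call shown in the decomposition sketch earlier.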
Generator incorporating wavelets
Figure 2 shows the structure of the proposed wavelet-embedded generator. The generator follows a UNet structure with M downsampling blocks and M upsampling blocks, plus skip connections between blocks of the same resolution. However, instead of the usual downsampling and upsampling operators, frequency-aware blocks are used. At the lowest resolution, frequency bottleneck blocks are employed to attend more effectively to the low- and high-frequency components.
Finally, a frequency residual connection using a wavelet downsampling layer is introduced to incorporate the original signal Y into the feature pyramid of the encoder. Here, Y denotes the input and Fi denotes the i-th intermediate feature map of the encoder. A sketch of this idea follows below.
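One plausible reading of such a frequency residual connection (module and parameter names are hypothetical, not the authors' code) is that the input Y is repeatedly wavelet-downsampled to the resolution of the i-th encoder feature map, projected by a 1x1 convolution, and added to that feature map:

```python
# Hedged sketch of a frequency residual connection: the wavelet-domain input Y
# is downsampled (keeping only the low-frequency component) to the resolution
# of the i-th encoder feature map and added to it after a 1x1 projection.
import torch.nn as nn
import torch.nn.functional as F

def haar_downsample(y):
    # One Haar level halves H and W; average pooling times 2 equals the LL subband
    # under the orthonormal Haar convention.
    return F.avg_pool2d(y, kernel_size=2) * 2

class FrequencyResidual(nn.Module):
    def __init__(self, in_ch, feat_ch, levels):
        super().__init__()
        self.levels = levels                      # Haar levels needed to match F_i's resolution
        self.proj = nn.Conv2d(in_ch, feat_ch, 1)  # 1x1 projection to the feature channel count

    def forward(self, y, feat):
        for _ in range(self.levels):
            y = haar_downsample(y)
        return feat + self.proj(y)
```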
Frequency-aware downsampling and upsampling blocks
Traditional approaches use blurring kernels in the downsampling and upsampling process to reduce aliasing artifacts. Instead, this paper takes advantage of the inherent properties of the wavelet transform to better upsample and downsample (shown in Figure 3).
This makes these operations more aware of high-frequency information. Specifically, the downsampling block takes a tuple of input features Fi, latent z, and time embedding t, processes them through a series of layers, and returns downsampled features together with high-frequency subbands. The returned subbands serve as additional input to the corresponding upsampling block, which upsamples the features based on these frequency cues.
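As a very rough sketch of this idea (module names are hypothetical and the conditioning on z and t is omitted for brevity), a downsampling block could split its features with a Haar DWT, keep the low-frequency part as the downsampled output, and hand the high-frequency subbands to the matching upsampling block, which folds them back in with an inverse DWT:

```python
# Hedged sketch of frequency-aware down/upsampling via a Haar DWT.
# Conditioning on the latent z and the time embedding t is omitted.
import torch
import torch.nn as nn

def haar_dwt(x):
    # Split (B, C, H, W) into low-frequency LL and stacked high-frequency (LH, HL, HH).
    a = x[:, :, 0::2, 0::2]; b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]; d = x[:, :, 1::2, 1::2]
    LL = (a + b + c + d) / 2
    LH = (a - b + c - d) / 2
    HL = (a + b - c - d) / 2
    HH = (a - b - c + d) / 2
    return LL, torch.cat([LH, HL, HH], dim=1)

def haar_idwt(LL, high):
    # Inverse of haar_dwt: rebuild the 2x2 pixel blocks from the four subbands.
    LH, HL, HH = torch.chunk(high, 3, dim=1)
    a = (LL + LH + HL + HH) / 2; b = (LL - LH + HL - HH) / 2
    c = (LL + LH - HL - HH) / 2; d = (LL - LH - HL + HH) / 2
    B, C, H, W = LL.shape
    x = LL.new_zeros(B, C, 2 * H, 2 * W)
    x[:, :, 0::2, 0::2] = a; x[:, :, 0::2, 1::2] = b
    x[:, :, 1::2, 0::2] = c; x[:, :, 1::2, 1::2] = d
    return x

class FreqDownBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, feat):
        low, high = haar_dwt(self.conv(feat))
        return low, high  # `high` is passed on to the matching upsampling block

class FreqUpBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, feat, high):
        return self.conv(haar_idwt(feat, high))  # restore resolution using the frequency cues
```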
Experiment
Datasets
Experiments were performed on the 32x32 CIFAR-10, 64x64 STL-10, and 256x256 CelebA-HQ and LSUN-Church datasets. Experiments were also conducted on high-resolution CelebA-HQ images (512 and 1024) to verify the effectiveness of the proposed method at higher resolutions.
Evaluation metrics
Image quality is measured by the Fréchet Inception Distance (FID) and sample diversity by Recall; as with DDGAN, FID and Recall are computed over 50,000 generated samples. Inference speed is measured as the average inference time over 300 trials with a batch size of 100. For high-resolution images such as CelebA-HQ 512 x 512, inference time is computed with batches of 25 samples.
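Average inference time of this kind is typically measured along the following lines (a generic sketch, not the authors' benchmarking script; `sample` is a placeholder for the model's sampling routine):

```python
# Generic sketch of measuring average per-batch inference time on a GPU.
import time
import torch

@torch.no_grad()
def average_inference_time(sample, batch_size=100, trials=300, image_size=32):
    times = []
    for _ in range(trials):
        noise = torch.randn(batch_size, 3, image_size, image_size, device="cuda")
        torch.cuda.synchronize()              # finish any pending GPU work first
        start = time.perf_counter()
        sample(noise)                         # run the full sampling chain
        torch.cuda.synchronize()              # wait for this batch to complete
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```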
Experimental results
Comparison results with representative generative models (VAEs, GANs, and diffusion models) on each dataset are shown in Tables 1, 2, and 3. Compared to the VAE SOTA, i.e. the strongest VAE model, the proposed method is significantly better on all evaluation metrics. In particular, FID, which measures image quality, is more than four times better (lower) than the VAE's.
Compared to the GAN SOTA, the inference speed is at about the same level and the image quality is higher. In terms of diversity, it is about 10% higher than the GANs in all cases.
Compared to diffusion models, including DDGAN, the proposed method achieves the highest inference speed among diffusion models; in particular, it is more than 500 times faster than the diffusion SOTA. Image quality and diversity are also top-ranked, in some cases 1 to 2 points better than the diffusion-model SOTA. It also outperforms the prior study, DDGAN, on all evaluation metrics.
Effectiveness of generators incorporating wavelets
The effect of each individual component of the proposed generator was tested on CelebA-HQ 256x256. Here, the full model includes the frequency residual connections, the frequency-aware upsampling and downsampling blocks, and the frequency bottleneck blocks. As shown in Table 4, each component has a positive impact on model performance. The best performance, an FID of 5.94, is achieved by applying all three proposed components. However, the improvement comes at a slight cost in inference speed.
Execution time for generating one image
Moreover, as would be expected in a real application, the proposed method is also very fast at generating a single image. Table 5 shows the timings and key parameters. The proposed method can generate images up to 1024 x 1024 in only about 0.1 second, making it the first diffusion model to approach real-time performance.
Conclusion
We introduced a new diffusion model called Wavelet Diffusion, which shows excellent performance in both image quality and sampling speed. By incorporating wavelet transforms in both the image and feature space, the proposed method achieves state-of-the-art execution speed among diffusion models, closing the gap with SOTA GANs such as StyleGAN2 while obtaining image generation quality nearly comparable to theirs and to other diffusion models. Furthermore, the proposed method converges faster than the baseline DDGAN, confirming the efficiency of the proposed framework.