
Diffusion2GAN: Knowledge Distillation Of Diffusion Models Into Conditional GANs


Image Generation

3 main points
✔️ Diffusion models show excellent image generation quality on difficult datasets such as LAION, but high-quality results require tens to hundreds of sampling steps
✔️ Proposes a one-step, high-quality generative model by distilling knowledge from diffusion models into conditional GANs
✔️ Proposes a new knowledge distillation loss and discriminator to improve the quality of the generated images

Distilling Diffusion Models into Conditional GANs
written by Minguk Kang, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu, Taesung Park
(Submitted on 9 May 2024)
Comments: Published on arxiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Diffusion models have shown excellent image generation quality on difficult datasets such as LAION, but require tens to hundreds of sampling steps for high-quality results. Therefore, models such as DALL-E 2, Imagen, and Stable Diffusion have high latency and are difficult to interact with in real time. If a one-step model were available, it would improve the user experience of text-to-image generation and could be used in 3D and video applications.

A simple solution is to train a one-step model from scratch. While GANs are good one-step models in simple domains, text-to-image generation on large, diverse datasets remains challenging. This difficulty stems from having to perform two tasks at once in an unsupervised manner: finding a correspondence between noise and natural images, and optimizing the noise-to-image mapping.

The idea of the paper presented here is to solve these tasks one at a time: first, find the correspondence between noise and images using a pre-trained diffusion model, and then let a conditional GAN learn the noise-to-image mapping in a paired image-to-image translation framework. This approach combines the high quality of the diffusion model with the fast mapping of the conditional GAN.

In the experiments, the proposed Diffusion2GAN framework is used to distill Stable Diffusion 1.5 into a single-step conditional GAN model. Diffusion2GAN maps the noise of the target diffusion model to images more faithfully than other distillation methods. It also outperforms UFOGen and DMD on the COCO2014 benchmark, and in SDXL distillation it surpasses SDXL-Turbo and SDXL-Lightning.

Proposed Method

Overview

Figure 1: Overview of Diffusion2GAN

An overview of the proposed method is shown in Figure 1. First, the output latent variables of the diffusion model are collected together with the corresponding input noise and prompts. The generator is then trained to map the noise and prompt to the target latent variable using an E-LatentLPIPS loss and a GAN loss. The output of the generator can be decoded into RGB pixels, but this computationally expensive operation is not performed during training. The E-LatentLPIPS loss and GAN loss are discussed in detail in the following sections.
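For illustration, one generator update in this pipeline might look like the following PyTorch-style sketch. The modules `generator`, `discriminator`, and `e_latent_lpips`, the loss weight `lambda_gan`, and the pre-collected (noise, prompt, target latent) triplets are hypothetical placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def generator_step(generator, discriminator, e_latent_lpips,
                   z, c, target_latent, lambda_gan=0.5):
    """One generator update on a pre-collected (noise z, prompt c, target latent)
    triplet produced by the teacher diffusion model (hypothetical sketch)."""
    fake_latent = generator(z, c)  # single forward pass, no iterative sampling

    # Regression term: match the teacher's output directly in latent space.
    loss_percep = e_latent_lpips(fake_latent, target_latent).mean()

    # Adversarial term: non-saturating GAN loss from the conditional discriminator,
    # which sees the same noise z and prompt c as the generator.
    loss_gan = F.softplus(-discriminator(fake_latent, z, c)).mean()

    # Note: the VAE decoder is never invoked here; decoding the predicted latent
    # to RGB pixels happens only at inference time.
    return loss_percep + lambda_gan * loss_gan  # lambda_gan is an arbitrary weight
```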

E-LatentLPIPS loss

The conventional knowledge distillation loss is given by the following equation: the one-step generator $G$ is trained to regress, under a distance $d$, the target latent $\hat{x}(z, c)$ produced by the teacher's ODE solver from noise $z$ and prompt $c$:

$\arg\min_G \; \mathbb{E}_{z,c}\left[ d\big(G(z, c),\, \hat{x}(z, c)\big) \right]$

This loss can be used as is (Figure 2-b), but when $d$ is the LPIPS perceptual distance, which was designed for pixel space, the latent variables must first be mapped to pixel space by the decoder. Since decoding is computationally expensive, a method is needed to compute the perceptual distance directly in latent space, without decoding to pixels.
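To make the cost concrete, a naive version of this loss would decode both latents and compare them with pixel-space LPIPS, roughly as in the sketch below; `decoder` and `lpips_pixel` are stand-ins for the Stable Diffusion VAE decoder and a standard LPIPS network, and the shapes in the comments assume 512x512 outputs.

```python
def naive_pixel_lpips_loss(decoder, lpips_pixel, fake_latent, target_latent):
    """Pixel-space perceptual loss between decoded latents (illustrative only).
    Both latents must pass through the VAE decoder, e.g. (B, 4, 64, 64) ->
    (B, 3, 512, 512), and this decoding dominates the cost of the loss."""
    fake_rgb = decoder(fake_latent)
    target_rgb = decoder(target_latent)  # a second expensive decoder pass
    return lpips_pixel(fake_rgb, target_rgb).mean()
```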

To achieve this, we trained a VGG network on ImageNet in the latent space of Stable Diffusion, following the procedure of Zhang et al. Since the latent space is already downsampled by a factor of 8, the architecture is slightly modified. We then linearly calibrated the intermediate features on the BAPPS dataset to obtain a perceptual distance that operates directly in latent space:

$d_{\text{LatentLPIPS}}(x_0, x_1) = \ell(F(x_0), F(x_1))$
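A minimal sketch of how such a latent-space metric could be assembled is given below, following the usual LPIPS recipe of unit-normalized features and linear calibration. Here `latent_vgg` (the feature extractor $F$) and `lin_heads` (the calibration layers fitted on BAPPS) are assumed to exist; this is not the authors' released implementation.

```python
import torch.nn as nn

class LatentLPIPS(nn.Module):
    """LPIPS-style distance computed directly on Stable Diffusion latents.

    latent_vgg: a VGG-style network trained on ImageNet in latent space;
                called as latent_vgg(x) and returning a list of feature maps.
    lin_heads:  one 1x1 convolution per feature level, linearly calibrated
                on the BAPPS perceptual-similarity dataset.
    (Both are placeholders assumed by this sketch.)"""

    def __init__(self, latent_vgg, lin_heads):
        super().__init__()
        self.latent_vgg = latent_vgg
        self.lin_heads = nn.ModuleList(lin_heads)

    def forward(self, x0, x1):
        feats0, feats1 = self.latent_vgg(x0), self.latent_vgg(x1)
        dist = 0.0
        for f0, f1, head in zip(feats0, feats1, self.lin_heads):
            # Unit-normalize each spatial position across channels, then take a
            # linearly weighted squared difference and average it spatially.
            f0 = f0 / (f0.norm(dim=1, keepdim=True) + 1e-8)
            f1 = f1 / (f1.norm(dim=1, keepdim=True) + 1e-8)
            dist = dist + head((f0 - f1) ** 2).mean(dim=(2, 3)).squeeze(1)
        return dist  # one distance value per item in the batch
```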

Figure 2: Effectiveness verification of the proposed loss function on reconstructed images

However, as shown in Figure 2-c, when LatentLPIPS is applied directly as the distillation loss, wavy, patchy artifacts appear in the generated images.

Inspired by E-LPIPS, random differentiable augmentations (general geometric transformations and cutout) are applied to both the generated and target latent variables, with a new augmentation randomly sampled at each step. When applied to single-image optimization, this ensemble strategy reconstructs the target image almost perfectly (see Figure 2-d). The new loss, abbreviated Ensembled-LatentLPIPS or E-LatentLPIPS, is defined as

$d_{\text{E-LatentLPIPS}}(x_0, x_1) = \mathbb{E}_{T}\left[ \ell\big(F(T(x_0)),\, F(T(x_1))\big) \right]$

where $T$ is a randomly sampled augmentation.

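Under the same placeholder assumptions as the previous sketch, E-LatentLPIPS can be written as LatentLPIPS averaged over shared random augmentations. The flip-and-cutout augmentation below is only a crude, differentiable stand-in for the geometric transformations and cutout described above.

```python
import torch

def sample_augmentation(h, w):
    """Sample one random differentiable augmentation T and return it as a closure,
    so that exactly the same T is applied to both latents (a crude stand-in for
    the paper's geometric transformations and cutout)."""
    flip = bool(torch.rand(()) < 0.5)
    size = max(h // 4, 1)
    top = int(torch.randint(0, h - size, (1,)))
    left = int(torch.randint(0, w - size, (1,)))

    def T(latent):
        if flip:
            latent = torch.flip(latent, dims=[-1])  # horizontal flip
        mask = torch.ones_like(latent)
        mask[..., top:top + size, left:left + size] = 0.0  # cutout
        return latent * mask

    return T

def e_latent_lpips(latent_lpips, x0, x1, num_samples=1):
    """d_E-LatentLPIPS(x0, x1) ~ average over sampled T of latent_lpips(T(x0), T(x1))."""
    total = 0.0
    for _ in range(num_samples):
        T = sample_augmentation(x0.shape[-2], x0.shape[-1])
        total = total + latent_lpips(T(x0), T(x1)).mean()
    return total / num_samples
```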
Conditional diffusion discriminator

In the previous section, we showed that diffusion distillation can be framed as a paired noise-to-latent translation task. Inspired by the effectiveness of conditional GANs for paired image-to-image translation, a conditional discriminator is used. The conditions for this discriminator include the text description $c$ as well as the Gaussian noise $z$ provided to the generator. The new discriminator incorporates these conditions while reusing pre-trained diffusion weights. Specifically, the following min-max objective is optimized over the generator $G$ and discriminator $D$:

$\min_G \max_D \; \mathbb{E}_{z,c}\Big[\log D\big(\hat{x}(z, c),\, z,\, c\big) + \log\big(1 - D(G(z, c),\, z,\, c)\big)\Big]$

The discriminator design is outlined in Figure 3.

Figure 3: Multi-scale conditional discriminator design
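As a rough sketch of this adversarial objective, the discriminator and generator losses could be written as below, with the discriminator conditioned on the same noise $z$ and text embedding $c$ as the generator. The non-saturating formulation, module names, and call signatures are assumptions of this sketch; the paper's multi-scale design and reuse of pre-trained diffusion weights are not reproduced here.

```python
import torch.nn.functional as F

def discriminator_loss(discriminator, real_latent, fake_latent, z, c):
    """Train D to tell teacher latents (real) from generator latents (fake),
    given the same conditioning (noise z, prompt embedding c)."""
    logits_real = discriminator(real_latent, z, c)
    logits_fake = discriminator(fake_latent.detach(), z, c)  # block gradients to G
    return F.softplus(-logits_real).mean() + F.softplus(logits_fake).mean()

def generator_gan_loss(discriminator, fake_latent, z, c):
    """Non-saturating generator loss: raise D's logits on generated latents."""
    return F.softplus(-discriminator(fake_latent, z, c)).mean()
```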

Experiments

Comparison with distilled diffusion models

Tables 1 and 2 compare Diffusion2GAN with state-of-the-art diffusion distillation models on COCO2014 and COCO2017. InstaFlow-0.9B achieves FID 13.10 on COCO2014 and FID 23.4 on COCO2017, while Diffusion2GAN achieves FID 9.29 and FID 19.5, respectively. Unlike other distilled diffusion models, Diffusion2GAN is trained to closely follow the trajectory of the original model, which mitigates the diversity-collapse problem while maintaining high visual quality. While ADD-M uses the ViT-g-14 text encoder and achieves a high CLIP-5k score, Diffusion2GAN is trained without this encoder.

Table 1: Comparison with recent text-to-image models on COCO2014.
Table 2: Comparison with recent text-to-image models on COCO2017.

Visual Analysis

Figure 4: Visual comparison with Stable Diffusion 1.5 Teacher

Figure 4 visually compares the proposed method with Stable Diffusion 1.5, LCM-LoRA, and InstaFlow. Since diffusion models tend to produce more photorealistic images as the classifier-free guidance (CFG) scale increases, Diffusion2GAN is trained on the SD-CFG-8 dataset and compared against Stable Diffusion 1.5 with the same guidance scale of 8. For LCM-LoRA and InstaFlow, the respective best settings are followed to ensure a fair comparison. The results show that the proposed method produces more realistic images than the other distillation baselines while preserving the overall layout of the target images produced by the Stable Diffusion teacher.

Speed of Training

Even including the preparation cost of the ODE dataset, Diffusion2GAN converges more efficiently than existing distillation methods. On the CIFAR10 dataset, the total number of function evaluations of the generator network during training is compared, and the proposed method already surpasses the FID of Consistency Distillation trained with LPIPS loss at 500k supervised outputs (Table 3). For text-to-image synthesis, the full version of Diffusion2GAN achieves a better FID than InstaFlow with far fewer GPU days (Table 4).

Table 3: Comparison of convergence on CIFAR10
Table 4: Comparison of required computing resources and image quality

Summary

We have introduced Diffusion2GAN, a framework that distills a pre-trained multi-step diffusion model into a single-step generator trained with a conditional GAN objective and a perceptual loss. The proposed method shows that splitting generative modeling into two tasks, identifying correspondences and learning the mapping, improves performance and computational efficiency by using a different generative model for each task. We believe this simple approach can benefit not only interactive image generation but also efficient video and 3D applications.

However, the proposed method has several limitations. First, it uses a fixed guidance scale, so different guidance values cannot be used at inference time. Second, it relies on the quality of the teacher model, which limits its performance. Third, scaling up the student and teacher models still leads to a loss of diversity. If these three issues can be resolved in future research, the model will become more practical.

