
U-ViT: ViT Backbone For Diffusion Models

Image Generation

3 main points
✔️ Diffusion models outperform traditional GANs in image generation tasks
✔️ Diffusion models mainly use a CNN-based U-Net; this work improves performance by introducing a ViT backbone
✔️ The ViT-based U-Net (U-ViT) achieves the best (lowest) FID for image generation on ImageNet and MS-COCO

All are Worth Words: A ViT Backbone for Diffusion Models
written by Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, Jun Zhu
(Submitted on 25 Sep 2022, last revised 25 Mar 2023)
Accepted to CVPR 2023. Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Diffusion models are powerful deep generative models that have recently emerged for high-quality image generation. Diffusion models are growing rapidly and are being applied to text-to-image generation, image-to-image generation, video generation, speech synthesis, and 3D synthesis.

Along with algorithmic improvements, backbone improvements play an important role in diffusion models. A prime example is the convolutional neural network (CNN)-based U-Net used in previous studies. A CNN-based U-Net is characterized by a series of downsampling blocks, a series of upsampling blocks, and long skip connections between the two groups. It plays a dominant role in diffusion models for image generation.

Vision transformers (ViTs), on the other hand, have shown promising results in a variety of vision tasks; in some cases, ViTs perform as well as or better than CNN-based approaches. This raises a natural question: is there a need to rely on CNN-based U-Nets in diffusion models?

This article covers U-ViT, a U-Net-style architecture based on ViT. The proposed method achieved the best FID (a measure of generated image quality, where lower is better) for image generation on ImageNet and MS-COCO.

Proposed Method

Figure 1: Overview of U-ViT

An overview of U-ViT is shown in Figure 1. The network takes the time \(t\), condition \(c\), and noisy image \(x_t\) of the diffusion process and predicts the noise injected into \(x_t\). Following ViT's design approach, the image is divided into patches, and U-ViT treats all inputs, including time, condition, and image patches, as tokens (words).
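The token view described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, the embedding width `D = 64`, the patch size of 2, and the random weights are all hypothetical, and the time and condition embeddings are stand-in vectors rather than the output of a real embedding layer.

```python
import numpy as np

def patchify(x, patch_size):
    """Split an image of shape (C, H, W) into flattened patches of shape (L, C * p * p)."""
    c, h, w = x.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((h // p) * (w // p), c * p * p)

def build_token_sequence(x_t, time_emb, cond_emb, w_patch, patch_size):
    """Return [time token, condition token, patch tokens] as one (L + 2, D) sequence."""
    patch_tokens = patchify(x_t, patch_size) @ w_patch  # linear patch embedding
    return np.concatenate([time_emb[None, :], cond_emb[None, :], patch_tokens], axis=0)

rng = np.random.default_rng(0)
D, p = 64, 2                                  # assumed width and patch size
x_t = rng.normal(size=(3, 32, 32))            # noisy image, CIFAR10-sized
w_patch = rng.normal(size=(3 * p * p, D)) * 0.02
time_emb = rng.normal(size=D)                 # stand-in for an embedding of t
cond_emb = rng.normal(size=D)                 # stand-in for an embedding of c
tokens = build_token_sequence(x_t, time_emb, cond_emb, w_patch, p)
print(tokens.shape)  # (258, 64): 256 patch tokens plus 2 extra tokens
```

Once everything is a token, the same transformer blocks process time, condition, and image content without any input-specific machinery.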

Like CNN-based U-Net, U-ViT uses long skip connections between shallow and deep layers. Training diffusion models is a pixel-level prediction task and is sensitive to low-level features. Long skip connections provide shortcuts to low-level features and facilitate training of noise prediction networks.

In addition, U-ViT optionally adds a 3x3 convolution block before output. This is to prevent potential artifacts in the image produced by the transformer.

Each part of the U-ViT is described in detail in the following subsections.

Implementation Details

In this section, the design choices of U-ViT are compared using the image quality (FID) of images generated on CIFAR10. A summary of the overall results is shown in Figure 2.

Figure 2: Optimization of the structure of U-ViT

How to combine long skip connections

Let $h_m, h_s \in \mathbb{R}^{L \times D}$ be the embeddings from the main branch and the long skip branch, respectively. We consider several ways to combine them before feeding them into the next transformer block:

1. concatenate $h_m, h_s$ and then apply a linear projection (see Figure 1): $\text{Linear}(\text{Concat}(h_m, h_s))$.

2. add $h_m, h_s$ directly: $h_m + h_s$.

3. apply a linear projection to $h_s$ and add it to $h_m$: $h_m + \text{Linear}(h_s)$.

4. add $h_m, h_s$ and then apply a linear projection: $\text{Linear}(h_m + h_s)$.

5. remove the long skip connections.

As shown in Figure 2(a), the first method, $\text{Linear}(\text{Concat}(h_m, h_s))$, showed the best results. In particular, it significantly improved the quality of the generated images compared to the variant without long skip connections.
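The five variants compared above can be written side by side as a minimal NumPy sketch. The function name `combine_skip`, the mode strings, and the random weights are hypothetical and chosen only for illustration; a real model would learn the projection weights.

```python
import numpy as np

def linear(h, w, b):
    """Affine map applied to the last axis of h."""
    return h @ w + b

def combine_skip(h_m, h_s, mode, w=None, b=None):
    """Combine main-branch tokens h_m and skip-branch tokens h_s, both (L, D)."""
    if mode == "concat_linear":      # 1. Linear(Concat(h_m, h_s)); w has shape (2D, D)
        return linear(np.concatenate([h_m, h_s], axis=-1), w, b)
    if mode == "add":                # 2. h_m + h_s
        return h_m + h_s
    if mode == "add_linear_skip":    # 3. h_m + Linear(h_s); w has shape (D, D)
        return h_m + linear(h_s, w, b)
    if mode == "linear_add":         # 4. Linear(h_m + h_s); w has shape (D, D)
        return linear(h_m + h_s, w, b)
    if mode == "none":               # 5. no long skip connection
        return h_m
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(0)
L, D = 258, 64
h_m = rng.normal(size=(L, D))
h_s = rng.normal(size=(L, D))
w_cat = rng.normal(size=(2 * D, D)) * 0.02
out = combine_skip(h_m, h_s, "concat_linear", w_cat, np.zeros(D))
print(out.shape)  # (258, 64)
```

Note that variant 1 is the only one whose projection sees $h_m$ and $h_s$ as separate inputs, so it can weight the two branches independently per channel.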

How to input the time condition

We consider two ways to input the time condition $t$ into the network. Method (1) is to treat it as a token, as shown in Figure 1. Method (2) is to incorporate time after layer normalization in the transformer block, which is similar to the adaptive group normalization used in U-Net; this second method is called adaptive layer normalization (AdaLN). As shown in Figure 2(b), method (1), which treats time as a token, performs better than AdaLN.
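The difference between the two options can be illustrated with a small NumPy sketch. The AdaLN form used here, a scale and shift obtained from linear projections of the time embedding and applied after layer normalization, is one common formulation and is an assumption on our part; the function names and weight shapes are hypothetical.

```python
import numpy as np

def layer_norm(h, eps=1e-6):
    """Normalize each token over its feature axis (no learned affine parameters)."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def time_as_token(h, time_emb):
    """Method (1): prepend the time embedding as one extra token, (L, D) -> (L + 1, D)."""
    return np.concatenate([time_emb[None, :], h], axis=0)

def ada_layer_norm(h, time_emb, w_scale, w_shift):
    """Method (2): modulate normalized tokens with a time-dependent scale and shift."""
    y_s = time_emb @ w_scale   # (D,) scale derived from the time embedding
    y_b = time_emb @ w_shift   # (D,) shift derived from the time embedding
    return y_s * layer_norm(h) + y_b

rng = np.random.default_rng(0)
L, D = 256, 64
h = rng.normal(size=(L, D))
time_emb = rng.normal(size=D)
w_scale = rng.normal(size=(D, D)) * 0.02
w_shift = rng.normal(size=(D, D)) * 0.02

print(time_as_token(h, time_emb).shape)                     # (257, 64)
print(ada_layer_norm(h, time_emb, w_scale, w_shift).shape)  # (256, 64)
```

With method (1) the sequence grows by one token and attention decides how time information is used; with method (2) the sequence length is unchanged and time acts on every token through the normalization.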

How to add a convolution block after the transformer

There are two ways to add a convolution block after the transformer. Method (1) is to add a 3 × 3 convolution block after the linear projection that maps the token embeddings to image patches (as shown in Figure 1). Method (2) is to add a 3 × 3 convolution block before this linear projection. Both are also compared with the case where the additional convolution block is removed. As shown in Figure 2(c), method (1), which adds a 3 × 3 convolution block after the linear projection, performs slightly better than the other two alternatives.
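Method (1) can be sketched as follows: project tokens to patch pixels, reassemble the image, then apply a 3 × 3 convolution. This is a simplified NumPy illustration; the helper names, the zero padding, and the random weights are assumptions, and the convolution is written as the cross-correlation that deep learning frameworks typically use.

```python
import numpy as np

def unpatchify(patches, channels, height, width, patch_size):
    """Reassemble flattened patches (L, C * p * p) into an image (C, H, W)."""
    p = patch_size
    x = patches.reshape(height // p, width // p, channels, p, p)
    x = x.transpose(2, 0, 3, 1, 4)  # (C, H/p, p, W/p, p)
    return x.reshape(channels, height, width)

def conv3x3(x, kernel, bias):
    """3x3 convolution with stride 1 and zero padding 1; kernel is (C_out, C_in, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((kernel.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + h, dx:dx + w]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", kernel[:, :, dy, dx], window)
    return out + bias[:, None, None]

rng = np.random.default_rng(0)
D, p, C, H, W = 64, 2, 3, 32, 32
tokens = rng.normal(size=((H // p) * (W // p), D))   # patch tokens from the last block
w_out = rng.normal(size=(D, C * p * p)) * 0.02       # linear projection to patch pixels
kernel = rng.normal(size=(C, C, 3, 3)) * 0.02

image = unpatchify(tokens @ w_out, C, H, W, p)       # linear projection, then reassembly
refined = conv3x3(image, kernel, np.zeros(C))        # convolution applied last
print(refined.shape)  # (3, 32, 32)
```

Because the convolution runs on the reassembled image, its receptive field crosses patch boundaries, which is one plausible reason it helps against patch-level artifacts.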

Patch Embedding Method

Traditional patch embedding is a linear projection that maps a patch to a token embedding (as shown in Figure 1). Instead of this method, we also considered a method of mapping an image to a token embedding using a stack of 3 × 3 convolutional blocks followed by a 1 × 1 convolutional block. However, as shown in Figure 2(d), traditional patch embedding performs better, so we use this method for the final model.

Method of Position Embedding

In this paper, we use the 1D learnable position embedding proposed in the original ViT. Although a two-dimensional sinusoidal position embedding is an alternative, the one-dimensional learnable position embedding performs better, as shown in Figure 2(e). We also tried not using position embedding, but the model failed to produce crisp images, which indicates that position information is important for image generation.
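The two alternatives can be contrasted with a short NumPy sketch. The sinusoidal construction below is a common formulation (half of the channels encode the row index, half the column index) and is an assumption, since the exact variant is not spelled out here; the learnable embedding is simply a randomly initialized table that training would update. All names are hypothetical.

```python
import numpy as np

def sincos_1d(positions, dim):
    """Sinusoidal encoding of scalar positions, shape (len(positions), dim)."""
    assert dim % 2 == 0
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) / (dim // 2)))
    angles = positions[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(grid_h, grid_w, dim):
    """Fixed 2D embedding: half of the channels encode the row, half the column."""
    assert dim % 4 == 0
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    emb_rows = sincos_1d(rows.ravel().astype(float), dim // 2)
    emb_cols = sincos_1d(cols.ravel().astype(float), dim // 2)
    return np.concatenate([emb_rows, emb_cols], axis=1)

def learnable_1d(num_tokens, dim, rng):
    """1D learnable embedding: one free vector per token index (initial values only)."""
    return rng.normal(scale=0.02, size=(num_tokens, dim))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 64))           # 16 x 16 patch tokens
pos_fixed = sincos_2d(16, 16, 64)
pos_learned = learnable_1d(256, 64, rng)
print((tokens + pos_fixed).shape, (tokens + pos_learned).shape)  # (256, 64) (256, 64)
```

The learnable table has no built-in notion of the 2D grid, so any spatial structure it ends up encoding has to be learned from data.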

Impact of network depth, width, and patch size

Figure 3: Effects of network depth, width, and patch size

We now investigate the scaling properties of U-ViT on CIFAR10 by examining the effects of the number of layers, width, and patch size. As shown in Figure 3, increasing the number of layers from 9 to 13 improves performance, but a deeper model with 17 layers shows no further gain. Similarly, increasing the width improves performance, but beyond a certain width no further gain is seen.

Smaller patch sizes improve performance, but performance degrades below a certain size. Small patch sizes are considered suitable for low-level noise prediction tasks in diffusion models. On the other hand, the use of small patch sizes is costly for high-resolution images, so the images must first be converted to low-dimensional latent representations, which are then modeled by U-ViT. More details are given in the Experiments section.
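A quick calculation shows why small patches are expensive at high resolution. The sketch below counts patch tokens and the number of token pairs that self-attention compares; the patch size of 2 is an illustrative assumption, and the 32 × 32 and 64 × 64 latent sizes are the ones reported in the Experiments section.

```python
def num_patch_tokens(height, width, patch_size):
    """Number of tokens when an input is split into non-overlapping square patches."""
    return (height // patch_size) * (width // patch_size)

def attention_pairs(num_tokens):
    """Self-attention compares every token with every token."""
    return num_tokens * num_tokens

cases = {
    "CIFAR10 pixels, 32x32": (32, 32),
    "ImageNet pixels, 256x256": (256, 256),
    "latent for 256x256, 32x32": (32, 32),
    "latent for 512x512, 64x64": (64, 64),
}
for name, (h, w) in cases.items():
    n = num_patch_tokens(h, w, patch_size=2)
    print(f"{name}: {n} tokens, {attention_pairs(n)} attention pairs")
# CIFAR10 pixels, 32x32: 256 tokens, 65536 attention pairs
# ImageNet pixels, 256x256: 16384 tokens, 268435456 attention pairs
# latent for 256x256, 32x32: 256 tokens, 65536 attention pairs
# latent for 512x512, 64x64: 1024 tokens, 1048576 attention pairs
```

Under these assumptions, working in the latent space brings a 256 × 256 image back to the same sequence length as a CIFAR10 image.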


Experiments

Datasets and Settings

The effectiveness of U-ViT is tested on three tasks: unconditional image generation, class-conditional image generation, and text-to-image generation.

Experiments for unconditional image generation are performed on CIFAR10 (50,000 images) and CelebA 64×64 (162,770 images). For class-conditional image generation, experiments are conducted on ImageNet, which contains 1,281,167 training images from 1,000 different classes, at 64×64, 256×256, and 512×512 resolutions. For text-to-image training, we use MS-COCO (82,783 training images and 40,504 validation images).

For the generation of 256 × 256 and 512 × 512 high-resolution images, a pre-trained image autoencoder provided by latent diffusion models (LDM) [Rombach, 2022] is used to convert the images into 32 × 32 and 64 × 64 resolution latent representations, respectively. U-ViT is then used to model these latent representations.

For text-to-image generation on MS-COCO, discrete text is converted into a sequence of embeddings using the CLIP text encoder, and these embeddings are fed into U-ViT as a sequence of tokens.

Unconditional and class-conditional image generation

Table 1. Unconditional and class-conditional image generation results

Here we compare U-ViT to previous U-Net-based diffusion models and to GenViT, a smaller ViT that does not have long skip connections and incorporates time before the normalization layer. FID scores were used to measure image quality.
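For readers unfamiliar with FID, it is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images, and lower values are better. The NumPy sketch below shows only the distance computation. In practice the features come from a pre-trained Inception network, which is replaced here by random placeholder vectors, and the function names are hypothetical.

```python
import numpy as np

def feature_stats(features):
    """Mean and covariance of feature vectors, one row per image."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    # For positive semi-definite covariances the eigenvalues of sigma1 @ sigma2
    # are real and non-negative, so the trace of the matrix square root is the
    # sum of their square roots; clipping guards against rounding noise.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(2000, 8))            # placeholder "real" features
fake_feats = rng.normal(loc=0.3, size=(2000, 8))   # placeholder "generated" features

mu_r, sigma_r = feature_stats(real_feats)
mu_f, sigma_f = feature_stats(fake_feats)
print(frechet_distance(mu_r, sigma_r, mu_r, sigma_r))  # approximately 0 for identical statistics
print(frechet_distance(mu_r, sigma_r, mu_f, sigma_f))  # larger when the distributions differ
```

Because the score depends on the feature extractor and the number of samples, FID values are only comparable when computed under the same protocol.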

As shown in Table 1, U-ViT performed comparably to U-Net and better than GenViT on unconditional CIFAR10 and CelebA 64×64. For class-conditional ImageNet 64×64, we first tried the U-ViT-M configuration with 131M parameters. As shown in Table 1, this yielded an FID of 5.85, better than the 6.92 of IDDPM, which uses a U-Net with 100M parameters. For further improvement, a U-ViT-L configuration with 287M parameters was employed, which improved the FID from 5.85 to 4.26.

For class-conditional ImageNet 256×256, U-ViT achieved the best FID of 2.29, outperforming previous diffusion models. Table 2 shows that U-ViT outperforms LDM at various numbers of sampling steps using the same sampler. U-ViT also outperformed VQ-Diffusion, a discrete diffusion model with a transformer backbone. Similarly, U-ViT outperforms U-Net with a similar number of parameters and computational cost.

For class-conditional ImageNet 512×512, U-ViT outperformed ADM-G, which directly models the pixels of the image. Figure 4 shows selected samples for ImageNet 256×256 and 512×512, as well as random samples for other datasets, confirming that the generated images are clear and of high quality.

Table 2. FID results for different numbers of sampling steps on ImageNet 256×256
Figure 4. Examples of generated images

Image generation from text with MS-COCO

Here, we evaluate U-ViT on a text-to-image generation task using the MS-COCO dataset. We also train another latent diffusion model employing U-Net with the same model size as U-ViT and compare it to U-ViT.

We use FID scores to measure image quality: we randomly select 30K prompts from the MS-COCO validation set and generate samples with these prompts to compute FID. As shown in Table 3, U-ViT achieves a state-of-the-art FID without access to large external datasets during training of the generative model. By increasing the number of layers from 13 to 17, U-ViT-S (Deep) achieves an even better FID.

Figure 5 shows samples generated from U-Net and U-ViT using the same random seed for qualitative comparison; U-ViT produces higher-quality samples, and the image content matches the text better.

For example, given the text "a baseball player swinging a bat at a ball," U-Net generates neither the bat nor the ball, whereas U-ViT generates the ball and U-ViT-S (Deep) also generates the bat. This is likely because text and image interact at every layer in U-ViT, more frequently than in U-Net.

Table 3. Experimental results on MS-COCO
Figure 5. Examples of text-to-image generation


This article presented U-ViT, a simple and versatile ViT-based architecture for image generation using diffusion models. U-ViT treats all inputs (time, condition, noisy image patches) as tokens and employs long skip connections between shallow and deep layers. U-ViT was evaluated on unconditional image generation, class-conditional image generation, and text-to-image generation.

U-ViT performs as well as or better than CNN-based U-Nets of similar size. These results suggest that long skip connections are important for diffusion-based image modeling and that the downsampling and upsampling operators of CNN-based U-Nets are not always required.

U-ViT could inform future research on backbones for diffusion models and benefit generative modeling on large datasets with diverse modalities.
