GenTron: Diffusion Transformers For Image And Video Generation
3 main points
✔️ While transformers are widely used in many fields, diffusion models, the strongest models for image generation, still rely mainly on CNN-based U-Nets
✔️ Proposed GenTron, a transformer-based diffusion model
✔️ In addition to general metrics, it outperforms the state-of-the-art diffusion model SDXL in human evaluation
GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
written by Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua
(Submitted on 7 Dec 2023)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Diffusion models have made remarkable progress in a wide variety of content-creation areas, including image generation, video production, audio synthesis, and code generation. However, these fields typically rely on convolutional U-Net architectures. It is therefore expected that even higher-quality image and video generation can be achieved by applying transformers, which dominate natural language processing and visual recognition.
In this commentary, we introduce GenTron, a diffusion model that utilizes transformers. The main approach is to improve Diffusion Transformers (DiTs). First, the conditioning is extended from class labels to text for text-conditional image generation. We also take advantage of the scalability of the transformer architecture to significantly scale up GenTron and improve visual quality. In addition, GenTron evolves from an image generation model into a video generation model: a temporal self-attention layer is added to each transformer block, yielding a transformer for video diffusion models. New motion-free guidance is also proposed to improve video quality.
In the experiments, in addition to common metrics, GenTron outperformed SDXL, the state-of-the-art diffusion model, in human evaluation, achieving a 51.1% win rate in visual quality (19.8% draw rate) and a 42.3% win rate in text alignment (42.9% draw rate).
Proposed Method
Image generation from text
Image generation from text (T2I) involves two important elements: first, the choice of a text encoder to convert raw text into text embeddings, and second, how these embeddings are integrated into the diffusion process.
With respect to text encoders, representative choices include the text tower of the multimodal model CLIP and the large language model Flan-T5. To test the effectiveness of these language models, this paper integrates each model independently into GenTron and evaluates its performance.
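As a concrete reference, the snippet below sketches how the two kinds of text embeddings compared here can be obtained with the Hugging Face transformers library. The checkpoint names and the use of this library are illustrative assumptions, not GenTron's published pipeline, and a smaller Flan-T5 variant is used purely to keep the example lightweight.

```python
# Minimal sketch: obtaining CLIP and Flan-T5 text embeddings with Hugging Face
# transformers. Checkpoint names are illustrative assumptions, not GenTron's code.
from transformers import CLIPTokenizer, CLIPTextModel, T5Tokenizer, T5EncoderModel

prompt = "a photo of a panda"

# CLIP text tower (CLIP-L)
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
clip_emb = clip_enc(**clip_tok(prompt, return_tensors="pt")).last_hidden_state  # (1, L, 768)

# Flan-T5 encoder (the paper uses T5-XXL; a small variant keeps this example light)
t5_tok = T5Tokenizer.from_pretrained("google/flan-t5-small")
t5_enc = T5EncoderModel.from_pretrained("google/flan-t5-small")
t5_emb = t5_enc(**t5_tok(prompt, return_tensors="pt")).last_hidden_state
```

Both encoders return a sequence of token embeddings, which is what the conditioning mechanisms below consume.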
With respect to integrating the text embeddings into the diffusion process, the two methods shown in Figure 1 are considered. The first is adaptive layer norm (adaLN). As shown in Figure 1a, this approach integrates the conditional embedding as normalization parameters of the feature channels; adaLN is widely used in conditional generative models such as StyleGAN.
The second technique is cross-attention. As shown in Figure 1b, the image features serve as the query and the text embedding serves as the key and value. This setup allows direct interaction between image features and the text embedding through the attention mechanism.
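To make the two conditioning routes concrete, here is a minimal PyTorch sketch of how a pooled text embedding can drive adaLN and how token-level text embeddings can drive cross-attention. Module names and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """Conditioning via adaptive layer norm: a pooled text embedding is mapped
    to per-channel scale and shift parameters applied after normalization."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):          # x: (B, N, dim), cond: (B, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class CrossAttnBlock(nn.Module):
    """Conditioning via cross-attention: image tokens are queries,
    text tokens are keys and values."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, text_tokens):   # x: (B, N, dim), text_tokens: (B, L, dim)
        out, _ = self.attn(query=x, key=text_tokens, value=text_tokens)
        return x + out
```

The key structural difference is that adaLN compresses the condition into per-channel scale and shift values, while cross-attention keeps per-token text information accessible to every image token.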
Scaling up the model
With respect to model scale-up, we focus on extending three key aspects: the number of transformer blocks (depth), the dimension of the patch embeddings (width), and the hidden dimension of the MLP (MLP width). The specifications and structure of the GenTron models are detailed in Table 1. In particular, the GenTron-G/2 model has over 3 billion parameters, making it the largest transformer-based diffusion model developed to date.
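As a rough illustration of how these three knobs drive parameter count, the helper below estimates the parameters of a stack of transformer blocks. It ignores embedding, conditioning, and cross-attention parameters, and the example numbers are not taken from Table 1, so this is only a back-of-the-envelope sketch.

```python
def approx_block_stack_params(depth: int, width: int, mlp_width: int) -> int:
    """Back-of-the-envelope parameter count for `depth` transformer blocks:
    ~4 * width^2 for the q/k/v/output projections of self-attention and
    ~2 * width * mlp_width for the two MLP linear layers (biases ignored)."""
    per_block = 4 * width * width + 2 * width * mlp_width
    return depth * per_block

# Example with illustrative (not GenTron's) settings: deeper and wider blocks
# quickly push the total into the billions of parameters.
print(approx_block_stack_params(depth=48, width=2048, mlp_width=8192))  # ~2.4e9
```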
Video generation from text
TempSelfAttn
For the video generation task, the model consists of the transformer blocks shown in Figure 2. Unlike the conventional approach, which adds both a temporal convolution layer and a temporal transformer block to the U-Net, this method integrates only a lightweight temporal self-attention (TempSelfAttn) layer into each transformer block. As shown in Figure 2, the TempSelfAttn layer is placed immediately after the cross-attention layer and before the MLP layer. The output of the cross-attention layer is reshaped before entering the TempSelfAttn layer and restored to its original shape after passing through it.
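The reshaping around TempSelfAttn can be sketched as follows. This is a simplified PyTorch/einops rendition based on the description above; the layer names, the use of nn.MultiheadAttention, and the exact tensor layout are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
from einops import rearrange

class TempSelfAttn(nn.Module):
    """Lightweight temporal self-attention inserted between the cross-attention
    layer and the MLP of each transformer block (shapes are illustrative)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, num_frames):
        # x: (B*T, N, C) -- spatial tokens of every frame, stacked along the batch
        x = rearrange(x, "(b t) n c -> (b n) t c", t=num_frames)
        h = self.norm(x)
        h, _ = self.attn(h, h, h)      # attention mixes information across frames only
        x = x + h                       # residual connection over the temporal axis
        # restore the original (B*T, N, C) layout before the MLP
        x = rearrange(x, "(b n) t c -> (b t) n c", t=num_frames)
        return x
```

Folding the spatial tokens into the batch dimension means the attention only mixes information across the frames at each spatial location, which is what keeps the layer lightweight.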
Motion-free guidance
The challenge is that focusing on optimizing temporal aspects while learning to generate video inadvertently compromises spatial visual quality, which in turn degrades the overall quality of the generated video. To address this, motion-free guidance is proposed. Similar to classifier-free guidance, this approach replaces the conditioning text with an empty string; the difference is that it additionally uses an identity matrix to disable temporal attention with probability p.
This identity matrix, shown in Figure 2 (Motion-Free Mask), has its diagonal filled with 1s and all other positions set to 0. This configuration restricts temporal self-attention to operating within a single frame. Furthermore, since temporal self-attention is the sole operator of temporal modeling, temporal modeling in the video diffusion process can be disabled simply by applying the motion-free attention mask.
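A minimal sketch of such a motion-free mask, assuming an additive attention-mask convention as in PyTorch's nn.MultiheadAttention, is shown below; how GenTron wires this mask into its layers is not reproduced here.

```python
import torch

def motion_free_mask(num_frames: int) -> torch.Tensor:
    """Identity-style attention mask over frames: each frame may only attend to
    itself, so temporal self-attention degenerates to a per-frame operation."""
    eye = torch.eye(num_frames, dtype=torch.bool)
    # With an additive mask, off-diagonal entries are set to -inf so that
    # cross-frame attention weights become zero after the softmax.
    mask = torch.zeros(num_frames, num_frames)
    mask.masked_fill_(~eye, float("-inf"))
    return mask

# Usage sketch: during training, with probability p, pass this mask to the
# temporal self-attention (and replace the text condition with an empty string),
# turning the video step into a motion-free, image-like step.
attn_mask = motion_free_mask(num_frames=8)
```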
Experiment
Verifying the effectiveness of each component
Cross attention vs. adaLN
Experiments revealed the limitations of adaLN when dealing with free-form text conditioning. This shortcoming is explicitly shown in Figure 3, where adaLN's attempts to generate a panda image are inadequate and Cross attention shows a clear advantage. This is also quantitatively verified in the first two rows of Table 2, where Cross attention consistently outperforms adaLN on all metrics evaluated.
Text Encoder Comparison
Table 2 evaluates the various text encoders on T2I-CompBench. The results show that GenTron-T5XXL performs better than GenTron-CLIP-L on three metrics and similarly on the other two. This suggests that the T5 embedding has superior compositional ability. Furthermore, combining the CLIP-L and T5XXL embeddings improves GenTron's performance, demonstrating the model's ability to take advantage of the distinct benefits of each text embedding type.
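The exact combination mechanism is not spelled out in this summary, but one common way to use two text encoders together, shown below purely as a hypothetical illustration, is to project both token sequences to the model width and concatenate them along the token axis before cross-attention.

```python
import torch
import torch.nn as nn

class ConcatTextFusion(nn.Module):
    """Hypothetical fusion of CLIP-L and T5-XXL token embeddings: project each to
    the transformer width and concatenate along the sequence (token) dimension,
    so cross-attention can attend to both sets of text tokens."""
    def __init__(self, clip_dim=768, t5_dim=4096, model_dim=1024):
        super().__init__()
        self.proj_clip = nn.Linear(clip_dim, model_dim)
        self.proj_t5 = nn.Linear(t5_dim, model_dim)

    def forward(self, clip_tokens, t5_tokens):
        # clip_tokens: (B, L1, clip_dim), t5_tokens: (B, L2, t5_dim)
        return torch.cat([self.proj_clip(clip_tokens), self.proj_t5(t5_tokens)], dim=1)
```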
Comparison with previous studies
Here, we build the final model based on the combination of cross-attention, CLIP-L, and T5XXL verified above, and compare it with previous studies.
Table 3 shows the alignment evaluation results on T2I-CompBench. The proposed method shows excellent performance in all areas, including attribute binding, object relationships, and complex compositions. This indicates an increased compositional generation ability, especially for color binding. In particular, the proposed method outperforms the previous SOTA by more than 7%.
Human Evaluation
In Figure 4, we generated 100 images with both the proposed method and SDXL using standard prompts from PartiPrompts, and asked evaluators for their preferences blindly after shuffling. A total of 3,000 responses were collected regarding visual quality and text alignment. The results show that the proposed method is the clearly superior choice.
Text to video generation results
Figure 5 shows videos generated by GenTron-T2V. They are not only visually impressive but also exhibit high temporal consistency. In particular, the proposed motion-free guidance is very effective for the consistency of the generated video. As shown in Figure 6, when GenTron-T2V is integrated with MFG, there is a marked tendency to focus on the central object mentioned in the prompt. Specifically, that object is usually rendered in more detail, is more prominent, occupies a central position, and remains the visual focal point throughout the video frames.
Summary
In this article, we introduced GenTron, a transformer-based diffusion model for image and video generation. By investigating text encoders and ways of integrating text embeddings into the diffusion process, and by proposing TempSelfAttn and motion-free guidance for video generation, GenTron outperforms the SOTA diffusion model in human evaluation as well as on general evaluation metrics. These results suggest that GenTron can help bridge the gap in applying transformers to diffusion models and facilitate their widespread adoption in a variety of domains.