
MicroDiffusion: A Thousand-dollar Generative Image Quality Model That Outperforms Multi-million-dollar Models


Image Generation

3 main points
✔️ Text-to-image diffusion models are used in many fields, but training them requires high cost and huge computational resources
✔️ MicroDiffusion uses a new masking technique and an improved Transformer architecture to enable low-budget training of diffusion models
✔️ Experiments show that MicroDiffusion achieves comparable FID and high-quality generation at 1/14th the cost of current state-of-the-art models

Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
written by Vikash Sehwag, Xianghao Kong, Jingtao Li, Michael Spranger, Lingjuan Lyu
(Submitted on 22 Jul 2024)
Comments: 41 pages, 28 figures, 5 tables

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Modern image generation models excel at creating natural, high-quality content, generating more than one billion images per year. However, training these models from scratch is extremely expensive and time-consuming. Text-to-image (T2I) diffusion models have reduced some of the computational costs, but still require significant resources.

Current state-of-the-art technology requires approximately 18,000 A100 GPU hours, and training with eight H100 GPUs takes over a month. In addition, it often relies on large or proprietary data sets, making widespread use difficult.

In this commentary paper, we aim to develop a low-cost, end-to-end text-to-image diffusion model pipeline that significantly reduces costs without large datasets. We focus on vision-transformer-based latent diffusion models, leveraging their simple design and broad applicability. To reduce computational cost, we randomly mask input tokens, which reduces the number of patches processed per image. Existing masking methods degrade performance at high masking ratios; this paper overcomes that challenge.

To overcome this performance degradation in the text-to-image diffusion model, this paper proposes a "delayed masking" strategy: patches are first processed by a lightweight patch mixer and only then masked before entering the diffusion transformer. This preserves semantic information even at high masking ratios and enables reliable training at low cost. The method also incorporates recent advances in transformer architecture to improve performance in large-scale training.

The experiments trained a 1.16-billion-parameter sparse diffusion transformer with a budget of only $1,890, using 37 million images and a masking ratio of 75%. The result was an FID of 12.7 for zero-shot generation on the COCO dataset. Training took only 2.6 days on a single 8×H100 GPU machine, a 14-fold reduction compared to the current state-of-the-art approach (37.6 days, $28,400 GPU cost).
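The headline 14-fold reduction can be sanity-checked from the figures quoted above, assuming both runs use the same 8-GPU machine:

```python
# Reported figures from the paper summary above.
micro_days, baseline_days = 2.6, 37.6
micro_cost, baseline_cost = 1_890, 28_400
gpus = 8  # both measured on 8xH100 machines

micro_gpu_hours = micro_days * 24 * gpus
print(f"MicroDiffusion: {micro_gpu_hours:.0f} GPU-hours")        # ~499 GPU-hours
print(f"Wall-clock speedup: {baseline_days / micro_days:.1f}x")  # ~14.5x
print(f"Cost ratio: {baseline_cost / micro_cost:.1f}x")          # ~15.0x
```

Both the wall-clock ratio and the dollar ratio land close to the 14x figure reported in the paper.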

Proposed Method

Delayed Masking

Since the computational complexity of a transformer scales with sequence length, one way to reduce training cost is to shorten the sequence by using larger patches, as shown in Figure 1-b. A larger patch size reduces the number of patches per image quadratically, but can significantly degrade performance because it aggressively compresses large regions of the image into a single patch.
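The quadratic effect of patch size on sequence length is quick to verify with arithmetic (the 256-pixel grid below is illustrative, not a setting from the paper):

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches (tokens) for a square image."""
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

# Doubling the patch size cuts the token count by 4x (quadratically).
print(num_patches(256, 2))  # 16384 tokens
print(num_patches(256, 4))  # 4096 tokens
print(num_patches(256, 8))  # 1024 tokens
```

Hence the appeal of large patches for cost, and the risk: each token must then summarize a much larger image region.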

Alternatively, the patch size can be kept fixed while a large number of patches is removed at the transformer's input layer via masking, as shown in Figure 1-c. This resembles random-crop training in convolutional UNets, but patch masking allows training on non-contiguous regions of the image. The technique is widely used in both the vision and language domains.

To also encourage representation learning from masked patches, MaskDiT (Figure 1-d) adds an auxiliary autoencoding loss that encourages reconstruction of the masked patches. This technique masks 75% of the input image, yielding a significant reduction in computational cost.

Figure 1: Compressing patch sequences to reduce computational cost

However, high masking ratios significantly degrade the overall performance of the transformer; even MaskDiT shows only marginal improvement over simple masking, because with this approach too, the majority of image patches are removed at the input layer.

In this paper, we introduce a pre-processing module called a "patch mixer" that processes patch embeddings before masking. This allows the unmasked patches to retain information about the entire image, improving learning effectiveness. The approach has the potential to improve performance while remaining computationally equivalent to the existing MaskDiT strategy.
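The key ordering change — mix all tokens first, mask afterwards, and only then run the expensive backbone — can be sketched in a few lines. The toy `mix` and `net` functions below are hypothetical stand-ins; in the paper both are transformers:

```python
import random

def delayed_masking_forward(tokens, mask_ratio, patch_mixer, backbone):
    """Sketch of delayed masking: tokens pass through a lightweight mixer
    *before* masking, so surviving tokens carry global image information."""
    mixed = patch_mixer(tokens)  # every token attends to the full image here
    keep = max(1, int(len(mixed) * (1 - mask_ratio)))
    kept_idx = sorted(random.sample(range(len(mixed)), keep))
    unmasked = [mixed[i] for i in kept_idx]  # drop masked tokens entirely
    return backbone(unmasked), kept_idx     # backbone sees a short sequence

# Toy stand-ins (hypothetical; real models are transformers):
mix = lambda toks: [sum(toks) / len(toks) + t for t in toks]  # crude global mixing
net = lambda toks: toks                                       # identity "backbone"

out, idx = delayed_masking_forward(list(range(16)), 0.75, mix, net)
print(len(out))  # 4 — only 25% of tokens reach the expensive backbone
```

With naive masking the mixer step is absent, so the kept tokens never see the discarded regions; with delayed masking they already encode them.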

Patch Mixer and Learning Loss

A patch mixer refers to any neural architecture that can fuse individual patch embeddings. In a transformer, this is naturally achieved by a combination of attention and feedforward layers, so this paper uses a lightweight transformer (only a few layers) as the patch mixer. After processing by the patch mixer, the input sequence tokens are masked (Figure 2e). Assuming a binary mask m, we train the model with the following loss function
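The equation itself did not survive extraction. Under the setup just described — patch mixer M_φ applied before masking, diffusion backbone ε_θ after it — the objective is essentially the standard denoising loss computed only on the unmasked tokens; the notation below is a reconstruction, not a verbatim copy from the paper:

```latex
\mathcal{L}(x, c) \;=\;
\mathbb{E}_{\epsilon,\, t,\, m}\,
\Big\lVert \Big( \epsilon_{\theta}\big( m \odot M_{\phi}(x_t,\, c,\, t),\, c,\, t \big) - \epsilon \Big) \odot m \Big\rVert_2^2
```

Here x_t is the noised latent at timestep t, c the text conditioning, ε the injected Gaussian noise, and ⊙ denotes elementwise application of the binary keep-mask m, so the reconstruction error is measured only on tokens that survive masking.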

Transformer Architecture with Mixture-of-Experts (MoE) and Layer-wise Scaling

This paper incorporates innovations in advanced transformer architecture to improve model performance under computational constraints.

  • Mixture-of-experts (MoE, Zhou et al., 2022): MoE layers expand the model's parameters and expressiveness while avoiding a significant increase in training cost. A simplified MoE layer with expert-choice routing balances the load across experts without requiring an additional auxiliary loss function.
  • Layer-wise scaling (Mehta et al., 2024): an approach shown to improve performance in large language models, in which the width of each transformer block (the hidden-layer dimension) increases linearly with depth. Deeper layers are assigned more parameters and learn more complex features.
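The layer-wise scaling schedule can be sketched as a simple width table; the base width, depth, and 2x maximum scale below are illustrative, not values from the paper:

```python
def layerwise_widths(base_width: int, depth: int, max_scale: float = 2.0):
    """Layer-wise scaling sketch: hidden width grows linearly with depth,
    so deeper blocks get more parameters (scale factors are illustrative)."""
    widths = []
    for i in range(depth):
        scale = 1.0 + (max_scale - 1.0) * i / max(depth - 1, 1)
        # round to a multiple of 64 to keep widths hardware-friendly
        widths.append(int(round(base_width * scale / 64)) * 64)
    return widths

print(layerwise_widths(512, 6))  # widths increase from 512 toward 1024
```

Compared with a constant-width stack of the same total parameter count, this shifts capacity toward the deeper blocks, which the experiments below find more effective under masked training.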

The overall architecture is shown in Figure 2.

Figure 2: Overview of the entire proposed method

Experiment

Verification of The Effects of Delayed Masking and Patch Mixer

Performance can degrade when many patches are masked; Zheng et al. (2024) noted that MaskDiT's performance drops significantly when masking ratios exceed 50%. This paper evaluated performance at masking ratios up to 87.5% and compared it to a conventional naive masking method that does not use a patch mixer. The "delayed masking" in this paper uses a 4-layer transformer block as the patch mixer, amounting to less than 10% of the backbone transformer's parameters. Both used the AdamW optimizer with identical settings.

The results are summarized in Figure 3. Delayed masking significantly outperformed naive masking and MaskDiT on all metrics, and the performance gap widens as the masking ratio increases. For example, at a masking ratio of 75%, naive masking achieved an FID score of 80 and MaskDiT 16.5, while the proposed approach achieved 5.03, which compares favorably with the unmasked FID score of 3.79.

Figure 3: Verification of the effects of delay masking and patch mixer

Mixture-of-Experts and Layer-Wise Scaling Effect Verification

Layer-wise scaling: experiments using the DiT-Tiny architecture compared layer-wise scaling against constant-width transformers under naive masking. Both models were trained for the same time with the same computational budget. The layer-wise scaling approach consistently outperformed the constant-width model on all performance measures, proving more effective for masked training.

Mixture-of-Experts (MoE): a DiT-Tiny/2 transformer with MoE layers in alternating blocks was tested. Overall performance was similar to the baseline without MoE layers, with a slight improvement in CLIP score (from 28.11 to 28.66) and a slight deterioration in FID (from 6.92 to 6.98). The limited improvement is attributed to the short 60K-step training run and the small number of samples seen by each expert.

Comparison with Previous Studies

Zero-shot image generation on the COCO dataset (Table 1): 30,000 images were generated from captions and their distribution was compared with real images using FID-30K. The proposed method achieved an FID-30K score of 12.66 at 14 times lower computational cost than prior low-cost training methods, without relying on proprietary datasets. It also outperformed Würstchen (Pernias et al., 2024) at 19 times lower computational cost.

Table 1: Zero-shot image generation on the COCO dataset

Detailed image generation comparison (Table 2): GenEval (Ghosh et al., 2024) was used to evaluate the ability to generate correct object position, co-occurrence, count, and color. The proposed method showed near-perfect accuracy in single-object generation, comparable to Stable Diffusion variants and outperforming Stable-Diffusion-1.5. It also showed superior color-attribution performance compared to Stable-Diffusion-XL-turbo and the PixArt-α model.

Table 2: Detailed image generation comparison

Summary

In this commentary paper, we focus on patch masking strategies aimed at reducing computational costs in training diffusion transformers. A "delayed masking" strategy is proposed to mitigate the shortcomings of existing masking approaches, and significant performance improvements are demonstrated for all masking ratios.

In particular, large-scale training was performed at a 75% delayed-masking ratio using a combination of real and synthetic image datasets. Despite the significantly lower cost compared to state-of-the-art techniques, zero-shot image generation achieved competitive results. It is hoped that this low-cost training mechanism will encourage more researchers to participate in the training and development of large-scale diffusion models.

