Latent Diffusion Models Do Not Necessarily Get Better as They Get Bigger

Diffusion Model 10/07/2024

3 main points
✔️ Smaller models perform better than larger models under the same inference cost
✔️ Similar phenomena regardless of sampler type, downstream task, or distillation
✔️ Important to consider trade-off between sampling cost and model size during inference

Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
written by Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, Peyman Milanfar
(Submitted on 1 Apr 2024)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Bigger latent diffusion models are not necessarily better

The paper is an empirical investigation of the scaling properties of latent diffusion models (LDMs), with special attention to sampling efficiency.

The key points of this study are as follows.

Issue 1: Lack of research on the scaling properties and sampling efficiency of LDMs
Issue 2: Training LDMs requires enormous computational resources, which makes it hard to train models of many different sizes
Issue 3: The relationship between model size and sampling efficiency is unclear
Solution: Train LDMs with varying numbers of parameters and investigate the relationship between model size and sampling efficiency
Key finding: At small inference costs, smaller models tend to perform better than larger ones

In short, the authors point out that when scaling up LDMs, it is important not only to increase the model size but also to consider the trade-off with sampling cost during inference.

In particular, when sampling cost (computational cost) is constrained, smaller models may sample more efficiently.

Background on Sampling Efficiency in Latent Diffusion Models

Latent diffusion models (LDMs) have shown excellent performance in a variety of tasks, such as image and video generation. However, they face a practical challenge: low sampling efficiency.

To improve the sampling efficiency of LDMs, the following approaches have mainly been proposed.

Developing faster network architectures
Reducing the number of sampling steps through improved sampling algorithms
Reducing the number of sampling steps through distillation techniques

However, no studies have investigated sampling efficiency in terms of model size. One reason is that building a high-quality image generation model from scratch is enormously time-consuming and costly.

In other words, it was too resource-intensive to build models of various sizes in the first place.

Details of this study's validation

The authors used the 866M-parameter "Stable Diffusion v1.5" as a baseline and trained text-to-image latent diffusion models (LDMs) with parameter counts ranging from 39M to 5B.

Table 1 below lists the Stable Diffusion variants that were trained to compare different model sizes.

All of these models are scaled by changing the number of filters in the residual blocks; the other architectural elements are kept the same.
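To get a feel for why widening the filters scales the parameter count so quickly, here is a rough sketch. It assumes a simplified residual block made of two 3x3 convolutions with equal input and output channels; the real Stable Diffusion blocks also contain normalization, time-embedding, and attention layers, so the function names and numbers below are purely illustrative.

```python
def conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights plus biases of a single 2D convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


def res_block_params(channels: int) -> int:
    """Simplified residual block (an assumption): two 3x3 convs of equal width."""
    return 2 * conv_params(channels, channels)


for channels in (64, 128, 256):
    print(channels, res_block_params(channels))
```

Because the weight count depends on the product of input and output channels, doubling the width roughly quadruples the parameters of such a block.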

The following shows the architectural differences for each model size when scaling.

Each model was trained for 500K steps with a batch size of 2048 and a learning rate of 1e-4. At inference time, the sampler is DDIM with 50 steps and a guidance scale of 7.5.
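The settings above can be written down as a small configuration sketch. The class and field names are hypothetical (they do not come from the paper or any library); only the values are taken from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    steps: int = 500_000          # training steps per model
    batch_size: int = 2048
    learning_rate: float = 1e-4


@dataclass(frozen=True)
class SamplingConfig:
    sampler: str = "DDIM"
    steps: int = 50
    guidance_scale: float = 7.5


train = TrainConfig()
# Total number of training samples processed by each model
samples_seen = train.steps * train.batch_size
print(samples_seen)  # 1024000000
```

So every model size processes roughly one billion training samples under this setup.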

Images generated by each model are shown below.

The figure above shows that image generation quality improves as the model size increases.

Under these conditions, the following six experiments are reported to evaluate the scaled models.

Relationship between training compute and LDM performance
Performance in downstream tasks using pre-trained LDMs
Relationship between sampling cost (number of steps) and LDM performance
Relationship between sampler type and efficiency (LDM performance)
Relationship between sampling cost and LDM performance in downstream tasks
Performance comparison between distilled and undistilled models

Text-to-image performance is evaluated on 30k samples from the COCO 2014 validation set.

The DIV2K validation set is used to evaluate performance on downstream tasks.

Relationship between training compute and LDM performance

The relationship between the compute used for training and model performance is as follows.

For FID (left), lower is better; for the CLIP score (right), higher is better.
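For reference, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols below (mean and covariance $\mu_r, \Sigma_r$ for real-image features, $\mu_g, \Sigma_g$ for generated-image features) are notation introduced here, not taken from the article.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

The value is 0 when the two fitted Gaussians coincide, which is why a smaller FID indicates generated images that are statistically closer to real ones.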

The results show that when training compute is below 1G, additional compute improves the performance of all models.

Beyond a certain point, however, the improvement levels off.

Performance in downstream tasks with pre-trained LDMs

Here, pre-trained LDMs are used to verify scaling properties in two downstream tasks: super-resolution (upscaling images) and DreamBooth (subject-driven image generation).

Specifically, each LDM was fine-tuned for each of these two tasks, and the resulting performance was compared.

The pre-trained models used here are the same as those in Table 1 above.

The performance results for the super-resolution task are as follows.

For FID (left side of the figure above), performance improves with model size, regardless of the amount of training compute. In other words, the larger the pre-trained model, the better the performance on the super-resolution task.

In contrast, LPIPS (right side of the figure above) improves with the amount of training compute, regardless of model size.

Next, let's look at the generated images.

As one can see, increasing the model size also improves the visual results.

Finally, let's look at the results for the DreamBooth image generation task.

As expected, performance increases with model size.

These results show that performance on downstream tasks using pre-trained LDMs scales with the performance (number of parameters) of the pre-trained model.

Relationship between sampling cost (number of steps) and LDM performance

This section examines the question: does increasing the sampling cost (number of steps) improve LDM performance regardless of model size?

To this end, the authors first run experiments to determine the optimal guidance scale for each model size and number of sampling steps.

For example, the figure below shows how generated images change as the guidance scale is varied from 1.5 to 8.0 at equal intervals (top: the 145M-parameter LDM; bottom: the 866M-parameter LDM; both with 50 steps).

This shows that the optimal guidance scale varies from model to model.

The FID score is also used to determine the optimal guidance scale quantitatively. The figure below shows the relationship between guidance scale and text-to-image performance.

Looking at the left panel (145M-parameter LDM) and the center panel (558M-parameter LDM), one can see that the optimal guidance scale changes as the number of sampling steps increases.

The right panel then gives the optimal guidance scale for each number of sampling steps in each model.
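As a reminder of what the guidance scale controls, here is a minimal sketch of the classifier-free guidance combination as it is commonly implemented for text-to-image diffusion models. The article does not spell out this formula, and the number of grid points (14) is an assumption chosen so that the sweep from 1.5 to 8.0 uses steps of 0.5.

```python
def guidance_scales(start: float = 1.5, stop: float = 8.0, count: int = 14) -> list:
    """Equally spaced guidance scales, endpoints included (count is an assumption)."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]


def apply_cfg(eps_uncond: list, eps_cond: list, scale: float) -> list:
    """Move the noise prediction from the unconditional one toward the text-conditioned one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


scales = guidance_scales()
# Toy two-element "noise predictions" standing in for real network outputs
guided = apply_cfg([0.0, 1.0], [1.0, 3.0], scale=2.0)
print(scales[0], scales[-1], guided)
```

A scale of 1 reproduces the conditional prediction, and larger scales extrapolate further past it, trading diversity for prompt adherence. That trade-off is why the best scale has to be searched for each model and step count.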

Next, we compare the performance of each LDM against the sampling cost (normalized cost × sampling steps) using the optimal guidance scale determined above.
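The sampling cost defined above (normalized cost × sampling steps) can be sketched as follows. The relative per-step costs are hypothetical placeholders, not values from the paper; the only assumption carried over is that the baseline model has a normalized cost of 1.0.

```python
import math


def sampling_cost(normalized_cost: float, steps: int) -> float:
    """Sampling cost = per-step cost relative to the baseline model x number of steps."""
    return normalized_cost * steps


def affordable_steps(budget: float, normalized_cost: float) -> int:
    """Largest number of sampling steps that fits within a sampling-cost budget."""
    return math.floor(budget / normalized_cost)


# Hypothetical relative per-step costs (illustrative only)
RELATIVE_COST = {"small": 0.25, "baseline": 1.0}

for name, cost in RELATIVE_COST.items():
    print(name, affordable_steps(6, cost))
```

Under the same budget, a cheaper model can take several times as many denoising steps as the baseline, which is the mechanism behind the comparison that follows.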

The results show that at small sampling costs, the smaller models often have better FID scores than the larger models.

As an example, compare the cases of sampling cost = 6 and sampling cost = 12 in the right-hand figure.

Sampling cost	Model parameters	FID (lower is better)
6	145M	approx. 19
6	866M	approx. 26
12	145M	approx. 17
12	866M	approx. 20

The table above shows that under conditions of small sampling cost (inference cost), the smaller models achieve higher performance.
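The comparison in the table can be expressed as a small lookup that picks the model with the lowest FID at each budget. The FID values are the approximate figures from the table; the function name is just for illustration.

```python
# Approximate FID per (sampling cost, model size), taken from the table (lower is better)
FID = {
    (6, "145M"): 19,
    (6, "866M"): 26,
    (12, "145M"): 17,
    (12, "866M"): 20,
}


def best_model(budget: int) -> str:
    """Return the model size with the lowest FID at the given sampling cost."""
    candidates = {model: fid for (cost, model), fid in FID.items() if cost == budget}
    return min(candidates, key=candidates.get)


for budget in (6, 12):
    print(budget, best_model(budget))
```

At both budgets the 145M model wins, although its margin shrinks from about 7 FID points at cost 6 to about 3 at cost 12.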

The same is roughly true when comparing other model sizes.

The figure below also shows this visually.

Relationship between sampler type and efficiency (LDM performance)

Here, LDM performance with the DDPM and DPM-Solver++ samplers is compared against the DDIM sampler, to confirm that the scaling behavior of LDMs appears consistently regardless of sampler type.

The results are as follows.

In the left panel, the solid line is DDPM and the dashed line is DDIM; in the right panel, the solid line is DPM-Solver++ and the dashed line is DDIM.

In terms of performance, the samplers rank DDPM < DDIM < DPM-Solver++.

What all samplers have in common is that, at the same sampling cost, the smaller model performs better than the larger model.

This is evident by comparing the performance of each LDM at the same sampling cost.

Relationship between sampling cost and LDM performance in downstream tasks

Here, the sampling efficiency of LDMs is examined in downstream tasks, specifically the super-resolution (SR) task.

The results are as follows.

The results show that when the number of sampling steps is less than 20 (left panel), the smaller models tend to perform better than the larger models under the same sampling cost.

On the other hand, when the number of sampling steps exceeds 20, larger models are found to have higher sampling efficiency.

Performance comparison between distilled and undistilled models

Here, the models are distilled in advance, and the distilled models are compared against the undistilled ones.

Specifically, all distilled models were tested with 4-step sampling, and each was compared to an undistilled model at the same normalized sampling cost.

The results are as follows.

The results show that distillation significantly improves the generation performance of all models in 4-step sampling and improves the overall FID score. However, at a sampling cost of about 8, the smaller undistilled 83M model achieved performance comparable to the larger distilled 866M model.

This result further supports the finding on the scaling of LDM sampling efficiency, which also holds in the context of diffusion model distillation.

Summary

This article introduced a study that investigated the scaling properties of latent diffusion models (LDMs).

One limitation of this study is that "the claims about the scalability of the models presented in this study are limited to the specific model families investigated in this study."

In other words, the findings may hold only because of the Stable Diffusion architecture used in the study.

In my personal opinion, those with limited hardware need not force themselves to use a larger model (although I doubt they could run it in the first place).