
Einstein In Costume? GLIDE, A Powerful Generative Model



3 main points
✔️ Propose GLIDE that can generate diverse and high-resolution images from linguistic instructions
✔️ Generated images faithful to linguistic instructions exceed DALL-E
✔️ Release a mini-model for easy use

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
written by Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen
(Submitted on 20 Dec 2021 (v1), last revised 22 Dec 2021 (this version, v2))
Comments: Published on arxiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Just a year ago, DALL-E, OpenAI's model that generates striking images from text, had a big impact on the world. After that, language-conditional image generation models combining GAN-based generators with CLIP caused a lot of excitement on social media. Content such as illustrations and photographs is easy to describe in language but often takes great effort to produce, which is why these models are attracting so much attention.

Meanwhile, diffusion models have been surpassing GANs in image quality, as introduced in the article "Diffusion Models Beat BigGAN in Image Generation?".

Let's take a look at the images generated by GLIDE. It produces extremely high-quality images that follow complex and detailed language instructions, and we can infer that some of these images do not appear in the training data. For example, I was surprised that it could handle imaginative prompts such as the last one, "an illustration of Einstein in superhero clothes".

Chapter 2 of this article explains the key concept behind the GLIDE model, the language-conditioning method; Chapter 3 presents the experimental results, both quantitative and qualitative.

GLIDE (Linguistic Conditional Diffusion Model)

diffusion model

Most diffusion models used in recent years are based on DDPM (Denoising Diffusion Probabilistic Models).

(Figure from DDPM.) The DDPM diffusion model consists of two processes: the diffusion process and the reverse process. The diffusion process adds Gaussian noise to X_0 in Figure 2 until only pure noise, like X_T, remains. The reverse process starts from X_T, predicts the added noise, and removes it step by step to recover an image like X_0.

The model to be trained is parameterized by θ. It takes a noisy image as input and outputs the mean and variance of a Gaussian distribution. The architecture is a UNet, as in prior work, and the training loss is the prediction error of the Gaussian noise.
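As a rough sketch of this objective (using numpy, with a plain callable standing in for the UNet), the forward noising and the noise-prediction loss look like this:

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the cumulative alpha-bar products used by DDPM."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, noise):
    """Diffusion process: jump directly to step t by mixing the image with Gaussian noise."""
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

def ddpm_loss(model, x0, t, alpha_bar):
    """Simplified DDPM objective: MSE between the true noise and the
    noise predicted by the model (a large UNet in the actual paper)."""
    noise = np.random.randn(*x0.shape)
    x_t = q_sample(x0, t, alpha_bar, noise)
    return np.mean((model(x_t, t) - noise) ** 2)
```

Here `model` is any callable `(x_t, t) -> predicted noise`; this is only an illustration of the loss, not the paper's implementation.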

conditional diffusion model

The diffusion model is a simple model that takes an image x_t as input and predicts the mean µ and variance σ of the Gaussian distribution. Two types of conditional diffusion models built on it have been proposed; we introduce each of them below.

One is to apply a conditional restriction on the mean μ (the equation above). The gradient of a classifier (with parameters φ) with respect to the input x_t is added to the Gaussian mean µ, steering samples toward images the classifier assigns a high probability of label y. Here s is a hyperparameter that controls the strength of the classifier's guidance. This method requires a classifier prepared separately from the diffusion model. The advantage is that any classifier can be plugged into an already-trained diffusion model; the disadvantage is the cost of preparing two models.
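A minimal numpy sketch of this mean shift (the gradient here is a toy stand-in for ∇_x log p_φ(y|x_t), not the output of a real classifier):

```python
import numpy as np

def classifier_guided_mean(mu, sigma2, grad_log_p, s=1.0):
    """Shift the reverse-process Gaussian mean by the classifier gradient:
    mu_hat = mu + s * sigma^2 * grad_x log p_phi(y | x_t).
    Larger s pushes samples harder toward images classified as y."""
    return mu + s * sigma2 * grad_log_p

# Toy demonstration with a hand-made gradient (not from a real classifier).
mu = np.zeros(3)
toy_grad = np.array([1.0, -1.0, 0.0])
shifted = classifier_guided_mean(mu, sigma2=0.5, grad_log_p=toy_grad, s=2.0)
```

In practice the gradient would come from backpropagating through a classifier trained on noisy images.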

The second method, classifier-free guidance, does not use a classifier. Instead, the noise-prediction model is run twice, once conditioned on y and once unconditioned, and the two predictions are combined. Classifier-free guidance therefore needs only two forward passes through a single model and no separate classifier. However, the trained model cannot handle new kinds of conditioning and must be retrained each time.

As described above, both conditioning methods have advantages and disadvantages, and there is a trade-off between them. In this study, the authors experimented with both and compared them.

Language conditional method

In a conditional diffusion model, language conditioning is straightforward. For example, using the CLIP model as the classifier, it can be expressed by the following formula.
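As given in the paper, with f the CLIP image encoder and g the CLIP caption encoder, the CLIP-guided mean takes the form:

$$\hat{\mu}_\theta(x_t \mid c) = \mu_\theta(x_t \mid c) + s \, \Sigma \, \nabla_{x_t}\big(f(x_t) \cdot g(c)\big)$$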

Here, the gradient of the similarity between the text and the image in the CLIP latent space is used for guidance; this is called CLIP guidance.

On the other hand, classifier-free guidance can be realized by using the language instruction c in place of the label y, as in the equation above.


Since the goal of this study is to generate high-resolution images from language instructions, several innovations were incorporated during training. For example, a diffusion model with 3.5 billion parameters generates 64x64 images, and an upsampling diffusion model with 1.5 billion parameters then raises the resolution to 256x256. To evaluate noisy intermediate images with CLIP, a noised CLIP model was also prepared. For more details, see Chapter 4 (Training) of the paper and the official implementation.
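The two-stage cascade can be sketched as follows; the two samplers here are hypothetical callables standing in for the 3.5B base model and the 1.5B upsampler:

```python
def cascade_generate(prompt, sample_base, sample_upsampler):
    """Two-stage cascade: a text-conditional diffusion model produces a
    64x64 image, then a diffusion upsampler (also conditioned on the text)
    raises it to 256x256."""
    low_res = sample_base(prompt)              # 64x64 sample
    return sample_upsampler(prompt, low_res)   # 256x256 sample
```

Cascading keeps the expensive text-conditional model small in resolution while still producing high-resolution outputs.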

quantitative experiment

Figure 6 shows that there is a trade-off between diversity and fidelity in the images generated by the diffusion model. In figure (a), the horizontal axis is fidelity (image quality) and the vertical axis is a diversity metric, which decreases as the plot moves to the right. In other words, as fidelity increases, diversity decreases and only similar images are generated. Classifier-free guidance, which uses no classifier, is plotted above and to the right of CLIP guidance, meaning it achieves better fidelity at the same diversity.

FID (lower is better, reflecting both fidelity and diversity) and IS (higher is better, reflecting image quality) are used as evaluation metrics for the generated images. Figure (b) shows that FID degrades as IS increases, again indicating a trade-off between diversity and fidelity.

In figure (c), the CLIP score measures how well the generated image matches the language instruction. CLIP guidance can raise the CLIP score while maintaining fidelity, which is to be expected, since CLIP similarity is exactly what CLIP guidance optimizes during sampling.

qualitative experiment

In addition to the images shown here, the paper contains many other interesting examples. However, due to growing concerns about safety, OpenAI has released only a smaller model trained on filtered data, with images of people removed. The published model may therefore fail to reproduce the examples shown here, and may be unable to generate people at all.


Text to Image

The first image shows a striking level of realism, down to the reflection in the water. The second places a fox into the masterpiece "The Starry Night", showing what a powerful tool GLIDE can be for artists creating more diverse works.

Image Editing

Exploiting the diffusion model's ability to generate images from noise, image editing can be realized smoothly by masking (noising) the part of the image to be edited before generation. Mask the region you want to change (the green area), input a language instruction, and the model fills it in accordingly. For example, when generating a vase on a desk, the model also produced the vase's shadow, which we believe is proof that this is a powerful image-editing tool.
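The basic masking idea can be sketched like this. Note this is a simplified, generic diffusion-inpainting step; GLIDE itself additionally fine-tunes the model to take the masked image as extra input channels:

```python
import numpy as np

def inpaint_step(x_t, denoise_fn, known_image, renoise_fn, mask, t):
    """One reverse step of mask-based editing: inside the mask (mask == 1)
    keep the model's denoised sample, outside it overwrite with the original
    image re-noised to the matching step, so only the masked region changes."""
    x_prev = renoise_and_edit = denoise_fn(x_t, t)   # model's proposal for step t-1
    x_known = renoise_fn(known_image, t - 1)         # original image at noise level t-1
    return mask * x_prev + (1.0 - mask) * x_known
```

Repeating this for every reverse step yields an image where only the masked region was regenerated according to the language instruction.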


What do you think? How else could GLIDE, which can generate high-quality images from language instructions, be applied?

From DALL-E in January 2021 to GLIDE in December 2021, the performance of language-conditional image generation models has improved greatly. I expect the next research directions to be language-conditional generative models for video and 3D. While this does not mean AI models can understand human language, I believe it is a definite step forward, and I look forward to further development in the future.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
