Learn How OpenAI Trained Its 12-billion Parameter Text-to-image Generator: DALL-E
3 main points
✔️ A 12-billion parameter text-to-image generation model and a dataset of 250 million image-caption pairs.
✔️ Several techniques for training such a large model.
✔️ Zero-shot samples preferred by human evaluators at least 90% of the time for realism and caption accuracy on MS-COCO captions.
Zero-Shot Text-to-Image Generation
Written by Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever
(Submitted 24 Feb 2021 (this version), latest version 26 Feb 2021 (v2))
Comments: Accepted to arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Since the DRAW model first introduced text-to-image generation, the field has improved tremendously. Techniques such as GANs, self-attention, and auxiliary losses have made it possible for these models to generate images with high visual fidelity and to generalize zero-shot. Nevertheless, these models are still prone to artifacts such as incorrect object placement, unnatural blending, and object distortion.
The autoregressive transformer has shown impressive results when model size, data, and compute are scaled appropriately. Text-to-image generation models up to now have been trained and evaluated on relatively small datasets such as MS-COCO and CUB. So it is worthwhile to see whether scaling the model size and the dataset improves performance.
To find out how far scaling goes, OpenAI trained a 12-billion parameter model on 250 million text-image pairs collected from the Internet. As shown in the figure above, the resulting model is robust: it generates high-quality images for MS-COCO captions zero-shot, without using any of the training labels, and is competitive with previous models trained specifically on that dataset.
The training set consists of 250 million image-text pairs collected from the Internet, including Conceptual Captions and a filtered subset of YFCC100M. It does not include MS-COCO itself, but because MS-COCO was built from YFCC100M, a fraction of the MS-COCO validation images (without their captions) does appear in the training data.
Two-Stage Training Process
We train an autoregressive transformer in which the text and image pass through the model as a single stream of data. However, even a 256x256 RGB image forms a sequence of 256x256x3 values when modeled pixel by pixel, which demands a great deal of computation and memory. Also, unlike CNNs, self-attention has no built-in bias toward local image features, so much of the model's capacity would be spent on short-range pixel detail rather than on the structure that makes objects visually recognizable. Therefore, a discrete variational autoencoder (dVAE) is used to compress each image into a 32x32 grid of tokens, each taking one of 8192 possible values.
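The saving from the dVAE can be checked with a few lines of arithmetic. This is only an illustration of the numbers quoted above; the variable names are ours.

```python
# Sequence length the transformer would see per image, with and without the dVAE.
PIXEL_RES = 256        # image height and width in pixels
CHANNELS = 3           # RGB
GRID = 32              # side length of the dVAE token grid
CODEBOOK_SIZE = 8192   # possible values per image token

pixel_sequence_length = PIXEL_RES * PIXEL_RES * CHANNELS
token_sequence_length = GRID * GRID
compression_factor = pixel_sequence_length // token_sequence_length

print(pixel_sequence_length)   # 196608
print(token_sequence_length)   # 1024
print(compression_factor)      # 192
```

The image part of the sequence shrinks by a factor of 192, at the cost of each position now ranging over 8192 values instead of 256 intensity levels.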
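The training process can be viewed as maximizing the evidence lower bound (ELBO) on the joint likelihood of the model distribution over images x, captions y, and tokens z for the encoded RGB image. This joint distribution is factorized into the dVAE decoder distribution pθ over images and the transformer prior pψ over captions and tokens.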
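In LaTeX notation, using the symbols defined in this section, the factorization is:

```latex
p_{\theta,\psi}(x, y, z) = p_\theta(x \mid y, z)\, p_\psi(y, z)
```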
Assuming that the caption y is conditionally independent of the image x given the tokens z, the joint log-likelihood of images and captions is bounded from below by an expected reconstruction term minus β times a KL term.
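Written out in LaTeX with the same symbols (pθ for the dVAE decoder, qφ for the dVAE encoder, pψ for the prior, β for the KL weight), the lower bound is:

```latex
\ln p_{\theta,\psi}(x, y) \;\ge\;
\mathbb{E}_{z \sim q_\phi(z \mid x)}
\Big( \ln p_\theta(x \mid y, z)
      - \beta\, D_{\mathrm{KL}}\big( q_\phi(y, z \mid x),\; p_\psi(y, z) \big) \Big)
```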
In the first stage, the ELBO is maximized with respect to φ and θ, which corresponds to learning the visual codebook. pθ is the distribution over RGB images generated by the dVAE decoder given the image tokens and captions. qφ is the distribution over the image tokens generated by the dVAE encoder given the RGB image. In the second stage, the ELBO is maximized with respect to ψ, which corresponds to learning the prior. Here, pψ is the joint distribution over the text and image tokens modeled by the transformer. Although the above bound strictly holds only for β = 1, larger values were found to be beneficial in practice. (DKL denotes the KL divergence.)
1) Training the Visual Codebook
In this stage, we train a discrete variational autoencoder (dVAE) using only the RGB images. As discussed earlier, the dVAE encoder encodes each RGB image into a 32x32 grid of tokens drawn from K = 8192 codebook vectors. We use the Adam optimizer to maximize the ELBO. Because qφ is a discrete distribution, the reparameterization trick cannot be applied directly, so we use the Gumbel-softmax relaxation to make the objective differentiable. The expectation over qφ is replaced with an expectation over the relaxed distribution, where the temperature T controls the amount of relaxation and the relaxation becomes tight as T approaches 0.
During training, T is annealed to 1/16, which brings the relaxed validation ELBO close to the true validation ELBO. Using 1x1 convolutions at the end of the encoder and at the beginning of the decoder (near the relaxation operation) was also found to improve the approximation of the true ELBO. Furthermore, the outgoing activations of the encoder and decoder resblocks are multiplied by a small constant to keep training stable at initialization.
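A minimal sketch of the Gumbel-softmax relaxation for a single grid position, using only the Python standard library. The function name, the toy logits over four codebook entries, and the seed are our assumptions for illustration; the actual model relaxes over 8192 entries at each of the 1024 positions.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed one-hot sample over codebook entries."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U uniform in (0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0, 0.0]   # toy encoder logits for one position
# Same seed for both calls so the two samples share the same Gumbel noise.
soft = gumbel_softmax(logits, 1.0, random.Random(0))
sharp = gumbel_softmax(logits, 1.0 / 16, random.Random(0))
print(soft)
print(sharp)   # much closer to a one-hot vector
```

With the noise held fixed, lowering the temperature pushes the sample toward a one-hot vector, which is why a small temperature makes the relaxed objective track the discrete one more closely.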
2) Training the Prior
The prior pψ is modeled using a 12-billion parameter sparse transformer. Each image is encoded as 32 × 32 = 1024 tokens with a vocabulary size of 8192; the tokens are obtained by taking the argmax of the dVAE encoder logits, without adding any Gumbel noise. The caption of each text-image pair is BPE-encoded using at most 256 tokens with a vocabulary size of 16,384. The text and image tokens are concatenated and modeled together as a single stream.
The transformer is a decoder-only model with 64 self-attention layers, each with 62 attention heads and a per-head state size of 64. Each image token can attend to all text tokens in any of these layers. Three kinds of self-attention masks are used: the standard causal mask for text-to-text attention, and a row, column, or convolutional attention mask for image-to-image attention.
The maximum length of a caption is 256 tokens. Shorter captions are padded to that length, and instead of a single shared padding token, a separate padding token is learned for each of the 256 text positions and used only when no text token is available at that position. The Adam optimizer was used for training. Since the objective is mostly image modeling, the cross-entropy losses for text and image tokens are combined in a weighted sum with weights 1/8 and 7/8 respectively.
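The construction of the token stream can be sketched as follows. The helper names are hypothetical, and shifting image ids by the text vocabulary size so that the two vocabularies do not collide is our assumption, not something the article states. Padding of short captions is left out of this sketch.

```python
TEXT_VOCAB = 16384     # BPE vocabulary size for captions
IMAGE_VOCAB = 8192     # dVAE codebook size
MAX_TEXT_LEN = 256     # maximum caption length in tokens
GRID = 32              # image tokens form a GRID x GRID grid
FULL_LENGTH = MAX_TEXT_LEN + GRID * GRID   # longest possible stream

def argmax_tokens(logits_per_position):
    """Pick the highest-scoring codebook index at each position (no Gumbel noise)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits_per_position]

def build_stream(text_tokens, image_tokens):
    """Concatenate caption tokens and image tokens into one sequence."""
    if len(text_tokens) > MAX_TEXT_LEN:
        raise ValueError("caption is longer than MAX_TEXT_LEN tokens")
    # Assumption: image ids are offset so they occupy a separate id range.
    return list(text_tokens) + [TEXT_VOCAB + t for t in image_tokens]

# Toy example: 4 positions, 3 codebook entries each.
toy_logits = [[0.1, 2.0, -1.0], [3.0, 0.0, 0.5], [0.0, 0.0, 4.0], [1.5, 1.0, 0.2]]
image_tokens = argmax_tokens(toy_logits)
stream = build_stream([17, 502, 9], image_tokens)
print(image_tokens)   # [1, 0, 2, 0]
print(stream)         # [17, 502, 9, 16385, 16384, 16386, 16384]
print(FULL_LENGTH)    # 1280
```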
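To make the idea of combining mask types concrete, here is a deliberately simplified sketch for a tiny stream. It assumes a causal text mask, full image-to-text visibility, and a "row" mask in which an image token sees only earlier image tokens in its own grid row. The exact row, column, and convolutional masks of the real model are defined in the paper and may differ from this simplification; `build_mask` is a hypothetical helper.

```python
def build_mask(text_len, grid):
    """Return mask[q][k] == True when query position q may attend to key position k."""
    n = text_len + grid * grid
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):            # causal: never attend to later positions
            if k < text_len:
                # Earlier text tokens are visible to text and image queries alike.
                mask[q][k] = True
            else:
                # k is an image position, so q (>= k) is an image position too.
                q_row = (q - text_len) // grid
                k_row = (k - text_len) // grid
                mask[q][k] = q_row == k_row   # simplified row attention
    return mask

mask = build_mask(text_len=2, grid=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

For a caption of 2 tokens and a 2x2 image grid, the last two rows of the printout show image tokens in the second grid row attending to both text tokens but not to the first grid row.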
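The weighted objective can be illustrated on toy predictions. The function names and the probabilities below are made up for the example; each loss is averaged over the tokens of its own kind before the 1/8 and 7/8 weights are applied.

```python
import math

TEXT_WEIGHT = 1 / 8
IMAGE_WEIGHT = 7 / 8

def cross_entropy(probabilities, targets):
    """Mean negative log-likelihood of the target token ids."""
    total = sum(-math.log(p[t]) for p, t in zip(probabilities, targets))
    return total / len(targets)

def combined_loss(text_probs, text_targets, image_probs, image_targets):
    """Weighted sum of the text and image cross-entropy losses."""
    text_loss = cross_entropy(text_probs, text_targets)
    image_loss = cross_entropy(image_probs, image_targets)
    return TEXT_WEIGHT * text_loss + IMAGE_WEIGHT * image_loss

# Toy predictions: one text position and two image positions, two classes each.
text_probs, text_targets = [[0.5, 0.5]], [0]
image_probs, image_targets = [[0.25, 0.75], [0.25, 0.75]], [1, 1]
loss = combined_loss(text_probs, text_targets, image_probs, image_targets)
print(loss)
```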
1) Difficulty in Using Mixed Precision Training.
To save GPU memory, it is desirable to store most parameters, Adam moments, and activations in 16-bit precision. However, mixed-precision training of models with more than one billion parameters is extremely challenging. The 16-bit floating-point format has only 5 exponent bits, and as the model gets deeper and wider, the exponents of the activation gradients fall outside the range this format can represent. In particular, the norms of the activation gradients decrease from the earlier resblocks to the later ones, so gradients in the later resblocks get rounded to zero. This is called underflow, and it was one of the biggest causes of instability.
This problem is solved by using a separate gradient scale for each resblock in the model. It can be described using the above image. The solid lines represent the forward pass and the dashed lines represent the backward pass. Each incoming gradient is scaled and filtered, converted to 16-bit, and then unscaled and filtered again before it leaves the resblock. The filter operation sets NaN and Inf values to 0. Without it, a non-finite event in one resblock would cause the gradient scales of all preceding resblocks to drop unnecessarily, which results in underflow.
Parameter sharding was used to train the model, whose parameters alone take about 24 GB of memory in 16-bit precision, more than fits on a single GPU. Each parameter array is sharded (divided) among the 8 GPUs in a node. On each GPU, the activations of the current block are computed while an all-gather simultaneously prefetches the parameter shards for the next block. Similarly, during the backward pass, the activations and gradients of the current block are computed while an all-gather prefetches the parameter shards for the previous block. When all 8 GPUs in the node have finished computing gradients with respect to an all-gathered parameter, a reduce-scatter averages the gradients across the 8 GPUs and leaves each GPU with the gradients for its own parameter shard only.
Communication between GPUs on the same node is fast enough to be overlapped with these heavy computations. The major challenge is inter-node communication, which delays the averaging of gradients computed on different nodes. This problem was solved with a gradient compression technique based on PowerSGD. Each GPU computes low-rank factors for the gradients of its parameter shard independently of its neighboring GPUs. After that, an error buffer stores the difference between the uncompressed gradient averaged over the GPUs in the same node and the gradient reconstructed from the low-rank factors. Nodes then only need to exchange the small low-rank factors instead of the large uncompressed gradients, which reduces the communication overhead.
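Underflow in the 16-bit format is easy to reproduce with the standard library, which can round a value to IEEE half precision through the `struct` module's `"e"` format. The gradient value and the scale below are arbitrary numbers chosen for the demonstration.

```python
import struct

def to_float16(value):
    """Round a Python float to IEEE half precision and convert it back."""
    return struct.unpack("e", struct.pack("e", value))[0]

tiny_gradient = 1e-8                      # smaller than any nonzero float16 value
print(to_float16(tiny_gradient))          # 0.0, the gradient has underflowed

scale = 2.0 ** 13                         # illustrative gradient scale
scaled = to_float16(tiny_gradient * scale)   # now inside the float16 range
recovered = scaled / scale                   # unscale in higher precision
print(recovered)                          # close to 1e-8 again
```

Multiplying by a scale before the cast and dividing afterwards keeps the information that a plain cast would lose, which is the basic idea behind gradient scaling.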
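The scale, filter, cast, unscale, filter sequence can be sketched for a single resblock as follows. This is a toy model under stated assumptions: the resblock's own backward computation is replaced by an identity, gradients are plain Python lists, and the dynamic adjustment of the gradient scale is omitted. All function names are hypothetical.

```python
import math
import struct

def to_float16(value):
    """Round to IEEE half precision; values too large for float16 become +/-inf."""
    try:
        return struct.unpack("e", struct.pack("e", value))[0]
    except OverflowError:
        return math.copysign(math.inf, value)

def filter_nonfinite(grad):
    """Set NaN and Inf entries to zero."""
    return [g if math.isfinite(g) else 0.0 for g in grad]

def backward_through_resblock(incoming_grad, grad_scale):
    """Pass a gradient through one resblock using that block's own gradient scale."""
    scaled = filter_nonfinite([g * grad_scale for g in incoming_grad])
    # Stand-in for the block's 16-bit backward computation (identity here).
    half = [to_float16(g) for g in scaled]
    unscaled = [g / grad_scale for g in half]
    return filter_nonfinite(unscaled)

incoming = [1e-8, float("nan"), 1e3]
outgoing = backward_through_resblock(incoming, grad_scale=2.0 ** 13)
print(outgoing)   # tiny value survives, NaN and the overflowed entry become 0.0
```

In this example the third entry overflows once scaled, turns into Inf in 16-bit, and is zeroed by the second filter, so no non-finite value leaves the block.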
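The two collective operations can be simulated in a single process to show what each GPU holds before and after. This sketch uses plain lists in place of GPU buffers, a toy array of 16 parameters, and made-up gradients; it does not model the overlap of communication with computation.

```python
NUM_GPUS = 8

def shard(parameters, num_shards):
    """Split a parameter array into equally sized shards, one per GPU."""
    size = len(parameters) // num_shards
    return [parameters[i * size:(i + 1) * size] for i in range(num_shards)]

def all_gather(shards):
    """After all-gather, every GPU holds the full parameter array."""
    full = [value for s in shards for value in s]
    return [list(full) for _ in shards]

def reduce_scatter(per_gpu_grads):
    """Average full-size gradients across GPUs; GPU i keeps only slice i."""
    n = len(per_gpu_grads)
    length = len(per_gpu_grads[0])
    averaged = [sum(g[j] for g in per_gpu_grads) / n for j in range(length)]
    size = length // n
    return [averaged[i * size:(i + 1) * size] for i in range(n)]

params = [float(i) for i in range(16)]
shards = shard(params, NUM_GPUS)          # each GPU stores 2 of the 16 parameters
gathered = all_gather(shards)             # each GPU temporarily sees all 16
# Toy gradients: GPU r computes a gradient of (r + 1) for every parameter.
grads = [[float(r + 1)] * len(params) for r in range(NUM_GPUS)]
grad_shards = reduce_scatter(grads)       # each GPU keeps 2 averaged gradients
print(shards[3])        # [6.0, 7.0]
print(grad_shards[3])   # [4.5, 4.5]
```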
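The interplay of low-rank factors and the error buffer can be sketched with a rank-1 approximation obtained from a single power-iteration step. This is a PowerSGD-style toy under our own assumptions (rank 1, one gradient matrix, pure Python lists, no actual communication), not the configuration used to train the model; all helper names are hypothetical.

```python
import math

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def decompress(p, q):
    """Rebuild the rank-1 matrix p q^T from its two factors."""
    return [[pi * qj for qj in q] for pi in p]

def compress_with_error_feedback(grad, error_buffer, q):
    """Return rank-1 factors (p, q_new) and the updated error buffer."""
    corrected = add(grad, error_buffer)            # re-inject what was lost earlier
    p = matvec(corrected, q)
    norm = math.sqrt(sum(x * x for x in p)) or 1.0
    p = [x / norm for x in p]                      # normalized left factor
    q_new = matvec(transpose(corrected), p)        # right factor
    new_error = subtract(corrected, decompress(p, q_new))
    return p, q_new, new_error

grad = [[2.0, 4.0], [1.0, 2.0], [3.0, 6.0]]        # a rank-1 toy gradient
zeros = [[0.0, 0.0] for _ in grad]
p, q_new, error = compress_with_error_feedback(grad, zeros, q=[1.0, 1.0])
print(decompress(p, q_new))   # approximately equal to grad
print(error)                  # approximately all zeros
```

Only `p` and `q_new` would need to travel between nodes: m + n numbers for an m x n gradient instead of m * n. For gradients that are not exactly low rank, the residual kept in the error buffer is added back at the next step rather than discarded.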
Our model was compared with several prior state-of-the-art models: AttnGAN, DM-GAN, and DF-GAN. The above diagram shows a comparison of samples generated by these models. Moreover, to test how realistic the images are, human reviewers were asked to inspect images generated by our model, DALL-E, and by the other models.
Our model beats DF-GAN by a wide margin in terms of both realism and accuracy on captions from the MS-COCO dataset. Accuracy here means how well the image matches the caption: reviewers picked DALL-E's sample as the better match 93.3% of the time.
However, our model did not perform very well on the CUB dataset (shown above). There is a gap of nearly 40 FID points between our model and the best-performing model. It seems that the zero-shot approach does not fit the specialized distribution of the CUB dataset well, and fine-tuning could be an option for addressing this.
There was a lot of hype around DALL-E when OpenAI first announced the model and released some of the images it generated, and rightly so (do you remember the armchair in the shape of an avocado?). Compared to previous models, the results generated by this text-to-image generator are very impressive, even in the zero-shot setting. As with the GPT models, it shows that scaling the model, the training data, and the compute, combined with careful training, can yield significant performance improvements for deep learning models. In some cases, the model generates excellent images relevant to the captions, with a hint of what we might call 'creativity'. It may not be long before such text-to-image models are used to generate scenes for actual movies, to create advertisement posters, to design objects (like the armchair), and much more.