Apple Developed A Large Scale Autoregressive Image Model That Is Scalable Like An LLM.

Computer Vision

3 main points
✔️ Proposes an autoregressive learning image model, AIM, as an image version of LLM
✔️ The quality of the pre-trained image features improves with model size and the quantity of training data, and downstream task performance improves with pre-training performance
✔️ A 7-billion-parameter AIM pre-trained on 2 billion images achieved 84% accuracy on ImageNet-1k, with no sign of performance saturation

Scalable Pre-training of Large Autoregressive Image Models
written by Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M. Susskind, Armand Joulin
(Submitted on 16 Jan 2024)
Comments: Published on arxiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.


In 2012, AlexNet, a convolutional neural network (CNN) from the University of Toronto, won the ImageNet image recognition competition. This success raised expectations that deeper networks would yield better recognition accuracy, and deep learning research quickly took off.

Although adding layers initially degraded performance because of vanishing gradients during training, the shortcut (residual) connections of ResNet, published by Microsoft Research in 2015, showed that performance can keep improving even at depths of up to 152 layers.

Image recognition thus drove models toward ever larger sizes, and in 2017 the Transformer was introduced as an architecture for natural language processing.

In 2022, ChatGPT, a prototype chat service built on a Transformer-based large-scale language model, was released by OpenAI and proved remarkably capable. With no sign of a performance ceiling as the number of model parameters grew, a race toward ever larger models began, leading to an explosion of research on large-scale language models.

Looking back, deep learning first matured in image recognition research, and the push toward large-scale generative models, especially Transformers, then took place in language processing research.

This paper, in a sense, brings that development full circle: it is about building a large-scale "image" model, rather than a large-scale "language" model, on top of the Transformer. This does not mean, however, that the Transformer has never been applied to image recognition.

A well-known application of the Transformer to image recognition is the Vision Transformer (ViT), published by Google in 2021. That study showed that the Transformer can replace CNNs when enough image data is available for pre-training.

The Vision Transformer, like the large-scale language model, is based on the Transformer, but its pre-training method differs. Large-scale language models are pre-trained by autoregressive learning: no labels are provided from outside; instead, the model predicts the next word of a sentence from the words shown so far. The Vision Transformer, in contrast, is pre-trained by supervised learning, in which labels (image classes) are given as a set with the training images.

In the paper presented here, the model is based on the Vision Transformer, but Apple investigated whether, as with large-scale language models, autoregressive pre-training can make image recognition accuracy keep improving as the number of model parameters and the amount of training data grow.

We will now describe the proposed Autoregressive Image Models (AIM) and their evaluation results.

Autoregressive Image Model (AIM)

Pre-training Flow

Figure 1 shows the pre-training flow using the proposed method, the autoregressive image model (AIM).

Figure 1: AIM pre-training flow

The Transformer's autoregressive learning predicts, at each step, the next word of a sentence from the words shown so far, starting from the left. To apply this to images, an image must be represented like a sentence composed of words. AIM therefore divides the input image into non-overlapping patches (partial regions), as shown in Figure 1, orders them, and learns to predict the next patch.

Each patch cut from the original image is reduced in dimension by a linear map to extract features. The patch features are fed to the Transformer, which applies self-attention (updating each patch's features with the surrounding context, using the prefix causal mask described below). An MLP then predicts the next patch image from the extracted features.
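As a minimal sketch of the patchify-and-embed step, the NumPy snippet below cuts a toy image into non-overlapping patches in raster order and applies a linear map. The 6×6 image, 2×2 patch size, and 8-dimensional embedding are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def patchify(image, patch):
    """Split an image (H, W, C) into non-overlapping patches in raster order.

    Returns an array of shape (num_patches, patch*patch*C), one flattened
    patch per row, ordered left-to-right, top-to-bottom.
    """
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    rows, cols = H // patch, W // patch
    patches = (image
               .reshape(rows, patch, cols, patch, C)
               .transpose(0, 2, 1, 3, 4)           # (rows, cols, patch, patch, C)
               .reshape(rows * cols, patch * patch * C))
    return patches

rng = np.random.default_rng(0)
img = rng.random((6, 6, 3))                  # toy 6x6 RGB image
patches = patchify(img, patch=2)             # 3x3 grid -> 9 patches, as in Figure 1
W_embed = rng.random((patches.shape[1], 8))  # linear map to 8-dim patch features
tokens = patches @ W_embed                   # (9, 8): one token per patch
```

The raster ordering (row by row from the top left) matches the numbering of the patches in Figure 1.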

The model predicts the next patch image in raster order (the order of the numbers assigned to each patch in Figure 1, read row by row from the top).

The MLP outputs in Figure 1 show that patch 2 is predicted given patch 1, patch 3 given patches 1 and 2, and so on. Since no prediction can be made before any patch is shown, and there is nothing left to predict once patches up to 9 have been given, the predictions run from patch 2 through patch 9.

The error function (autoregressive objective) between the predictions and the correct answers for the second through ninth patches output by the model is shown in Equation 1.

Equation 1. Loss function for autoregressive pre-training:

min_θ (1/K) Σ_{k=1}^{K} ‖ x̂_k(θ) − x_k ‖²

Here x̂ (x with a hat) is the prediction of the k-th patch by the AIM with model parameters θ, and x without the hat is the correct patch. θ is learned to minimize the average over patches of the sum of squared errors between the predicted and correct pixel-value vectors. (As Figure 1 shows, the model's outputs are the second through ninth patches, so in effect the sum starts at k = 2.)
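The loss described above can be sketched directly in NumPy. The 8 predicted patches and 12 pixel values per patch are toy numbers chosen to match the 9-patch example of Figure 1 (predictions cover patches 2 through 9); the dummy predictions are simply the true patches shifted by 0.1.

```python
import numpy as np

def autoregressive_loss(pred_patches, true_patches):
    """Mean over patches of the sum of squared pixel errors (Equation 1)."""
    sq_err = np.sum((pred_patches - true_patches) ** 2, axis=1)  # per-patch SSE
    return float(np.mean(sq_err))

rng = np.random.default_rng(1)
true_p = rng.random((8, 12))                # ground-truth patches 2..9, 12 pixels each
pred_p = true_p + 0.1                       # dummy predictions, each pixel off by 0.1
loss = autoregressive_loss(pred_p, true_p)  # 12 pixels * 0.1^2 = 0.12 per patch
```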

Prefix Causal Mask

Since the task is to predict the next patch from only the patches shown so far, letting the model see information about the next patch would amount to cheating. Self-attention, however, updates each token's features by computing weighted sums over the other tokens in the sequence, so naively attending to all patches would leak exactly this future information.

Therefore, when computing the prediction error during pre-training, a causal mask is applied: when predicting the second patch, self-attention uses only the first patch; when predicting the third, only the first and second; in general, each patch attends only to the patches shown so far.

However, causal masks have a drawback: they do not adapt well to downstream tasks (in this paper, new image recognition tasks different from the pre-training task).

Pre-training is an autoregressive task, so future patches must not be shown. In the downstream image recognition task, by contrast, all patches of the image to be recognized can be seen at once; indeed, it is natural to consider the features of all patches before making a recognition decision. With a plain causal mask during pre-training, however, features that account for multiple patches at once are not learned well. Pre-training should therefore also take the downstream setting into account.

This paper therefore uses a prefix causal mask. Self-attention is applied fully (bidirectionally) over the first patches up to a prefix length sampled from a uniform distribution; beyond the prefix, the usual causal mask applies. The prediction error is computed only for patches beyond the prefix length.
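A small NumPy sketch of the mask just described: each row says which patches a given patch may attend to. In the paper the prefix length is sampled from a uniform distribution; here it is fixed at 3 to match the left panel of Figure 2.

```python
import numpy as np

def prefix_causal_mask(num_patches, prefix_len):
    """Boolean attention mask: mask[i, j] is True if patch i may attend to patch j.

    Patches inside the prefix attend to the whole prefix bidirectionally;
    patches beyond it attend causally (themselves and everything before).
    """
    # Standard causal mask: lower triangle, each patch sees itself and the past.
    mask = np.tril(np.ones((num_patches, num_patches), dtype=bool))
    # Prefix part: full bidirectional attention among the first prefix_len patches.
    mask[:prefix_len, :prefix_len] = True
    return mask

m = prefix_causal_mask(5, prefix_len=3)
# Patch 1 (index 0) may attend to patches 1-3, as in the left panel of Figure 2;
# patch 4 (index 3) attends only to itself and earlier patches.
```

Setting `prefix_len=0` recovers the plain causal mask, and `prefix_len=num_patches` recovers the fully bidirectional attention used at downstream time.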

An image of the prefix causal masked pre-training and downstream adaptation is shown in Figure 2.

Figure 2. Pre-training with the prefix causal mask, and downstream adaptation

The figure resembles a schematic of a neural network, although the paper does not explain what the circles and lines represent.

If we read the figure from the bottom as a three-layer network with five units per layer, the five units of the input layer can be taken to correspond to the five patches. Each horizontal row of circles then shows, in neural-network style, which other patches are involved when a patch's features are updated by self-attention.

The left panel of Figure 2 shows pre-training with the prefix causal mask. A prefix causal mask with a prefix length of 3 is applied, so the first three patches from the left are fully connected to one another.

In other words, the first, second, and third patches each have their features updated from the first through third patches (for the first and second patches, this means their features are updated using what would otherwise be future patches).

The fourth and fifth patches are updated from themselves and the preceding patches (the causal mask applies, so the features of the fourth patch are not updated using the fifth, which lies in its future).

The right panel of Figure 2 illustrates downstream adaptation. In the downstream task it is natural for all patches of the image being recognized to attend to one another fully, so full connectivity is desirable; but if full connectivity were always allowed during pre-training, the model would always be learning in a "cheating" state.

The prefix causal mask can thus be seen as pre-training with a causal mask that partially preserves fully connected feature updates, in anticipation of downstream adaptation.

MLP Prediction Head

Ideally, Transformer pre-training should be able to learn generic image features that apply to a variety of downstream tasks. If image features are learned that are specific to the objective function at the time of pre-training, they will be difficult to adapt to downstream tasks.

Therefore, to increase adaptability to downstream tasks, an MLP (multi-layer perceptron) prediction head is attached to the final layer during pre-training. The MLP prediction head processes each patch independently.

Then, when adapting to downstream tasks, the MLP prediction head from the pre-training is discarded and the remaining Transformer section is used as a general-purpose feature extractor.
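As a toy illustration of "processes each patch independently": the same MLP weights are applied to every patch feature separately, with no mixing across patches. The 8→16→12 layer sizes and the ReLU activation below are illustrative assumptions, not the paper's actual head configuration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Shared 2-layer MLP prediction head: the same weights are applied to every
# patch feature independently (no information flows between patches here).
W1, b1 = rng.random((8, 16)), np.zeros(16)
W2, b2 = rng.random((16, 12)), np.zeros(12)

def mlp_head(tokens):
    """tokens: (num_patches, 8) -> predicted pixel values (num_patches, 12)."""
    h = np.maximum(tokens @ W1 + b1, 0.0)  # ReLU
    return h @ W2 + b2

tokens = rng.random((9, 8))   # Transformer outputs for 9 patches
preds = mlp_head(tokens)      # one predicted patch per token
```

Because the head mixes nothing across patches, context modeling is left entirely to the Transformer below it, which is why the head can be discarded after pre-training.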

Downstream Adaptation

Pre-training large models consumes significant computational resources, and fine-tuning the whole model is also costly.

Therefore, when training on downstream tasks, the weights of the pre-trained Transformer are frozen and only the newly attached classification head is trained. This prevents overfitting to small downstream datasets and reduces the cost of downstream adaptation.
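The frozen-backbone setup can be sketched as follows. This is a toy stand-in, not the paper's architecture: a fixed random linear map plays the role of the frozen Transformer, the head is a single linear layer trained with one softmax cross-entropy gradient step, and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Frozen "backbone": a fixed random linear map stands in for the pre-trained
# Transformer, whose weights are never updated during downstream adaptation.
W_frozen = rng.random((12, 8))
def extract_features(x):
    return x @ W_frozen

# Trainable classification head (a single linear layer, 3 classes).
W_head = np.zeros((8, 3))

def train_head_step(x, y_onehot, lr=0.1):
    """One gradient step on the head only; W_frozen is untouched."""
    global W_head
    feats = extract_features(x)                   # backbone runs, no updates
    logits = feats @ W_head
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    grad = feats.T @ (probs - y_onehot) / len(x)  # softmax cross-entropy gradient
    W_head -= lr * grad

x = rng.random((4, 12))        # 4 toy inputs
y = np.eye(3)[[0, 1, 2, 0]]    # one-hot labels
before = W_frozen.copy()
train_head_step(x, y)          # only W_head changes; W_frozen stays fixed
```

Freezing the backbone means each training step costs only a forward pass through the Transformer plus a tiny head update, which is what keeps adaptation cheap.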

AIM Evaluation Results

The evaluation examines whether AIM shows scalable performance as the number of model parameters and the amount of training data increase.

AIM was pre-trained on 2 billion uncurated images and adapted to 15 downstream tasks (image recognition, i.e., image classification, benchmarks spanning fine-grained recognition, medical images, satellite images, natural environments, and infographics); its average performance over the 15 tasks is shown in Figure 3.

Figure 3. Scalability of AIM's pre-training performance and downstream task performance concerning the number of model parameters.

The graph in Figure 3 examines the scalability of AIM with respect to the number of model parameters. The horizontal axis is the validation loss during pre-training, and the vertical axis is the average performance over the 15 downstream tasks. Each point labeled AIM-*B represents an AIM with a different number of parameters; moving to the right, they denote 600 million, 1 billion, 3 billion, and 7 billion parameters.

The graph confirms that as the number of AIM parameters increases, pre-training performance improves, and downstream task performance improves along with it.

Incidentally, on the ImageNet-1k benchmark, AIM with 7 billion parameters reaches 84.0% accuracy, better than MAE, an existing autoencoder-based method, at 82.2%. On the other hand, DINOv2 outperforms AIM at 86.4%. The paper notes that DINOv2's evaluation benefits by 1 to 1.5% from its use of high-resolution training images. (It is unclear whether the implication is that AIM could surpass DINOv2 if it, too, were trained on higher-resolution images.)

The graph in Figure 4 examines the scalability of AIM with respect to the amount of training data.

Figure 4: Scalability of AIM's downstream task performance concerning the number of training data

The horizontal axis of Figure 4 is the number of unique training images, and the vertical axis is AIM's average performance on the 15 image recognition benchmarks with 600 million model parameters. In other words, it shows the relationship between the amount of training data and downstream task performance at a fixed model size: recognition accuracy improves as the number of training images grows from 1 million to 100 million to 2 billion, with no sign of saturation.

Figures 3 and 4 suggest that AIM improves image recognition accuracy in a scalable manner for increasing numbers of model parameters and training data.


This paper proposed AIM (Autoregressive Image Models), a method based on the Vision Transformer that enables pre-training with an autoregressive objective, so that, as with LLMs, pre-training and downstream image recognition performance improve with the number of model parameters and the amount of training data.

When training the autoregressive objective with a Transformer, a plain causal mask in self-attention fails to exploit relationships across the entire target image and adapts poorly to downstream tasks, so self-attention with a prefix causal mask is used instead.

To learn general-purpose image features that do not overfit the autoregressive objective, the Transformer part is configured as a general-purpose feature extractor, and an MLP attached after it serves as the next-patch predictor.

Furthermore, to reduce adaptation cost, the pre-training MLP is replaced with an image classification MLP, and only this classification MLP is trained when adapting to downstream tasks.

The evaluation showed that pre-training and downstream task performance kept improving up to 2 billion training images and 7 billion model parameters, with no limit to the improvement identified.

If image recognition accuracy indeed keeps improving as image models are scaled up, the race to improve model architectures for accuracy's sake will essentially be over. Historically, obtaining scalable performance gains simply by making deep learning deeper was itself a major challenge, and that now appears to be solved.

If so, performance can be improved simply by investing more resources, and the remaining research directions will be how much recognition accuracy can be achieved with how few resources, and how to apply the technology.

However, compared with large-scale language models, I am concerned that this image version of an LLM inevitably looks limited, handling only image recognition (class classification of given images).

The great thing about large-scale language models is that they can answer a variety of questions through in-context learning; this large-scale image model, by contrast, appears limited to image recognition and still requires training for downstream adaptation. That falls short of the expectation, raised by large-scale language models, of highly accurate answers without task-specific training.

For it to truly be called an image version of an LLM, I would have wanted the surprise of seeing it solve a variety of tasks zero-shot in the same way.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.

Contact Us