# Exploring The Limits Of Large Scale Pre-training Models!

3 main points
✔️ Investigate downstream task performance of large-scale pre-training models through extensive experiments
✔️ Show that improving upstream task performance saturates downstream task performance
✔️ Investigate the above saturation phenomenon in detail

written by Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi
(Submitted on 5 Oct 2021)

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

The images used in this article are from the paper, the introductory slides, or were created based on them.

## introduction

In machine learning today, a common approach is to perform large-scale pre-training on an upstream task such as ImageNet and then adapt the model to downstream tasks. However, does making the pre-trained model larger, or improving its performance on the upstream task, necessarily improve performance on the downstream task?

The paper presented in this article reports over 4800 large-scale experiments on Vision Transformer, MLP-Mixer, and ResNet models with parameter counts ranging from 10 million to 10 billion, and investigates in detail the relationship between model scaling, hyper-parameter and architecture selection, and downstream task performance.

The results include interesting findings that differ from existing studies, such as that downstream task performance saturates as upstream task performance improves. (This paper was accepted as a Spotlight at ICLR 2022.)

## experimental setup

In our experiments, we perform pre-training on an upstream task with a large amount of data and evaluate the performance of downstream tasks by few-shot learning and fine-tuning. For the upstream task, we use JFT-300M with 303M images and 18k classes, or ImageNet21K with 14M images and 21k classes. The downstream tasks are as follows.

See Appendix G of the original paper for more details on the training.
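To make the few-shot setting concrete, here is a minimal sketch of how a k-shot subset (for example, 25 examples per class) could be drawn from a labelled downstream dataset. The function name `sample_few_shot`, the seed handling, and the toy labels are assumptions for illustration and are not taken from the paper.

```python
import random
from collections import defaultdict


def sample_few_shot(labels, shots, seed=0):
    """Return sorted indices of `shots` randomly chosen examples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < shots:
            raise ValueError(f"class {label!r} has only {len(indices)} examples")
        # Sample without replacement inside each class.
        chosen.extend(rng.sample(indices, shots))
    return sorted(chosen)


# Toy labels (hypothetical): 3 classes with 40 examples each.
labels = [c for c in ("a", "b", "c") for _ in range(40)]
subset = sample_few_shot(labels, shots=25)
```

The selected indices would then be used to train the downstream classifier on top of the pre-trained representation, while the remaining data is held out for evaluation.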

## experimental results

### Saturation phenomenon of downstream task accuracy

First, we examine how performance improvements in the upstream task affect performance in the downstream task. The results for upstream (US) and downstream (DS) task performance are shown below for models pre-trained on the upstream task (JFT) and evaluated on the downstream task in a few-shot (25-shot) setting.

The upstream (US) and downstream (DS) task performance results are also shown below when we pre-trained on the upstream task (JFT, ImageNet) and evaluated on the downstream task in the few-shot (1-shot or 25-shot) setting.

Let $e_{DS}$ and $e_{US}$ denote the error (1 - accuracy) of the downstream and upstream tasks, respectively. The following relation then holds:

$$e_{DS} = k\,(e_{US})^{\alpha} + e_{IR}$$

Here $e_{IR}$ is the value of the downstream task error when the upstream task error is zero (accuracy is 1), that is, the saturation value, and $k,\alpha$ are constants. As the plot of $1-e_{IR}$ in the figure above shows, the relationship between upstream and downstream task performance is not linear, and the downstream task performance saturates at some point ($1-e_{IR}$) below 1.
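As an illustration of how such constants could be estimated, the sketch below assumes the relation has the power-law-plus-offset form $e_{DS} = k\,(e_{US})^{\alpha} + e_{IR}$ and fits it to synthetic data. The fitting procedure (a grid scan over candidate $e_{IR}$ values with a log-log least-squares line for each) is an assumption made for this example, not necessarily the procedure used in the paper, and the data points are generated, not measured.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept; returns (slope, intercept, sse)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def fit_saturating_power_law(e_us, e_ds, grid_size=2000):
    """Fit e_ds ~= k * e_us**alpha + e_ir by scanning candidate e_ir values."""
    log_us = [math.log(u) for u in e_us]
    upper = min(e_ds)  # e_ir must stay below every observed downstream error
    best = None
    for i in range(grid_size):
        e_ir = upper * i / grid_size  # candidates in [0, upper)
        # With e_ir fixed, log(e_ds - e_ir) = log(k) + alpha * log(e_us) is linear.
        log_resid = [math.log(d - e_ir) for d in e_ds]
        alpha, log_k, sse = fit_line(log_us, log_resid)
        if best is None or sse < best[0]:
            best = (sse, math.exp(log_k), alpha, e_ir)
    _, k, alpha, e_ir = best
    return k, alpha, e_ir


# Synthetic, noise-free data generated from known constants (illustrative only).
true_k, true_alpha, true_e_ir = 0.8, 1.5, 0.2
e_us = [0.9, 0.7, 0.5, 0.4, 0.3, 0.2]
e_ds = [true_k * u ** true_alpha + true_e_ir for u in e_us]

k, alpha, e_ir = fit_saturating_power_law(e_us, e_ds)
print(f"k={k:.3f}, alpha={alpha:.3f}, e_IR={e_ir:.3f}")
```

On real, noisy measurements a nonlinear least-squares routine would likely be a better choice; the grid scan is used here only to keep the example dependency-free.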

In the following, we will investigate this $e_{IR}$ further.

### Relationship between $e_{IR}$ and upstream and downstream tasks

In the above figure, the performance varied greatly depending on the type of downstream task and the number of shots. The variation of $e_{IR}$ with respect to the number of shots was as follows.

The correlation between $k,\alpha,e_{IR}$ and the number of shots for various upstream and downstream tasks is as follows.

In general, we find that $k$ and $e_{IR}$ are negatively correlated with the number of shots, while $\alpha$ is positively correlated with it.

### The effects of scaling

We then conduct experiments to see how upstream and downstream task performance varies with dataset size, model size, and the number of epochs. The results are as follows.

The results for more downstream tasks are also shown below.

In general, changes in dataset size, model size, and the number of epochs affect upstream task accuracy, but these parameters appear to have little direct effect on downstream task accuracy beyond that. This is indicated by the fact that the downstream task accuracy plots lie on nearly identical curves regardless of which of the three parameters is varied.

As in previous experiments, we also found that downstream task accuracy tended to saturate as upstream task accuracy was increased, and the degree of saturation varied depending on the type of downstream task.

### The relationship between upstream and downstream task accuracy

Previous experiments have shown that the degree of saturation varies depending on the downstream task. We now examine why saturation occurs earlier in some downstream tasks than in others. First, the Spearman rank correlation coefficients between upstream and downstream task accuracy were as follows.
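As a reminder of what this coefficient measures, Spearman's rank correlation is the Pearson correlation of the ranks of two variables, so it captures how consistently one quantity increases with the other. The following is a minimal pure-Python sketch; the accuracy values are made-up toy numbers, not results from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy accuracies (hypothetical) for five pre-trained models.
us_acc = [0.30, 0.40, 0.50, 0.60, 0.70]
ds_tracks_us = [0.50, 0.62, 0.70, 0.75, 0.78]    # improves whenever upstream improves
ds_saturated = [0.80, 0.85, 0.84, 0.86, 0.83]    # fluctuates around a plateau

rho_tracking = spearman(us_acc, ds_tracks_us)
rho_saturated = spearman(us_acc, ds_saturated)
```

In this toy data the first downstream task has a rank correlation of 1 with the upstream task, while the plateaued one has a much lower value, which mirrors the kind of contrast the comparison is meant to reveal.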

We can see that UC-Merced and col_hist, for which downstream task accuracy began to saturate earlier, tend to have relatively low correlation coefficients with upstream task accuracy. Next, when representations from different layers were used for the downstream task, the downstream task accuracy was as follows (the upstream task is JFT).

Here, the green circles correspond to plots for three different parameters (dataset size, model size, and the number of epochs).

These figures show that for downstream tasks similar to the upstream task (e.g., ImageNet), the later the layer representation, the higher the downstream task accuracy, while for downstream tasks where saturation occurred earlier, such as UC-Merced and col_hist, the last layer representation was not necessarily optimal. Given that previous studies have argued that the lower layers capture features that are common across different datasets/tasks, the saturation of performance in the downstream task may be because the network pre-trained in the upstream task lacks features that are necessary for good performance in the downstream task. The reason for this may be due to