
Effects Of Model And Prior Learning Scale On Catastrophic Forgetting


Continual Learning

3 main points
✔️ Investigate catastrophic forgetting in pre-trained models
✔️ Demonstrate that larger pre-trained models are more resistant to catastrophic forgetting
✔️ Demonstrate the relationship between catastrophic forgetting and the similarity of class representations in pre-trained models

Effect of scale on catastrophic forgetting in neural networks
written by Vinay Venkatesh Ramasesh, Aitor Lewkowycz, Ethan Dyer
(Submitted on 29 Sep 2021)
Comments: ICLR 2022


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Large-scale pre-trained models have been actively studied in the fields of computer vision and natural language processing.

How does catastrophic forgetting, a long-standing problem in machine learning, affect such pre-trained models?

In this article, we present research investigating catastrophic forgetting in large-scale pre-trained models, including how much it varies with model and dataset size.

Experimental setup

About Tasks

In our experiments, we use the CIFAR-10 and CIFAR-100 datasets in a standard split-task setting. CIFAR-10 is split into two tasks of five classes each, which are learned sequentially. CIFAR-100 is split either into ten tasks of ten classes each or into two tasks of fifty classes each. We also experiment with an input-distribution-shift setting, in which we sample a fixed subset of 20 superclasses and draw different subclasses for each task.
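The split-task construction above can be sketched as follows. This is a minimal, generic helper for partitioning a labeled dataset into class-disjoint tasks; the function name and the synthetic labels are illustrative, not from the paper.

```python
import numpy as np

def split_by_classes(labels, tasks):
    """Partition example indices into class-disjoint tasks.

    labels: array of integer class labels for the whole dataset.
    tasks:  list of class-id lists, e.g. [[0..4], [5..9]] for Split CIFAR-10.
    Returns one array of example indices per task.
    """
    labels = np.asarray(labels)
    return [np.flatnonzero(np.isin(labels, cls)) for cls in tasks]

# Split CIFAR-10 style: two sequential tasks of five classes each
# (synthetic labels stand in for the real dataset here).
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
task_a, task_b = split_by_classes(labels, [list(range(5)), list(range(5, 10))])
```

The same helper covers the CIFAR-100 settings by passing ten lists of ten classes, or two lists of fifty.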

In our experiments with language models, we use the IMDb Reviews and English Wikipedia datasets for the generative task of next-token prediction.

About the model

The models used to investigate catastrophic forgetting are Vision Transformer (ViT) and ResNet; their parameter counts and other settings are as follows.

About the training setup

When pre-training each model, we use the ImageNet-21k dataset, which consists of about 21,000 categories and about 14 million images, with the Adam optimizer. During fine-tuning, we use SGD with momentum 0.9. For the first task, we employ a linear warm-up followed by a cosine decay schedule, and thereafter we use a fixed learning rate.
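The first-task learning-rate schedule described above (linear warm-up, then cosine decay) can be sketched like this; the specific step counts and base learning rate are assumptions for illustration, not the paper's hyperparameters.

```python
import math

def lr_schedule(step, total_steps, base_lr, warmup_steps):
    """Linear warm-up to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

peak = lr_schedule(99, 1000, 0.1, 100)   # end of warm-up: full base_lr
end = lr_schedule(999, 1000, 0.1, 100)   # near the end: decayed toward zero
```

For subsequent tasks, the paper simply keeps the learning rate fixed instead of rescheduling.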

Experimental results

Model scale and catastrophic forgetting

First, we investigate the relationship between the size of the pre-trained model and catastrophic forgetting. The results are as follows.

In this figure, the performance of the first task (Task A) and the second task (Task B) of Split CIFAR-10 are plotted for models of different sizes.

As shown in the figure, for both Vision Transformer and ResNet, the performance degradation tends to be smaller for models with more parameters. For example, ViT-xS with 5.7M parameters lost about 6.5% accuracy, while ViT-B with 86.7M parameters lost less than 0.5%. For Split CIFAR-100 (10 classes × 10 tasks and 50 classes × 2 tasks), the following results are obtained.

The left figure shows per-task accuracy when the ViT-B model was trained sequentially on 10 tasks. The average accuracy loss per task was 1.8%, and the maximum was 2.9%. The right figure shows the accuracy loss of each ViT on CIFAR-100 (2 tasks); again, the larger the model, the smaller the loss.
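The accuracy-loss numbers quoted above follow the usual continual-learning bookkeeping: compare each task's accuracy right after it was learned with its accuracy once the whole sequence is finished. A minimal sketch (the function name and the example accuracies are hypothetical, not the paper's numbers):

```python
def forgetting_stats(acc_after_learning, acc_final):
    """Average and maximum per-task accuracy drop.

    acc_after_learning: accuracy on each task right after it was learned.
    acc_final:          accuracy on each task after the full task sequence.
    """
    drops = [a - b for a, b in zip(acc_after_learning, acc_final)]
    return sum(drops) / len(drops), max(drops)

# Illustrative accuracies for three sequential tasks.
avg_drop, max_drop = forgetting_stats([0.92, 0.91, 0.93], [0.89, 0.90, 0.91])
```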

The results for the input-distribution-shift setting are as follows.

In general, the larger the scale of the pre-trained model, the higher its resistance to catastrophic forgetting.

Fine-tuning datasets and catastrophic forgetting

Next, we investigate whether the relationship between model scale and resistance to catastrophic forgetting described above holds across different fine-tuning datasets. The results are as follows.

This figure shows the relationship between the number of parameters and the performance of ViT and ResNet when training two tasks on non-CIFAR datasets.

More detailed results for each task are shown below.

Although the distribution of the plots differed slightly from task to task, the trend of smaller catastrophic forgetting at larger model scales held for all tasks.

Comparing pre-trained models with models trained from scratch

Next, we test whether the relationship between catastrophic forgetting and model scale observed in the previous experiments also holds for models trained from scratch rather than pre-trained models. Specifically, we compare the pre-trained model against the model trained from scratch, handicapping the pre-trained model so that the two reach the same accuracy on the new task before forgetting is measured.
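The handicapping above amounts to matching on new-task accuracy: among the pre-trained model's fine-tuning checkpoints, pick the one whose Task B accuracy is closest to the scratch model's, then compare forgetting at that matched point. A sketch of this matching step, assuming a hypothetical list-of-dicts checkpoint format:

```python
def matched_checkpoint(checkpoints, scratch_task_b_acc):
    """Return the checkpoint whose Task B accuracy is closest to the
    scratch model's, so forgetting is compared at equal Task B performance."""
    return min(checkpoints,
               key=lambda c: abs(c["task_b_acc"] - scratch_task_b_acc))

# Illustrative checkpoints from fine-tuning a pre-trained model on Task B.
ckpts = [
    {"step": 100, "task_b_acc": 0.70, "task_a_acc": 0.88},
    {"step": 200, "task_b_acc": 0.80, "task_a_acc": 0.85},
    {"step": 300, "task_b_acc": 0.90, "task_a_acc": 0.82},
]
best = matched_checkpoint(ckpts, scratch_task_b_acc=0.79)
```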

The results are as follows.

As shown in the figure, the catastrophic forgetting is larger for the model trained from scratch (gray circles) compared to the results for the pre-trained model (colored circles).

Furthermore, this degree of catastrophic forgetting is similar regardless of model scale. This indicates that the increased resistance to catastrophic forgetting at larger model scales is a property specific to pre-trained models.

Note that this experiment focuses only on ResNet, since ViT models need extensive pre-training to perform adequately on image classification.

Pre-training time, dataset size, fine-tuning time, and catastrophic forgetting

The above experiments show that pre-training increases resistance to catastrophic forgetting. We now investigate how the pre-training time, the pre-training dataset size, and the number of fine-tuning steps relate to catastrophic forgetting.

To begin, the relationship between the number of pre-training steps and catastrophic forgetting was as follows.

In the right panel, points are plotted darker for larger numbers of pre-training steps. The results show that the longer the pre-training, the better the downstream-task performance and the greater the resistance to catastrophic forgetting.

Next, the results of changing the dataset size during pre-training and the number of steps during fine-tuning are shown below.

In the left figure, we show the results when the pre-training dataset size is varied from the full dataset down to 1/16 of it. The degree of catastrophic forgetting does not change significantly as the dataset size decreases; even at 1/16 of the data, the performance degradation is limited to about 3%.

This suggests that the pre-training dataset size is not so important for suppressing catastrophic forgetting. The center and right figures show what happens when the number of fine-tuning steps is varied; in general, increasing the number of fine-tuning steps does not increase catastrophic forgetting.

Duplication of representation and catastrophic forgetting

Why, then, do large pre-trained models tend to resist catastrophic forgetting? Here, the paper introduces trace overlap as a measure of the similarity between the model's representations on Task A and Task B (details are omitted here). First, we check how the similarity between class representations, measured by trace overlap, differs between the pre-trained model and the model trained from scratch (ResNet). The results are as follows.

As shown in the figure, the similarity between class representations is high when training from scratch, while significantly lower values are obtained for the pre-trained model. This suggests that the pre-trained model stores representations of different classes with low similarity (low overlap) to each other.

In addition, the average overlap between class representations as a function of model scale is as follows.

In this figure, the overlap decreases as the model scale increases for the pre-trained model, while there is no such decreasing trend for the model trained from scratch. This suggests that pre-trained models reduce the overlap between classes more as the model scale increases.
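Since the article omits the definition, here is one plausible form of a subspace-overlap measure of this kind, assuming projectors onto the top principal directions of each task's representations. This is a sketch of the general idea, not the paper's exact trace-overlap definition.

```python
import numpy as np

def trace_overlap(feats_a, feats_b, k=5):
    """Overlap Tr(P_A P_B) / k between the top-k principal subspaces of two
    sets of representations. Values near 1 mean highly similar subspaces;
    values near 0 mean nearly orthogonal ones.

    feats_*: (n_samples, dim) arrays of hidden-layer activations.
    """
    def top_projector(x):
        # Orthonormal basis of the top-k right singular subspace of the
        # centered feature matrix, then the projector onto it.
        _, _, vt = np.linalg.svd(x - x.mean(axis=0), full_matrices=False)
        v = vt[:k].T
        return v @ v.T

    pa, pb = top_projector(feats_a), top_projector(feats_b)
    return float(np.trace(pa @ pb)) / k

# Synthetic features living in disjoint coordinate subspaces.
rng = np.random.default_rng(0)
feats_a = np.zeros((200, 10)); feats_a[:, :5] = rng.normal(size=(200, 5))
feats_b = np.zeros((200, 10)); feats_b[:, 5:] = rng.normal(size=(200, 5))
same = trace_overlap(feats_a, feats_a)   # identical representations
cross = trace_overlap(feats_a, feats_b)  # disjoint subspaces
```

Under this reading, the figure's result says pre-trained models place different classes (and tasks) in low-overlap subspaces, so learning Task B disturbs Task A's representations less.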

Catastrophic Forgetting in Language Models

Finally, we examined the trend of catastrophic forgetting for the language model instead of the image classification task, and the results are as follows.

Experimental results show that larger model scales improve performance on both tasks while suppressing catastrophic forgetting. However, the distribution of performance differs significantly from the image classification case, suggesting that scaling behavior may differ between image classification and language modeling tasks.


Summary

In the paper presented in this article, extensive experiments showed that increasing the scale of pre-trained models suppresses catastrophic forgetting.

We also found that pre-trained models differ from models trained from scratch in the similarity (overlap) between class representations, and that this overlap decreases as the model scale increases.

Although many of the experiments were conducted in a setting where two tasks were learned in succession, the appendix of the original paper reports more detailed results for those interested.
