How Does Pruning Of The ImageNet Pre-training Model Work In Downstream Tasks?

Pruning 09/09/2022

3 main points
✔️ Investigate the transfer learning performance of pruned ImageNet pre-trained models
✔️ Analyze pruning methods such as gradual sparsification, regularization, and LTH
✔️ Demonstrate that different pruning methods behave differently in transfer learning

How Well Do Sparse ImageNet Models Transfer?
written by Eugenia Iofinova, Alexandra Peste, Mark Kurtz, Dan Alistarh
(Submitted on 26 Nov 2021 (v1), last revised 21 Apr 2022 (this version, v5))
Comments: CVPR2022.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.

introduction

Deep learning models are computationally expensive, and model compression techniques that mitigate this cost are a rapidly developing area.

The paper presented in this article investigates how pruning, which sets some of a model's weights to zero, behaves in the standard transfer learning setting, where a convolutional neural network (CNN) trained on ImageNet is adapted to a downstream task.

The results show that sparse models can match or even exceed the transfer learning performance of dense models, that inference and training can be significantly accelerated, and that different pruning methods behave differently.

experimental setup

The experiments investigate transfer learning performance when various model compression techniques are applied. The experimental setup is as follows.

Setting up transfer learning

For transfer learning, we consider two settings: full fine-tuning and linear fine-tuning. Full fine-tuning optimizes the entire set of features on the downstream dataset, while linear fine-tuning optimizes only the linear classifier in the final layer.

In full fine-tuning, only the non-zero weights of the original model are optimized (the final layer is the exception), and the sparsity mask is kept fixed. Training from scratch on the downstream task is not considered, because it generally performs worse than transfer learning. Pruning the model on the downstream task itself is also not considered.
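The fixed-mask rule for full fine-tuning can be sketched as follows. This is a minimal illustration in plain Python, not the paper's training code: the function name `masked_sgd_step`, the learning rate, and the toy weights and gradients are all assumptions made for the example.

```python
def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Apply one SGD step only to weights whose mask entry is 1.

    Pruned positions (mask == 0) stay exactly zero, so the sparsity
    pattern inherited from the upstream model never changes.
    """
    return [
        (w - lr * g) if m else 0.0
        for w, g, m in zip(weights, grads, mask)
    ]


# Hypothetical pruned layer: zeros mark weights removed upstream.
weights = [0.5, 0.0, -0.3, 0.0]
# The mask is derived once and then kept fixed for all of fine-tuning.
mask = [1 if w != 0.0 else 0 for w in weights]
# Made-up gradients; note that pruned positions also receive gradients.
grads = [0.2, 0.7, -0.1, 0.4]

updated = masked_sgd_step(weights, grads, mask)
```

In linear fine-tuning, by contrast, none of these backbone weights would be updated at all; only the final classifier would be trained.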

network architecture

In our experiments, we mainly use ResNet50 to analyze the transfer learning of sparse models.

Sparsification Methods

The pruning methods considered in the experiments fall into three main categories:

Progressive sparsification: start with a high-accuracy baseline model and gradually remove weights over several steps separated by fine-tuning periods. This article also refers to these as gradual or incremental sparsification methods.
Regularization methods: apply mechanisms during model training that encourage sparsity.
Lottery Ticket Hypothesis (LTH) methods: start with a fully trained model, obtain a sparse weight mask in one or more incremental steps, and restrict retraining to that mask.
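The first category, pruning in several steps with fine-tuning in between, can be sketched roughly as below. The magnitude criterion, the linear sparsity schedule, and the names `magnitude_mask` and `progressive_prune` are illustrative assumptions; they are not the schedule or criterion of any specific method evaluated in the paper.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that removes the smallest-magnitude weights."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def progressive_prune(weights, target_sparsity, steps, finetune=None):
    """Raise sparsity linearly over `steps` rounds (steps >= 1).

    `finetune` is an optional callback (weights, mask) -> weights that
    stands in for the fine-tuning period between pruning rounds.
    """
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps
        mask = magnitude_mask(weights, sparsity)
        weights = [w if m else 0.0 for w, m in zip(weights, mask)]
        if finetune is not None:
            weights = finetune(weights, mask)
    return weights, mask


# Hypothetical toy layer with ten weights.
weights = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.2, 0.6, -0.8, 0.15]
pruned, mask = progressive_prune(weights, target_sparsity=0.8, steps=2)
```

Because already-pruned weights have zero magnitude, they are selected again in every later round, so sparsity only grows as long as the fine-tuning callback leaves pruned positions at zero.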

Within these three categories, we specifically use the following techniques:

Progressive sparsification techniques: GMP and WoodFisher

Regularization techniques: AC/DC, STR, and RigL

LTH method: LTH-for-Transfer (LTH-T)

About Downstream Tasks

The downstream tasks used for transfer learning are as follows

As noted in the table, the performance metric for each task is either Top-1 accuracy or mean per-class validation accuracy.
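The two metrics can give different numbers when classes are imbalanced. The following sketch, with hypothetical function names and made-up labels, shows the difference.

```python
from collections import defaultdict


def top1_accuracy(labels, preds):
    """Fraction of all samples that are classified correctly."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)


def mean_per_class_accuracy(labels, preds):
    """Average of the per-class accuracies, each class weighted equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for y, p in zip(labels, preds):
        total[y] += 1
        if y == p:
            correct[y] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Made-up example: class 0 has four samples, class 1 has two.
labels = [0, 0, 0, 0, 1, 1]
preds = [0, 0, 0, 0, 1, 0]

top1 = top1_accuracy(labels, preds)              # 5 of 6 samples correct
per_class = mean_per_class_accuracy(labels, preds)  # (1.0 + 0.5) / 2
```

Top-1 accuracy is dominated by the larger class, while the per-class mean penalizes the error on the smaller class more heavily.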

For each downstream task and model, we also compute the relative increase in error over the dense baseline and the speedup of each method, and use both as performance measures.
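One natural reading of these two measures is sketched below. The exact formulas are an assumption (error taken as one minus accuracy, speedup as a ratio of run times), and the function names and numbers are hypothetical.

```python
def relative_error_increase(dense_acc, sparse_acc):
    """Relative change in error of the sparse model versus the dense one.

    Accuracies are fractions in [0, 1]; error is 1 - accuracy.
    A negative value means the sparse model has lower error.
    """
    dense_err = 1.0 - dense_acc
    sparse_err = 1.0 - sparse_acc
    return (sparse_err - dense_err) / dense_err


def speedup(dense_time, sparse_time):
    """How many times faster the sparse model is than the dense one."""
    return dense_time / sparse_time


# Made-up numbers for illustration only.
rel = relative_error_increase(dense_acc=0.80, sparse_acc=0.78)
factor = speedup(dense_time=3.0, sparse_time=1.5)
```

Normalizing by the dense error makes tasks with very different baseline accuracies comparable on one axis.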

experimental results

Validation accuracy in ImageNet

First, we check the ImageNet validation accuracy of the models sparsified by pruning. The results are as follows.

Results vary with the sparsity level and the version of the validation set, but in general WoodFisher and RigL ERK 5x show particularly good results.

linear fine-tuning

Next, we investigate the performance of the various pruning methods when only the linear classifier is fine-tuned on the downstream task. The results are as follows.

Note that the LTH-T method is designed for full fine-tuning and is therefore excluded from this analysis. The figure shows the results at 80% sparsity (see the Linear Finetuning section).

In general, the choice of pruning method in the upstream task makes a significant difference to downstream task performance.

The difference is more pronounced for specialized downstream tasks with fine-grained classes. On Aircraft, for example, there is a 15% gap in Top-1 accuracy between the best-performing methods, AC/DC and RigL ERK 5x, and the worst-performing one, WoodFisher.

Based on these results, the difference in Top-1 accuracy between the dense backbone and the pruned model is used as a measure of the difficulty of each downstream task. Viewed against this measure, each method behaves as follows.

The figure shows that the regularization methods (AC/DC, STR, and RigL) tend to outperform the dense-backbone baseline as task difficulty increases. The effect is more pronounced at 90% sparsity.

On the other hand, the gradual sparsification methods (GMP, WoodFisher) do not exhibit this behavior. This suggests that regularization-based pruning methods are more suitable for linear fine-tuning when the downstream task is specialized or difficult.

It is also notable that sparsity does not correlate well with downstream task performance. For example, AC/DC and RigL models at 80% and 90% sparsity differ by 1~2% in ImageNet accuracy, yet their error relative to the dense baseline remains flat. Extreme sparsity (98%), however, tends to degrade performance.

In summary, linear fine-tuning gave the following findings:

Some sparsification methods consistently match dense models, while others sometimes outperform them.
For regularization methods, transfer learning performance correlates with downstream task difficulty.
High sparsity is not necessarily detrimental to transfer performance.

full fine-tuning

Next, the results for fine-tuning the entire model are shown below.

As in the case of linear fine-tuning, considerable performance differences were observed between the pruning methods.

First, quality consistently decreases as sparsity increases. In addition, the gradual sparsification methods (WoodFisher, GMP) tend to show better transfer learning performance than the other methods, which is the opposite of the linear fine-tuning case. In particular, at 80% and 90% sparsity, their downstream task performance was almost as good as that of the dense models.

Furthermore, looking at the downstream task performance, WoodFisher and GMP consistently showed the top performance, while the performance of the other methods tended to vary significantly from task to task.

In general, incremental sparsification methods appear to be a good choice for full fine-tuning on downstream tasks. These techniques perform comparably to dense backbones at 80% or 90% sparsity.

Further Considerations

It is an interesting result that the best pruning method differs depending on whether linear or full fine-tuning is chosen. To investigate this result, we measured the fraction of convolutional filters in the ResNet50 backbone that are fully pruned. (See Appendix E of the original paper.)

The results show that AC/DC has on average 2~4 more fully pruned channels than the other methods. This may leave fewer features trainable during full fine-tuning.

On the other hand, the sparsity produced by GMP and WoodFisher is unstructured, which may leave more features available to be expressed during full fine-tuning.
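The measurement of fully pruned filters can be illustrated with a small sketch. The data layout (one flat list of weights per filter), the function name `fully_pruned_fraction`, and the two toy layers are assumptions for illustration, not the paper's measurement code.

```python
def fully_pruned_fraction(conv_weights):
    """Fraction of filters whose weights are all exactly zero.

    `conv_weights` is a list of filters, each given as a flat list of
    weights (one entry per input-channel/kernel position).
    """
    pruned = sum(1 for f in conv_weights if all(w == 0.0 for w in f))
    return pruned / len(conv_weights)


# Two hypothetical layers, both with 8 of 16 weights set to zero.
# Layer A: zeros concentrated in whole filters.
layer_a = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, -0.2, 0.1, 0.5],
    [0.4, 0.6, -0.1, 0.2],
]
# Layer B: zeros scattered, so every filter keeps some weights.
layer_b = [
    [0.3, 0.0, 0.1, 0.0],
    [0.0, -0.2, 0.0, 0.5],
    [0.4, 0.0, 0.0, 0.2],
    [0.0, 0.6, -0.1, 0.0],
]

frac_a = fully_pruned_fraction(layer_a)
frac_b = fully_pruned_fraction(layer_b)
```

Both toy layers have the same overall sparsity, but only the first loses entire filters, which is the kind of difference this statistic is meant to expose.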

In linear fine-tuning, by contrast, the ordering is reversed: regularization methods appear to produce more robust features than gradual sparsification methods.

Training Speedup in Linear Fine-tuning

In linear fine-tuning, the sparse backbone is fixed, which leads to faster training. To quantify this effect, we examine the relationship between the improvement in training time and the change in test accuracy.

In general, training could be sped up by a factor of 2~3 without adversely affecting accuracy.

additional experiments

In the appendix of the paper, experiments are also conducted on ResNet18, ResNet34, and MobileNet, as well as on YOLOv3 and YOLOv5 to measure object detection performance. The results support the analyses above.

In full fine-tuning, structured sparse models also tended to show lower transfer learning performance than models pruned with unstructured methods.

summary

An extensive analysis of transfer learning performance for ImageNet pre-trained models sparsified with pruning methods shows that performance differs depending on the downstream task and the fine-tuning setting, even when ImageNet classification accuracy is comparable.

In particular, regularization-based methods tend to perform best in linear fine-tuning, while incremental sparsification methods perform best in full fine-tuning.

These analyses have limitations: they consider only pruning among model compression techniques, accuracy is the only metric of transfer learning performance, and more complex transfer learning scenarios involving domain shift are not covered. These points await further investigation.