What Are The Conditions For Effective Teachers In Knowledge Distillation?

Knowledge Distillation 27/09/2022

3 main points
✔️ Explore effective ways to achieve successful knowledge distillation
✔️ Identified the importance of consistent and patient teachers
✔️ ResNet-50 model achieves ImageNet 82.8% Top-1 accuracy

Knowledge distillation: A good teacher is patient and consistent
written by Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, Alexander Kolesnikov
(Submitted on 9 Jun 2021 (v1), last revised 21 Jun 2022 (this version, v2))
Comments: CVPR2022.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Very large models have shown state-of-the-art performance in computer vision tasks such as image classification, object detection, and semantic segmentation.

However, due to their high computational cost, smaller models such as ResNet-50 and MobileNet are more commonly used than high-performance large-scale models.

The paper presented in this article addresses this problem by identifying more effective ways to compress large models into smaller ones through knowledge distillation.

The results show that feeding the teacher and student models identical inputs and training for much longer with aggressive augmentation yields excellent results on various vision datasets, most notably a ResNet-50 model that reaches 82.8% Top-1 accuracy on ImageNet.

Experimental setup

In our experiments, we compress large visual models (teacher models) that show high accuracy on a particular task into smaller models (student models) with minimal performance loss.

The following five datasets are used in the experiment.

flowers102
pets
food101
sun397
ILSVRC-2012 (ImageNet)

These are diverse image classification tasks, ranging from 37 to 1,000 classes and from 1,020 to 1,281,167 training images.

Classification accuracy is used as the evaluation metric.

Teacher/Student Model

As teacher models, we use models from BiT (a large collection of ResNet models pre-trained on ILSVRC-2012 and ImageNet-21k). One major difference from the standard ResNet is that BiT uses Group Normalization layers and Weight Standardization instead of Batch Normalization.

We use a variant of BiT-ResNet-50 as the student model (henceforth referred to as ResNet-50 for simplicity).
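To make the two normalization techniques mentioned above concrete, here is a minimal pure-Python sketch. It is not the BiT implementation: the function names are hypothetical, the learned scale and shift parameters are omitted, and tensors are represented as plain nested lists for readability.

```python
import math


def standardize_weights(filter_weights, eps=1e-5):
    """Weight Standardization for one output filter.

    The flattened weights of the filter are rescaled to zero mean and
    (approximately) unit variance before they are used in the convolution.
    """
    n = len(filter_weights)
    mean = sum(filter_weights) / n
    var = sum((w - mean) ** 2 for w in filter_weights) / n
    return [(w - mean) / math.sqrt(var + eps) for w in filter_weights]


def group_norm(channels, num_groups, eps=1e-5):
    """Group Normalization for a single example.

    `channels` is a list of per-channel activation lists. Statistics are
    computed over each group of channels, so they never depend on the
    other examples in the batch (unlike Batch Normalization).
    """
    assert len(channels) % num_groups == 0
    size = len(channels) // num_groups
    out = []
    for g in range(num_groups):
        group = channels[g * size:(g + 1) * size]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in ch] for ch in group])
    return out
```

Because neither operation uses batch statistics, the output for one image does not depend on which other images share its batch.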

Distillation loss

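We use the KL-divergence between the predicted class probability vectors $p_t$ and $p_s$ of the teacher and student models as the distillation loss:

$$KL(p_t \| p_s) = \sum_{i \in C} \left[ -p_{t,i} \log p_{s,i} + p_{t,i} \log p_{t,i} \right]$$

Here $C$ is the set of classes. We also use a temperature parameter $T$ that rescales both probability vectors before they enter the loss: $p_s \propto \exp(\log p_s / T)$ and $p_t \propto \exp(\log p_t / T)$.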
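The KL-divergence loss with temperature described above can be sketched in a few lines of pure Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and the temperature is applied by dividing the logits by $T$, which is equivalent to $p \propto \exp(\log p / T)$ because the logits differ from $\log p$ only by a constant.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p_t, p_s, eps=1e-12):
    """KL(p_t || p_s), summed over the class set."""
    return sum(
        pt * (math.log(pt + eps) - math.log(ps + eps))
        for pt, ps in zip(p_t, p_s)
    )


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Distillation loss between temperature-scaled teacher and student."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return kl_divergence(p_t, p_s)
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise. A larger temperature flattens both distributions.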

Experimental results

In the paper, knowledge distillation is interpreted as the task of matching the functions implemented by the teacher and the student models.

Based on this interpretation, the paper arrives at the following two principles of knowledge distillation for model compression:

The teacher and the student should process exactly the same input views, i.e. the same crop and augmentation ("consistent teacher").
To improve generalization, aggressive augmentation should be combined with training over many epochs ("patient teacher").

The importance of a consistent teacher

First, to test the "consistent teacher" hypothesis, we consider the following four options for knowledge distillation.

Fixed teacher: the teacher's prediction for each image is fixed.
- fix/rs: both the teacher and the student receive the image resized to 224x224.
- fix/cc: the teacher receives a center crop and the student receives a random crop.
- fix/ic_ens: the teacher's prediction is the average of its predictions over 1k different inception crops of the image, while the student sees a random crop of the image.
Independent noise: the teacher and the student receive different inputs.
- ind/rc: independent random crops are applied for the teacher and the student.
- ind/ic: independent inception crops are applied for the teacher and the student.
Consistent teaching: the teacher and the student receive the same input.
- same/rc: the same random crop is given to the teacher and the student.
- same/ic: the same inception crop is given to the teacher and the student.
Function matching: the teacher and the student receive identical, additionally augmented inputs.
- same/rc,mix and same/ic,mix: an extension of consistent teaching that first applies mixup to the images to increase input diversity and then gives the same mixed input to both the teacher and the student.
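The difference between the "independent noise" and "consistent teaching" settings comes down to whether the crop box is sampled once or twice. The following sketch illustrates this with a plain random crop on an image stored as a nested list. The function names are hypothetical, and the inception crop (which also varies scale and aspect ratio) is not modeled here.

```python
import random


def sample_crop_box(height, width, crop, rng):
    """Sample the top-left corner of a crop x crop box inside the image."""
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    return top, left


def crop_image(image, box, crop):
    """Cut a crop x crop window out of a 2D image (list of rows)."""
    top, left = box
    return [row[left:left + crop] for row in image[top:top + crop]]


def make_views(image, crop, rng, consistent=True):
    """Return (teacher_view, student_view) for one training image.

    consistent=True  -> both models see the same crop (same/rc).
    consistent=False -> each model gets its own crop (ind/rc).
    """
    height, width = len(image), len(image[0])
    teacher_box = sample_crop_box(height, width, crop, rng)
    if consistent:
        student_box = teacher_box
    else:
        student_box = sample_crop_box(height, width, crop, rng)
    return crop_image(image, teacher_box, crop), crop_image(image, student_box, crop)
```

In the consistent setting the teacher's soft label always describes exactly the pixels the student sees. In the independent setting the label may describe a different region of the image.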

For these settings, the learning curves for 10,000 epochs of training on Flowers102 are as follows.

As shown in the figure, the consistent teacher (same/rc, same/ic) setting shows better results, indicating the importance of consistent input in the teacher-student model. A comparison of the train/val curves also shows that overfitting occurs when the teacher's predictions are fixed (fix, black line).

The importance of a patient teacher

In normal supervised learning, aggressive image augmentation risks distorting the image so strongly that it no longer matches its label.

However, if we interpret knowledge distillation as matching the functions of the teacher and student models and give both models identical inputs, we can augment images aggressively: even a highly distorted input is still useful for function matching, because the teacher's prediction describes exactly the input the student sees.
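As a rough illustration of this idea, the sketch below builds mixup inputs and asks the teacher for targets on the mixed images themselves, so no ground-truth labels are involved. It is a simplified sketch under several assumptions: `teacher` is a hypothetical callable mapping an image to a prediction, a single mixing coefficient is drawn per batch, the `alpha` default is arbitrary, and images are nested lists of pixel values.

```python
import random


def mixup(image_a, image_b, lam):
    """Pixel-wise convex combination of two equally sized images."""
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


def function_matching_targets(images, teacher, rng, alpha=1.0):
    """Return (mixed_images, teacher_targets) for one batch.

    Each image is mixed with a randomly chosen partner from the batch.
    The teacher is evaluated on the mixed image, and the student is then
    trained on the same mixed image against that prediction.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in (0, 1)
    partners = images[:]
    rng.shuffle(partners)
    mixed = [mixup(a, b, lam) for a, b in zip(images, partners)]
    targets = [teacher(x) for x in mixed]
    return mixed, targets
```

However unnatural the mixed image looks, the target stays valid by construction, because it is the teacher's output on that same image.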

Based on this idea, we examine very long training schedules (patient teacher), relying on aggressive image augmentation to avoid overfitting. The results are as follows.

This figure shows the test accuracy when training with a different number of epochs for each dataset.

As shown in the figure, after training for a very large number of epochs, the student model finally reaches the performance of the teacher model (red line). It is also worth noting that no overfitting occurs even after training for as many as 1M epochs. Compared with training from scratch or transfer learning, distillation gives inferior results when the number of epochs is small but eventually outperforms both.

Scaling up to ImageNet

Since the above experiments were performed on relatively small datasets, we perform the same experiments on the much larger ImageNet dataset. The results are as follows.

Similar to the above experiments, we see that the consistent teaching setting does not overfit and that performance improves with increasing training time.

The function matching setting, which applies augmentation aggressively, underfits when the number of epochs is small but performs better as training time grows.

Finally, the ResNet-50 student model achieved a Top-1 accuracy of 82.31% on ImageNet.

Knowledge distillation at different resolutions

In previous experiments, the teacher and student received the same resolution (224x224) input. However, by decreasing the resolution in the student model, we may be able to achieve faster processing.

The table below therefore reports the number of epochs and the Top-1 accuracy when the input resolution of the student model is smaller than that of the teacher model.

The table shows that knowledge distillation works well even when the student and teacher models use different resolutions.

It also shows that distilling from a higher-resolution, more accurate teacher model (T384→S224) yields better performance even when the student model itself still uses 224x224 resolution.

Improvement of learning efficiency by second-order preconditioner

The function matching setting, which applies augmentation aggressively, reaches higher final performance but requires more training time.

We now test whether a more powerful optimizer (Shampoo instead of Adam) can reduce this increase in training time. The results are as follows.

As shown in the figure, using Shampoo instead of Adam made training about 4 times faster.

On the use of pre-trained models

Motivated by the success of transfer learning, we also initialized the student model from a pre-trained model. The results are as follows.

When the learning time is short, initialization with pre-trained models shows good results. However, when the training time is longer, learning from scratch eventually achieves better performance.

Knowledge distillation in different model families

Given that knowledge distillation is successful across different resolutions, we also examine knowledge distillation across different model families.

First, when the student model was changed to MobileNet v3 (Large), it achieved a Top-1 accuracy of 74.60% after 300 epochs and 76.31% after 1200 epochs. Second, when the student model was ResNet-50 and the teacher was an ensemble (the default 224x224 teacher and the 384x384 teacher, with their logits averaged), a Top-1 accuracy of 82.82% was achieved after 9600 epochs.
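The ensemble teacher described above combines its members by averaging logits. A minimal sketch of that combination step, assuming each teacher has already produced a logit vector for the same image (the function names are hypothetical):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_teacher_probs(logits_per_teacher):
    """Average the logits of several teachers class by class, then softmax.

    `logits_per_teacher` holds one logit vector per teacher, for example
    one from a 224x224 teacher and one from a 384x384 teacher.
    """
    num_teachers = len(logits_per_teacher)
    averaged = [sum(col) / num_teachers for col in zip(*logits_per_teacher)]
    return softmax(averaged)
```

The resulting probability vector then plays the role of $p_t$ in the distillation loss, exactly as a single teacher's prediction would.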

In general, we found that knowledge distillation is successful even if the teacher and student models are different model architectures or the teacher model is in an ensemble setting.

Comparison with existing methods

The best results from these experiments and a comparison with existing ResNet-50 models are shown below.

In general, the knowledge distillation setup presented in the paper outperforms existing state-of-the-art results.

Knowledge distillation in "out-of-domain" data

If we view knowledge distillation as function matching, we expect knowledge distillation to be effective for arbitrary image inputs.

To examine this hypothesis, we perform experiments on the pets and sun397 datasets. Specifically, we perform knowledge distillation using food101 and ImageNet images (out-of-domain) and compare it to knowledge distillation using pets and sun397 images (in-domain). The results are as follows.

In general, the in-domain knowledge distillation performed best, but the out-of-domain images also showed that knowledge distillation works to some extent.

We also found that when the domains are related or overlapping (pets and ImageNet, sun397 and ImageNet), we can achieve performance close to in-domain distillation, although a longer training time is required.

Comparison with no distillation loss

Finally, to confirm that these experimental results are not due to the peculiar learning setup (aggressive mixup augmentation and long learning time), we compared our results with those obtained by eliminating the distillation loss and performing normal supervised learning.

As shown in the figure, when aggressive mixup augmentation and long training were used without knowledge distillation, performance degraded and overfitting occurred. This shows that aggressive mixup augmentation and long training work well only in combination with knowledge distillation.
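The two training signals compared here differ only in the target distribution. The sketch below is an illustration with hypothetical function names, not the authors' code: the supervised baseline trains against a mixed one-hot label, while distillation trains against the teacher's prediction for the mixed image.

```python
import math


def cross_entropy(target_probs, student_probs, eps=1e-12):
    """Cross-entropy between a target distribution and the student."""
    return -sum(t * math.log(s + eps) for t, s in zip(target_probs, student_probs))


def mixed_label(num_classes, class_a, class_b, lam):
    """Supervised mixup target: lam on class_a and (1 - lam) on class_b."""
    target = [0.0] * num_classes
    target[class_a] += lam
    target[class_b] += 1.0 - lam
    return target


def supervised_mixup_loss(student_probs, class_a, class_b, lam):
    """Baseline without distillation: fit the mixed ground-truth label."""
    target = mixed_label(len(student_probs), class_a, class_b, lam)
    return cross_entropy(target, student_probs)


def distillation_mixup_loss(student_probs, teacher_probs):
    """Distillation: fit the teacher's prediction on the same mixed image.

    This equals KL(p_t || p_s) plus the teacher's entropy, which is
    constant with respect to the student.
    """
    return cross_entropy(teacher_probs, student_probs)
```

The mixed label carries information about only two classes, whereas the teacher's prediction is a full distribution over all classes for the exact input the student receives.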

Summary

This article introduced a paper that, rather than proposing a new model compression method, revisits the existing knowledge distillation process and proposes a more effective training procedure. Its results rest on a new interpretation of knowledge distillation as "function matching of teacher-student models".

The paper shows that (1) making the teacher and student inputs identical, (2) applying augmentation aggressively to increase input diversity, and (3) increasing the training time all improve knowledge distillation performance.

Based on these findings, the authors compress large models into ResNet-50 with strong results, outperforming the existing state of the art and providing a strong baseline for future research.
