# Can We Protect The Privacy Of Deep Learning Models?

3 main points
✔️ Proposes a "Nasty Teacher" that prevents a model from being replicated through knowledge distillation (KD)
✔️ Significantly reduces the performance of student models while keeping the teacher's own performance close to that of a normally trained model
✔️ Demonstrates immunity to knowledge distillation (KD-immunity) through experiments under various conditions

written by Haoyu Ma, Tianlong Chen, Ting-Kuei Hu, Chenyu You, Xiaohui Xie, Zhangyang Wang
(Submitted on 29 Sept 2020)

Subjects: knowledge distillation, avoid knowledge leaking

## Introduction

Knowledge Distillation reduces the model size while maintaining performance by mimicking a trained teacher model with a (lighter) student model.

This technique can be useful in a variety of situations, including real-world applications of models that are too large to be used in practice. However, it can also cause significant problems in some cases.

For example, if a publicly available deep learning model (even a black box) is used as a teacher model for knowledge distillation, that model (which may have been created at a high cost) can be replicated and reproduced without permission.

To prevent this misuse of knowledge distillation, the paper introduced in this article proposes Self-Undermining Knowledge Distillation, a method for creating a teacher model called a "Nasty Teacher". The teacher is trained so that a student model cannot learn effectively from it, even when knowledge distillation is used.

A model trained by this method can significantly degrade the performance of the student model when knowledge distillation is performed using it as a teacher model.

We begin with a formulation of knowledge distillation.

Consider a pre-trained teacher network $f_{\theta_T}(⋅)$ and a student network $f_{\theta_S}(⋅)$, where $\theta_T,\theta_S$ are the parameters of the network.

The goal of knowledge distillation then is to bring the output probability of $f_{\theta_S}(⋅)$ close to that of $f_{\theta_T}(⋅)$.

Let $(x_i,y_i)$ be a training sample from the dataset $X$, and let $p_{f_\theta}(x_i)$ be the logit output of $f_{\theta}(⋅)$ for $x_i$. Then $f_{\theta_S}(⋅)$ is trained with the following objective.

$$\min_{\theta_S} \sum_{(x_i,y_i)\in X} \alpha \tau_S^2 KL\big(\sigma_{\tau_S}(p_{f_{\theta_T}}(x_i)),\sigma_{\tau_S}(p_{f_{\theta_S}}(x_i))\big) + (1-\alpha)\mathcal{XE}\big(\sigma(p_{f_{\theta_S}}(x_i)),y_i\big)$$

where $KL(⋅,⋅)$ denotes the KL-divergence and $\mathcal{XE}(⋅,⋅)$ denotes the cross-entropy loss.

Roughly speaking, the term $\alpha \tau_S^2 KL(\sigma_{\tau_S}(p_{f_{\theta_T}}(x_i)),\sigma_{\tau_S}(p_{f_{\theta_S}}(x_i)))$ makes the student network learn to mimic the output of the teacher network, and the term $(1-\alpha)\mathcal{XE}(\sigma(p_{f_{\theta_S}}(x_i)),y_i)$ makes the student network learn to perform well on the task itself.

Here, $\alpha$ is a hyperparameter that controls the trade-off between imitating the teacher network and improving performance on the task.

$\sigma_{\tau_S}$ is the softmax function with temperature $\tau_S$. The larger the value of $\tau_S$ (above 1), the softer the output distribution becomes; with $\tau_S = 1$ it is identical to the ordinary softmax function.
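As a rough illustration of the loss described above, the following NumPy sketch computes the temperature-scaled softmax and the two terms of the distillation objective. The function names (`softmax_t`, `kl_div`, `cross_entropy`, `kd_loss`) and the toy logits are hypothetical and are not taken from the paper. $KL(p,q)$ is read here as $\sum_k p_k \log(p_k/q_k)$ with the teacher as the first argument, which is an assumption about the argument order.

```python
import numpy as np

def softmax_t(logits, tau=1.0):
    """Softmax with temperature tau, applied over the last axis."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    """Batch mean of KL(p || q) = sum_k p_k * log(p_k / q_k)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def cross_entropy(probs, labels, eps=1e-12):
    """Batch mean cross-entropy for integer class labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(picked, eps, 1.0))))

def kd_loss(student_logits, teacher_logits, labels, alpha=0.9, tau_s=4.0):
    """alpha * tau_s^2 * KL(teacher, student) + (1 - alpha) * XE(student, y)."""
    soft = kl_div(softmax_t(teacher_logits, tau_s),
                  softmax_t(student_logits, tau_s))
    hard = cross_entropy(softmax_t(student_logits), labels)
    return alpha * tau_s ** 2 * soft + (1.0 - alpha) * hard

# Toy batch: 2 samples, 3 classes (made-up numbers, for illustration only).
teacher = np.array([[6.0, 2.0, 0.5], [0.2, 5.0, 1.0]])
student = np.array([[2.0, 1.0, 0.1], [0.5, 1.5, 1.0]])
labels = np.array([0, 1])
print(kd_loss(student, teacher, labels))
```

Raising `tau_s` flattens both distributions before they are compared, and `alpha` shifts the weight between the imitation term and the label term.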

## Self-Undermining Knowledge Distillation

### Rationale (Nasty Teacher)

The goal of Self-Undermining Knowledge Distillation, which creates a teacher network called a Nasty Teacher, is to train a special teacher network from which a student network cannot benefit through knowledge distillation (i.e., the distilled student performs no better than it would if it were trained as usual).

Let $f_{\theta_T}(⋅)$ be the Nasty Teacher network and $f_{\theta_A}(⋅)$ be the adversary network, that is, a network that tries to distill knowledge from the teacher network.

We train the Nasty Teacher to maximize the KL-divergence between the Nasty Teacher and the adversary network. This is represented by the following objective.

$$\min_{\theta_T} \sum_{(x_i,y_i)\in X} \mathcal{XE}\big(\sigma(p_{f_{\theta_T}}(x_i)),y_i\big) - \omega \tau_A^2 KL\big(\sigma_{\tau_A}(p_{f_{\theta_T}}(x_i)),\sigma_{\tau_A}(p_{f_{\theta_A}}(x_i))\big)$$

It is very similar to the knowledge distillation objective, but note that this one describes the training of the teacher network.

Here, $\mathcal{XE}(\sigma(p_{f_{\theta_T}}(x_i)),y_i)$ is a cross-entropy loss term that aims to improve the performance of the teacher network on the task.

On the other hand, $-\omega \tau_A^2 KL(\sigma_{\tau_A}(p_{f_{\theta_T}}(x_i)),\sigma_{\tau_A}(p_{f_{\theta_A}}(x_i)))$ is a term that aims to maximize the KL-divergence between the teacher and adversary networks (since the sign is negative, the larger the KL-divergence, the smaller the overall value).

$\tau_A$ represents the temperature for the softmax function with temperature, and $\omega$ represents the trade-off between task performance and KL-divergence maximization.

Apart from the flipped sign of the KL-divergence term and the swapped roles of the networks (the teacher is the one being trained, against an adversary), the objective is very similar to that of ordinary knowledge distillation, so the idea is a very simple one.

The method makes no assumptions about the network architecture. By default, the same architecture is used for both $f_{\theta_T}$ and $f_{\theta_A}$ (the case where the architecture of $f_{\theta_A}$ is changed is examined in the ablation study).

When training the Nasty Teacher, $f_{\theta_A}$ is a pre-trained model whose weights are kept fixed, and only $f_{\theta_T}$ is updated.
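The teacher-side objective can be sketched in the same way. The NumPy snippet below is only a minimal illustration under stated assumptions: `nasty_teacher_loss` is a hypothetical helper, the logits are made up, and $KL(p,q)$ is read as $\sum_k p_k \log(p_k/q_k)$ with the teacher as the first argument. In real training, the adversary logits would come from the frozen pre-trained network and only the teacher's parameters would receive gradients.

```python
import numpy as np

def softmax_t(logits, tau=1.0):
    """Softmax with temperature tau, applied over the last axis."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nasty_teacher_loss(teacher_logits, adversary_logits, labels,
                       omega=0.004, tau_a=4.0, eps=1e-12):
    """Return (XE - omega * tau_a^2 * KL, XE, KL), each a batch mean."""
    p_t = np.clip(softmax_t(teacher_logits, tau_a), eps, 1.0)
    p_a = np.clip(softmax_t(adversary_logits, tau_a), eps, 1.0)
    kl = np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_a)), axis=-1))
    probs = softmax_t(teacher_logits)  # ordinary softmax for the task term
    picked = probs[np.arange(len(labels)), labels]
    xe = -np.mean(np.log(np.clip(picked, eps, 1.0)))
    return float(xe - omega * tau_a ** 2 * kl), float(xe), float(kl)

labels = np.array([0, 1])
# Frozen adversary outputs (made-up numbers, for illustration only).
adversary = np.array([[5.0, 1.0, 0.5], [0.5, 4.0, 1.0]])
# A teacher identical to the adversary: the KL term contributes nothing.
loss_same, xe_same, kl_same = nasty_teacher_loss(adversary.copy(), adversary, labels)
# A teacher with extra peaks: still correct on both samples, but KL > 0.
teacher_nasty = np.array([[5.0, 4.5, 0.5], [3.8, 4.0, 1.0]])
loss_nasty, xe_nasty, kl_nasty = nasty_teacher_loss(teacher_nasty, adversary, labels)
print(loss_same, loss_nasty, kl_nasty)
```

The KL term is subtracted, so a teacher whose softened outputs diverge from the adversary's is rewarded, while the cross-entropy term keeps its top-1 predictions tied to the labels.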

## Experiment

In order to verify the effectiveness of Nasty Teacher, we train the network based on the previously described equation and then verify the performance of knowledge distillation on an arbitrary student network.

### Experiment setup

We use CIFAR-10, CIFAR-100, and Tiny-ImageNet for the datasets.

#### Network

In CIFAR-10, we use ResNet-18 as the teacher network and a 5-layer CNN as the student network. We also replace the student network with ResNetC-20/ResNetC-32 to investigate the impact of changes in the student network.

In CIFAR-100 and Tiny-ImageNet, ResNet-18, ResNet-50, and ResNeXt-29 are used as teacher networks. Also, MobileNetV2, ShuffleNetV2, and ResNet-18 are used as student networks.

In addition, as a "Teacher Self" configuration, the same architecture is used for the teacher and student networks.

#### Hyperparameters

The temperature $\tau_A$ is set to 4 for CIFAR-10 and 20 for CIFAR-100 and Tiny-ImageNet (the same value as $\tau_S$ during knowledge distillation).

$\omega$ is set to 0.004 for CIFAR-10, 0.005 for CIFAR-100, and 0.01 for Tiny-ImageNet.

## Experimental results

The experimental results for CIFAR-10, CIFAR-100, and Tiny-ImageNet are shown in the tables below.

The Nasty Teacher loses at most about 2% in accuracy compared with a normally trained teacher.

Knowledge distillation from a normal teacher improved the accuracy of student networks by up to 4%, whereas knowledge distillation from the Nasty Teacher reduced student accuracy by between 1.72% and 67.57%.

We also see that weaker student networks (e.g., MobileNetV2) are degraded much more severely than stronger student networks (e.g., ResNet-18).

Even when the student network has the same architecture as the teacher network (Teacher Self), the same degradation is consistently observed.

These results suggest that distilling knowledge from the Nasty Teacher is very difficult, and that the method can prevent a model from being replicated through knowledge distillation.

### Qualitative analysis

To explore the difference between the Nasty Teacher and a normally trained network, examples of the logit responses of ResNet-18 on CIFAR-10 are shown below.

The logit responses of the normally trained ResNet-18 (blue vertical bars) almost all consist of a single peak.

In contrast, the responses of the Nasty Teacher (light yellow) contain multiple peaks.

We can intuitively assume that if we perform knowledge distillation from such a teacher network, the student network is likely to acquire incorrect knowledge.
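To make this intuition concrete, the following toy sketch compares the soft labels a student would receive from a single-peak and a multi-peak logit vector. The logit values are invented for illustration and are not measurements from the paper, and `softmax_t` is a hypothetical helper.

```python
import numpy as np

def softmax_t(logits, tau=1.0):
    """Softmax with temperature tau, applied over the last axis."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

tau = 4.0  # temperature used for CIFAR-10 in the article
# Made-up logits for one sample whose true class is 0.
normal_logits = np.array([10.0, 2.0, 1.0, 0.0, 0.0])  # single peak
nasty_logits = np.array([10.0, 9.5, 1.0, 9.0, 0.0])   # several peaks

for name, logits in [("normal", normal_logits), ("nasty", nasty_logits)]:
    soft = softmax_t(logits, tau)
    # Both teachers predict class 0, but their soft labels differ sharply.
    print(name, "predicted class:", int(np.argmax(logits)),
          "soft labels:", np.round(soft, 3))
```

Both vectors have the same top-1 class, so the teacher's own accuracy is unaffected, but the multi-peak vector puts a large share of the softened probability on unrelated classes, which is the signal a distilled student would try to imitate.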

The visualization of the feature embedding and output logit by t-SNE is shown in the following figure.

The upper part of the figure shows the visualization of the feature embedding and the lower part shows the visualization of the output logit.

There is no significant change in the inter-class distance in the feature space between normal and Nasty Teacher, indicating that Nasty Teacher behaves similarly to a normal teacher network.

On the other hand, the logit output has changed significantly. This means that the Nasty Teacher mainly changes the weights of the final fully connected layer.

## Ablation study

If we change the adversary network $f_{\theta_A}$ used when training the Nasty Teacher (i.e., we use different architectures for the teacher network and the adversary network), we get the following.

In the table, ResNet18 (ResNet18) denotes the setting used so far, in which the teacher and adversary networks share the same architecture (ResNet18). Comparing it with the other settings shows that the Nasty Teacher is effective in general.

However, it should be noted that using a weak network (e.g., the CNN) as the adversary may degrade the performance of the teacher network.

### About $\omega$

Next, the results of varying the hyperparameter $\omega$ from 0 to 0.1 are shown below.

In the figure, T represents the teacher network and S represents the student network.

It can be seen that by adjusting $\omega$, we can control the trade-off between the performance of the teacher network and the performance degradation of the student network during knowledge distillation.

### About $\tau_S$

The results of varying the temperature parameter $\tau_S$ during knowledge distillation are shown below.

In all cases, the performance of the student network is generally degraded, but we can see that the larger $\tau_S$ is, the more the performance of the student network is degraded.

### About $\alpha$

The value of $\alpha$ was set to 0.9 by default. Varying it from 0.1 to 1.0 gives the following results.

Regardless of how the value of $\alpha$ is chosen, the performance of the student network is generally degraded.

The smaller $\alpha$ is, the higher the performance of the student network. However, a smaller $\alpha$ also means that the student relies less on knowledge distillation (the minimization of the KL-divergence) from the teacher network, so it remains difficult to distill knowledge from the Nasty Teacher.
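To make the role of $\alpha$ concrete, the two extremes of the student objective can be written out. This is only a worked restatement of the distillation loss described earlier in this article, with $\mathcal{L}_S$ introduced here as a hypothetical shorthand for the per-sample student loss.

$$\alpha = 1:\quad \mathcal{L}_S = \tau_S^2 KL\big(\sigma_{\tau_S}(p_{f_{\theta_T}}(x_i)),\sigma_{\tau_S}(p_{f_{\theta_S}}(x_i))\big)$$

$$\alpha \to 0:\quad \mathcal{L}_S \to \mathcal{XE}\big(\sigma(p_{f_{\theta_S}}(x_i)),y_i\big)$$

At $\alpha = 1$ the student learns only from the teacher's softened outputs, while as $\alpha$ approaches 0 the teacher term vanishes and the student is effectively trained on the labels alone. Any recovery in student accuracy at small $\alpha$ therefore comes from ignoring the teacher rather than from distilling it.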

### About the ratio of training samples

Since the student network may not have access to all of the training data, the results obtained when varying the percentage of training samples available to the student are shown below.

Consistently in all cases, we find that the student network is adversely affected by the knowledge distillation from the Nasty Teacher compared to that from the regular teacher network.

### Data-free knowledge distillation

Some methods can perform knowledge distillation even when the dataset used to train the teacher network is inaccessible (data-free knowledge distillation). To account for this case, the Nasty Teacher is evaluated against state-of-the-art data-free knowledge distillation methods (DAFL and DeepInversion).

First, the experimental results for DAFL are as follows.

Compared with a normally trained ResNet-34 teacher, the Nasty Teacher reduces the accuracy of the student network by more than 5%.

In addition, the following shows examples of images obtained when DeepInversion is used to try to recover the training data from a trained teacher network.

Compared to the images generated from the normally trained ResNet-34, the images generated from the Nasty Teacher contain distorted noise and erroneous class features, indicating that the method may also deter the reconstruction of training data by reverse engineering.

## Summary

Knowledge distillation is a very useful technology, but it also creates the risk that published models can be replicated and reproduced. The existence of this problem leads to a potential risk of publishing and making models available.

In some cases, fears of replication through knowledge distillation may lead to many deep learning models not being published, stifling the growth of the community.

The method proposed in the paper presented in this article significantly reduces the performance of student models trained by knowledge distillation, while the teacher itself performs almost as well as a normally trained model.

This technique contributes to the "privacy protection of deep learning models" and is significant research that could become a solution to the problem described above.
