Backdoor Attack On Self-supervised Learning

Backdoor Attack 21/09/2022

3 main points
✔️ Backdoor attack against self-supervised learning methods
✔️ Validated targeted attacks that inject tainted data into specific categories
✔️ Successful backdoor attacks against MoCo, BYOL, MSF, and other SSL methods

Backdoor Attacks on Self-Supervised Learning
written by Aniruddha Saha, Ajinkya Tejankar, Soroush Abbasi Koohpayegani, Hamed Pirsiavash
(Submitted on 21 May 2021 (v1), last revised 9 Jun 2022 (this version, v3))
Comments: CVPR 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Self-supervised learning methods (e.g., MoCo, BYOL, MSF) for learning visual representations on large unlabeled data sets have had great success in recent years.

However, using large amounts of unlabeled data for training can make the system more vulnerable to backdoor attacks because of the high cost of checking that the data has not been contaminated by an attacker.

This article introduces a paper that demonstrates a successful backdoor attack against self-supervised learning. Until now, backdoor attacks have mainly been studied in the supervised setting.

Attacker's objective

First, we consider the attacker's setup for attacks against self-supervised learning (SSL) models.

The attacker's goal is to insert a backdoor into the SSL model so that, when the model is used as the backbone of a downstream classifier, the classifier makes incorrect predictions on inputs that contain a certain patch (the trigger). To make the backdoor harder to detect, the classifier should still perform as well as a clean classifier on inputs that do not contain the trigger.

SSL models can learn features nearly as good as those obtained with supervised learning, without any annotation. Recently it has become possible to train them on large datasets built by downloading public images from the web, such as Instagram-1B and Flickr image datasets.

In these cases, it is not difficult for an attacker to introduce contaminated data, because images from the web are used for SSL without scrutiny.

Knowledge and ability of the attacker

By publishing tainted images on the web, an attacker can get some of them picked up by the automatic image collection that builds the SSL training set.

The attacker has no control over the training of the SSL model and no information about the model's architecture, optimizer, or hyperparameters.

Targeted backdoor attack

The backdoor attack against an SSL model proceeds in the following four steps.

Generate tainted images: paste a trigger (image patch) on an image of a specific category and inject this into the training set. The category that contains the tainted image becomes the target category.
Self-supervised pre-training: visual features are learned on the contaminated dataset by the SSL algorithm.
Transfer learning to supervised tasks: features learned in the SSL model are used to train a linear classifier in a downstream supervised task.
During testing: if the attack is successful, the classifier in the downstream task performs well for clean images, but incorrectly predicts the target category for images containing triggers.
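
The first step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: images are represented as nested lists of pixel values, and the function names paste_trigger and poison_category are hypothetical.

```python
import random

def paste_trigger(image, trigger, rng):
    """Return a copy of `image` with `trigger` pasted at a random position.

    Both arguments are lists of rows; each row is a list of pixel values.
    """
    img_h, img_w = len(image), len(image[0])
    trg_h, trg_w = len(trigger), len(trigger[0])
    # Pick a top-left corner so the trigger lies fully inside the image.
    top = rng.randint(0, img_h - trg_h)
    left = rng.randint(0, img_w - trg_w)
    patched = [row[:] for row in image]  # copy rows so the original stays clean
    for dy in range(trg_h):
        for dx in range(trg_w):
            patched[top + dy][left + dx] = trigger[dy][dx]
    return patched

def poison_category(dataset, target_label, trigger, fraction, seed=0):
    """Paste the trigger on `fraction` of the images labelled `target_label`.

    `dataset` is a list of (image, label) pairs. The labels are only used by
    the attacker to select images; SSL pre-training itself never sees them.
    """
    rng = random.Random(seed)
    target_idx = [i for i, (_, label) in enumerate(dataset) if label == target_label]
    chosen = set(rng.sample(target_idx, round(len(target_idx) * fraction)))
    return [
        (paste_trigger(img, trigger, rng), label) if i in chosen else (img, label)
        for i, (img, label) in enumerate(dataset)
    ]
```

Images outside the target category are returned unchanged, which matches the description above: only the target category carries the trigger.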

Experimental setup

Datasets

The following datasets are used in the backdoor attack experiments against SSL models.

ImageNet-100: A random 100-class subset of ImageNet, often used as a self-supervised benchmark.
ImageNet-1k: The ImageNet dataset consisting of about 1.3 million images in 1,000 classes.

Backdoor trigger

The backdoor triggers are the publicly available triggers from HTBA (Hidden Trigger Backdoor Attacks). Each is a square trigger made from a random 4x4 RGB image that is resized to the desired size by bilinear interpolation. In the experiments, the triggers are indexed from 10 to 19, and the trigger with the same index is used when comparing different methods to improve reproducibility.
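
To make the trigger construction concrete, here is a minimal sketch that upsamples a random 4x4 RGB pattern with bilinear interpolation. It does not reproduce the actual HTBA triggers: the random seed, the function names, and the corner-aligned sampling convention are all assumptions made for illustration.

```python
import random

def random_pattern(size=4, seed=0):
    """Create a size x size grid of random RGB pixels (values 0-255)."""
    rng = random.Random(seed)
    return [[tuple(rng.randint(0, 255) for _ in range(3)) for _ in range(size)]
            for _ in range(size)]

def bilinear_resize(pattern, out_size):
    """Upsample a square RGB grid to out_size x out_size by bilinear interpolation.

    Assumption: corner pixels of input and output are aligned.
    """
    in_size = len(pattern)
    scale = (in_size - 1) / (out_size - 1) if out_size > 1 else 0.0
    out = []
    for y in range(out_size):
        fy = y * scale
        y0 = int(fy)
        y1 = min(y0 + 1, in_size - 1)
        wy = fy - y0  # vertical weight of the lower neighbour
        row = []
        for x in range(out_size):
            fx = x * scale
            x0 = int(fx)
            x1 = min(x0 + 1, in_size - 1)
            wx = fx - x0  # horizontal weight of the right neighbour
            pixel = []
            for c in range(3):
                top = pattern[y0][x0][c] * (1 - wx) + pattern[y0][x1][c] * wx
                bottom = pattern[y1][x0][c] * (1 - wx) + pattern[y1][x1][c] * wx
                pixel.append(round(top * (1 - wy) + bottom * wy))
            row.append(tuple(pixel))
        out.append(row)
    return out

# A 50x50 trigger, the size used in the ImageNet-100 experiments.
trigger = bilinear_resize(random_pattern(), 50)
```

Image libraries differ in how they align pixel centers when resizing, so a library call may give slightly different pixel values than this sketch.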

Self-supervised methods

The experiments use the following six self-supervised methods.

MoCo v2: ResNet-18 is used as the backbone.
BYOL: ResNet-18 is used as the backbone.
MSF: ResNet-18 is used as the backbone.
Jigsaw
RotNet
MAE (Masked Autoencoders): ViT-B is used as the backbone.

Evaluation of features

The SSL model is evaluated by training a linear classifier on a downstream supervised task. When training the linear classifier, no contaminated images are included in the training set.

Targeted Attacks on ImageNet-100

First, we experiment with targeted attacks on random categories of ImageNet-100.

A trigger is chosen at random from the HTBA triggers and resized to 50x50. It is pasted at a random position in the image, and half of the images in the selected category are contaminated in this way. This gives about 650 contaminated images and an injection rate of 0.5%. Note that the contaminated training set is used when training the SSL model, while 1% or 10% of the clean training set is used when training the linear classifier.
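
The injection rate follows directly from the dataset size. The short calculation below assumes round numbers (100 categories with about 1,300 images each); the helper name injection_rate is hypothetical.

```python
def injection_rate(poisoned_images, total_images):
    """Fraction of the training set that carries the trigger."""
    return poisoned_images / total_images

# Assumed round numbers: 100 categories with about 1,300 images each.
images_per_category = 1300
total_images = 100 * images_per_category
poisoned = images_per_category // 2  # half of one category
rate = injection_rate(poisoned, total_images)
print(f"{poisoned} poisoned images, injection rate {rate:.2%}")
# prints "650 poisoned images, injection rate 0.50%"
```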

We use the ImageNet-100 validation set to evaluate the linear classifiers and measure their performance with and without additional triggers.

The results of training a linear classifier with 1% of ImageNet-100 are shown below. Note that 10 experiments have been conducted with different target class-trigger pairs.

In general, MoCov2, BYOL, and MSF saw a significant increase in the number of false positives (FPs) for patched data, indicating that backdoor attacks are effective.
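
The false-positive count used here can be read as the number of validation images from other categories that the classifier assigns to the target category. A minimal sketch of that count, with a hypothetical function name and toy labels:

```python
def target_false_positives(predictions, labels, target):
    """Count images from other categories that are predicted as `target`."""
    return sum(1 for pred, label in zip(predictions, labels)
               if pred == target and label != target)

# Toy example: two non-cat images are classified as "cat".
predictions = ["cat", "dog", "cat", "cat", "bird"]
labels = ["cat", "dog", "dog", "bird", "bird"]
print(target_false_positives(predictions, labels, target="cat"))  # prints 2
```

Computing this count once on clean validation images and once on the same images with the trigger pasted in gives the comparison described above.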

On the other hand, the attack had little effect on Jigsaw and RotNet, which are not exemplar-based methods, or on MAE, which is a very recent method.

The results of training a linear classifier on 10% of ImageNet-100 are as follows.

Similarly, in this case the backdoor attack is effective for MoCo v2, BYOL, and MSF. The following is an example in which the backdoored model makes an incorrect prediction.

When the injection rate is changed

The number of false positives when the injection rate is varied from the 0.5% used in the previous experiment is shown below.

Injection rates of 1%, 0.2%, 0.1%, and 0.05% were tested. The success rate of the attack decreased as the injection rate was lowered, and at the lowest rate of 0.05% the results were close to those of the clean model.

Note that ImageNet-100 contains about 1300 images per category, but a large unlabeled dataset with a larger number of images per category may be more prone to successful targeted attacks even with a lower injection rate.

Targeted Attacks on ImageNet-1k

Next, we conduct experiments on ImageNet-1k. Because ImageNet-1k has a large number of classes, contaminating all images in a single target category corresponds to an injection rate of only 0.1%, and this is the setting used.

The results for MoCo v2 are as follows.

In addition, the paper uses the hierarchical structure of WordNet to build a feline superclass consisting of 10 subclasses and contaminates 1/10 of the images in each subclass. In this setting, 5 of the 10 classes with the highest FP counts belong to the feline superclass, which shows that a backdoor attack at the superclass level is also effective.
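
The superclass variant only changes how the images to contaminate are selected. A minimal sketch, assuming the images are grouped by class name in a dictionary; the function name and the subclass names are hypothetical.

```python
import random

def sample_superclass_poison(images_by_class, superclass, fraction, seed=0):
    """Pick `fraction` of the images from every subclass in `superclass`."""
    rng = random.Random(seed)
    chosen = {}
    for name in superclass:
        images = images_by_class[name]
        chosen[name] = rng.sample(images, round(len(images) * fraction))
    return chosen

# Toy data: 10 hypothetical subclasses with 20 image ids each.
images_by_class = {f"subclass_{i}": list(range(i * 100, i * 100 + 20))
                   for i in range(10)}
selected = sample_superclass_poison(images_by_class, list(images_by_class), 0.1)
```

With 10 subclasses and 1/10 of each contaminated, the total number of contaminated images is about the same as contaminating one whole category.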

Non-Targeted Attacks on ImageNet-100

Unlike the targeted experiments above, a non-targeted attack that contaminates a random 5% of the training images was also tested. The results are as follows.

This attack results in a 5-point drop in model accuracy, but the overall accuracy drop is smaller than that of targeted attacks.

This is likely because trigger patches exist in various categories, making it harder for the SSL model to associate a trigger with a specific category.

On defensive methods

The success of backdoor attacks against SSL models may stem from how these methods are trained: they learn to pull together the embeddings of two differently augmented views of the same image, and this may cause a trigger to become strongly associated with a particular category. This is also suggested by the fact that targeted attacks are not effective against the classical SSL methods Jigsaw and RotNet.
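
To illustrate the kind of objective involved, the sketch below computes an InfoNCE-style contrastive loss of the type used by MoCo v2 for a single pair of views. It is a simplified illustration, not the training code of any of the methods: the temperature value and the toy 2-D embeddings are assumptions, and BYOL and MSF use different objectives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.2):
    """InfoNCE loss for one query embedding.

    The loss is low when `query` is close to `positive` (the other augmented
    view of the same image) and far from the `negatives` (other images).
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

negatives = [[-1.0, 0.0], [0.0, -1.0]]
aligned = info_nce([1.0, 0.0], [1.0, 0.0], negatives)     # views agree
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], negatives)  # views disagree
```

The loss is smaller when the two views agree. One plausible reading of the paper's explanation is that a trigger appearing in both views of a contaminated image is an easy shared feature for the model to rely on when pulling the views together.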

However, the performance of classical methods is lower than that of modern methods, so it is desirable to establish some defensive methods.

As a defense against backdoor attacks, the paper performs knowledge distillation (using CompRess) on a small clean dataset to avoid the effects of the backdoor. The results are as follows.

As shown in the table, knowledge distillation on a subset (25%, 10%, and 5%) of ImageNet's clean datasets can significantly reduce the effectiveness of backdoor attacks.
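
The idea of distilling on clean data can be illustrated with a simplified similarity-based distillation loss. This is a sketch loosely modelled on the idea of having a student mimic the teacher's similarities to a set of anchor points; it is not the CompRess implementation, and the function names, the temperature, and the toy embeddings are assumptions.

```python
import math

def softmax(xs, temperature):
    """Temperature-scaled softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_distribution(embedding, anchors, temperature):
    """Distribution over anchors based on dot-product similarity."""
    sims = [sum(a * b for a, b in zip(embedding, anchor)) for anchor in anchors]
    return softmax(sims, temperature)

def distillation_loss(teacher_emb, student_emb, teacher_anchors,
                      student_anchors, temperature=0.1):
    """KL divergence between teacher and student similarity distributions."""
    p = similarity_distribution(teacher_emb, teacher_anchors, temperature)
    q = similarity_distribution(student_emb, student_anchors, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

anchors = [[1.0, 0.0], [0.0, 1.0]]
same = distillation_loss([1.0, 0.0], [1.0, 0.0], anchors, anchors)
different = distillation_loss([1.0, 0.0], [0.0, 1.0], anchors, anchors)
```

Because the student is trained only on clean images, it never sees the trigger, which is presumably why the trigger association is not carried over from the backdoored teacher.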

Feature Space Analysis

Finally, the feature space visualization of the backdoor and clean models is shown below.

As shown in the figure, the embedding of images containing triggers (Patched Data) is distributed close to the target category images in the backdoor model and almost uniformly in the clean model.

We found that in the backdoor model, the false positives of the target category increase as the image containing the trigger becomes closer to the target category image in the embedding space.

Summary

The paper shows that a backdoor attack can be carried out by injecting contaminated images into the training set used for self-supervised learning and then presenting images that contain the trigger to a linear classifier trained on a downstream task.

The attack is effective against SSL methods such as MoCo v2, BYOL, and MSF, which learn to bring together the embeddings of two differently augmented views of the same image.

While the success of recent SSL models relies on the ability to use large unlabeled data sets, it has become clear that there is a concomitant risk of contaminated data being introduced by an attacker.

Addressing these vulnerabilities may be important in the development of future SSL methods.