
Self-supervised ViT With Deep Fake Detection

3 main points
✔️ The rapid growth of generative models has increased the demand for deepfake detection in many areas
✔️ Despite their success in other tasks, ViTs are underutilized in deepfake detection because of their high data and computational requirements
✔️ The adaptability and efficiency of self-supervised ViTs for deepfake detection were tested, with an emphasis on generalization from limited training data

Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis
written by Huy H. Nguyen, Junichi Yamagishi, Isao Echizen
(Submitted on 1 May 2024)
Comments: Published on arXiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)

code:  

 

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In this commentary article, we look at how effective self-supervised pre-trained transformers are, compared with supervised pre-trained transformers and conventional convolutional networks (ConvNets), at detecting deepfakes.

In particular, we focus on the potential for improved generalization when training data is limited. Despite the remarkable success of large-scale vision-language models built on transformer architectures across a variety of tasks, including zero-shot and few-shot learning, the deepfake-detection field remains reluctant to adopt pre-trained vision transformers (ViTs), including large-scale ones, as feature extractors.

One concern is that they often require excessive capacity and do not generalize well when the training or fine-tuning data are small or not diverse. This contrasts with ConvNets, which have already established themselves as robust feature extractors. Furthermore, training and optimizing a transformer from scratch requires substantial computational resources, which are largely available only to large companies, hindering broad research within the academic community.

Recent advances in self-supervised learning (SSL) for transformers, such as DINO and its derivatives, have shown strong adaptability across a variety of visual tasks, along with clear semantic segmentation capabilities. By performing deepfake detection with DINO on limited training data and applying partial fine-tuning, the authors confirm both its adaptability to the task and the natural explainability of its decisions through the attention mechanism. Furthermore, partially fine-tuning the transformer for deepfake detection offers a resource-efficient alternative that can significantly reduce computational cost.

Proposed Method

Problem Formulation

As a basic binary classification problem, given an input image \( I \) and a pre-trained backbone \( B \) with its classifier head removed, the objective is to construct a network \( F \) that uses \( B \) to classify \( I \) as either "real" or "fake". This can be expressed as

\[ \hat{y} = \begin{cases} \text{fake}, & \text{if } \sigma(F(B(I))) \geq \tau \\ \text{real}, & \text{otherwise} \end{cases} \]

where $\sigma(\cdot)$ is the sigmoid function, mapping the output of $F(B(I))$ to a probability in the range $[0, 1]$, and $\tau$ is the decision threshold.

A softmax function could also be used to convert the logits extracted by $F$ into probabilities, which would make it straightforward to extend the formulation from binary to multiclass classification. The backbone $B$ starts with a preprocessing module and consists of $n$ blocks. For simplicity, we denote the intermediate features of $I$ extracted by block $i$ as $\phi_i$. The method for determining the optimal value of $\tau$ varies from paper to paper; here, $\tau$ is set either to 0.5 or to the threshold corresponding to the equal error rate (EER) computed on the validation set, depending on the experimental setup.
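As a concrete illustration of this decision rule, here is a minimal PyTorch sketch (not the paper's code): it thresholds the sigmoid output of a single-logit linear head on top of a frozen backbone. The torch.hub entry, the 1024-dimensional embedding, and the already-trained head are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical setup (not the paper's exact code): B is a pre-trained backbone with its
# classification head removed, here a DINOv2 ViT-L/14 with registers loaded via torch.hub,
# and F is a small head producing a single logit (assumed to have been trained already).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14_reg")  # B
classifier = nn.Linear(1024, 1)                                            # F

def classify(image: torch.Tensor, tau: float = 0.5) -> str:
    """image: preprocessed tensor of shape (1, 3, H, W) with H and W divisible by 14."""
    with torch.no_grad():
        feat = backbone(image)                         # B(I): CLS embedding, shape (1, 1024)
        prob = torch.sigmoid(classifier(feat)).item()  # sigma(F(B(I))) in [0, 1]
    return "fake" if prob >= tau else "real"
```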

Figure 1: Overview of the two approaches being considered

Approach 1: Use the frozen backbone as a multilevel feature extractor

In this approach, the intermediate features $\phi_i$ are optionally processed by an adapter $A$, fused with the intermediate features extracted by other blocks via a feature fusion operation $\Sigma$, and then classified by a classifier $C$, which is generally linear. This approach is shown on the left side of Figure 1. The backbone $B$ remains frozen, and the intermediate features extracted by the final $K$ blocks are used. This can be formalized as

\[ F(B(I)) = C\big(\Sigma\big(A(\phi_{n-K+1}), \ldots, A(\phi_n)\big)\big). \]
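A rough sketch of this pipeline is given below. It assumes a DINOv2-style backbone that exposes `get_intermediate_layers(..., return_class_token=True)`, uses only the CLS embedding of each tapped block for brevity, and treats the adapter dimension and dropout rate as illustrative choices rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class MultiLevelDetector(nn.Module):
    """Approach 1 sketch: frozen backbone, last K blocks used as feature sources."""

    def __init__(self, backbone, embed_dim=1024, k=4, adapter_dim=256, dropout=0.5):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():          # B stays frozen
            p.requires_grad = False
        self.k = k
        # One linear adapter A per tapped block (optional dimensionality reduction).
        self.adapters = nn.ModuleList([nn.Linear(embed_dim, adapter_dim) for _ in range(k)])
        self.dropout = nn.Dropout(dropout)
        # Linear classifier C on the concatenated features (fusion Sigma = concat).
        self.classifier = nn.Linear(adapter_dim * k, 1)

    def forward(self, x):
        with torch.no_grad():
            # CLS embeddings phi_{n-K+1}, ..., phi_n from the last K blocks.
            taps = self.backbone.get_intermediate_layers(x, n=self.k, return_class_token=True)
            cls_tokens = [cls for _, cls in taps]
        fused = torch.cat([a(c) for a, c in zip(self.adapters, cls_tokens)], dim=-1)
        return self.classifier(self.dropout(fused))   # single logit; apply sigmoid outside
```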

Approach 2: Fine-tuning the last transformer block

This approach is more direct than Approach 1. As shown in Figure 1 (right), a new classifier $C$ is added after the backbone $B$. This can be formalized as

\[ F(B(I)) = C(B(I)) = C(\phi_n). \]

During fine-tuning, the first $n - k$ blocks are frozen. For a transformer backbone, the class (CLS) token and the register tokens (if present) are also unfrozen and fine-tuned together with the final $k$ blocks and the new classifier $C$ (a sketch of this setup follows the list below). The two main advantages of this approach are:

  • There are no additional parameters for the adapter $A$ or the feature fusion operation $\Sigma$. Avoiding extra parameters is advantageous because modern feature extractors, especially transformers, are already very large.
  • Because the final transformer block and the tokens are fine-tuned (in the case of a transformer backbone), the attention weights on the CLS token are adapted to deepfake detection. They can be used to naturally visualize the regions the detector focuses on, similar to the visualization technique used in DINO. This improves the explainability of the detector, an important factor in deepfake detection.
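A minimal sketch of this partial fine-tuning setup follows, assuming the attribute names of the DINOv2 ViT implementation (`blocks`, `cls_token`, `register_tokens`); other backbones may expose these differently.

```python
import torch.nn as nn

def setup_partial_finetuning(backbone, k=1, embed_dim=1024):
    """Freeze the first n - k blocks; fine-tune the last k blocks, the CLS/register
    tokens, and a new linear classifier C."""
    for p in backbone.parameters():
        p.requires_grad = False                          # freeze everything first
    for block in backbone.blocks[-k:]:                   # unfreeze the last k blocks
        for p in block.parameters():
            p.requires_grad = True
    backbone.cls_token.requires_grad = True              # unfreeze the CLS token
    if getattr(backbone, "register_tokens", None) is not None:
        backbone.register_tokens.requires_grad = True    # and the register tokens, if any
    classifier = nn.Linear(embed_dim, 1)                 # new classifier C (trainable)
    return backbone, classifier
```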

Experiment

Datasets and Evaluation Metrics

Images generated or manipulated by various deepfake methods were collected to construct the dataset. Details of the training, validation, and test sets are given in Table 1. The splits were designed to be balanced in the ratio of real to fake images and in the number of images per method, with no overlap between them.
For the cross-dataset evaluation, a dataset constructed by Tantaru et al., containing images generated or manipulated by diffusion-based methods, was used. The training set is used to train or fine-tune the models, the validation set for hyperparameter selection, and the test set for evaluation and comparison.

Table 1: Summary of datasets

The following evaluation metrics were used (a small numerical sketch of the EER and HTER computations follows the list):

  • Classification accuracy: \( \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \), where \( TP \) is true positives, \( TN \) true negatives, \( FP \) false positives, and \( FN \) false negatives
  • True negative rate (TNR): \( \text{TNR} = \frac{TN}{TN + FP} \)
  • Equal error rate (EER): the error rate at the operating point where the false positive rate (FPR) equals the false negative rate (FNR)
  • Half total error rate (HTER): \( \text{HTER} = \frac{FPR + FNR}{2} \)
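As referenced above, here is a small NumPy sketch of how the EER (with its threshold) and the HTER can be computed from detector scores. The convention that label 1 means "fake" is an assumption, and the threshold scan is a simple approximation rather than the paper's exact procedure.

```python
import numpy as np

def error_rates(scores, labels, threshold):
    """FPR and FNR at a given threshold; labels: 1 = fake (positive), 0 = real (assumed)."""
    preds = (scores >= threshold).astype(int)
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    fpr = fp / max(np.sum(labels == 0), 1)
    fnr = fn / max(np.sum(labels == 1), 1)
    return fpr, fnr

def eer_and_threshold(scores, labels):
    """Scan candidate thresholds and return the operating point where FPR ~= FNR."""
    best_t, best_gap = None, float("inf")
    for t in np.unique(scores):
        fpr, fnr = error_rates(scores, labels, t)
        if abs(fpr - fnr) < best_gap:
            best_gap, best_t = abs(fpr - fnr), t
    fpr, fnr = error_rates(scores, labels, best_t)
    return (fpr + fnr) / 2, best_t                       # EER estimate and its threshold

def hter(scores, labels, threshold):
    """Half total error rate at a fixed threshold."""
    fpr, fnr = error_rates(scores, labels, threshold)
    return (fpr + fnr) / 2
```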

Experimental results for Approach 1

Deepfake detection amounts to identifying deepfake fingerprints such as artifacts and irregular patterns, so relying solely on the CLS token is not optimal. We therefore evaluated the effect of also incorporating patch tokens and of using multiple intermediate features rather than only the final block. We also compared two feature fusion techniques, weighted sum (WS) and concatenation (concat), and validated the results for different sizes of DINO backbones. The results are presented in Table 2.
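To make the two fusion operators concrete, the sketch below shows one plausible form of each; the softmax normalization of the learnable weights is an assumption, not necessarily the paper's exact choice.

```python
import torch

def fuse_weighted_sum(features, weights):
    """WS fusion: one learnable scalar weight per block; output keeps dimension D."""
    # features: list of K tensors of shape (B, D); weights: learnable tensor of shape (K,)
    stacked = torch.stack(features, dim=0)               # (K, B, D)
    w = torch.softmax(weights, dim=0).view(-1, 1, 1)     # normalize the K weights
    return (w * stacked).sum(dim=0)                      # (B, D)

def fuse_concat(features):
    """Concat fusion: output dimension grows to K * D, so C sees every block directly."""
    return torch.cat(features, dim=-1)                   # (B, K * D)
```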

Table 2: EERs of Approach 1 models with DINO backbones of various versions and architectures

The "bigger is better" principle applies here. Larger backbone sizes generally result in lower EER. Utilizing all tokens yields much better results than relying entirely on CLS tokens. Also, utilizing multiple blocks provides better performance than using a single block, but training downstream modules can be more difficult as k increases. Feature concatenation yields better results than using weighted sums, and while there is generally no discernible difference in performance between DINO and DINOv2, in DINO there is no clear performance difference between using large and small patch sizes.

Table 3: Enhancing Approach 1 with the SSL pre-trained DINOv2 ViT-L/14-Reg backbone

DINOv2 ViT-L/14-Reg was selected because of its balance between performance and model size. A simple linear adapter was used to reduce the feature dimensionality before concatenation, and dropout was applied to reduce overfitting. The results are presented in Table 3.

The optimal configuration combines dropout, linear adapters, and feature concatenation. We applied this configuration to EfficientNetV2, DeiT III, and EVA-CLIP and compared their performance with DINOv2. The results are shown in Table 4: DINOv2 clearly outperformed EfficientNetV2 and DeiT III, while EVA-CLIP also performed well. These results highlight the advantage of SSL pre-training, which allows the model to learn strong representations that transfer to multiple tasks.

Table 4: Comparison with previous studies

Experimental results for Approach 2

DINOv2 ViT-L/14-Reg, validated in detail in Approach 1, was again selected to represent DINOv2; EfficientNetV2, DeiT III, and EVA-CLIP were selected for comparison. The performance with the final block (and, for the transformers, the tokens) fine-tuned is shown in Table 5. All models performed better than in Approach 1, narrowing the gap to DINOv2, with EVA-CLIP the closest competitor. Nevertheless, DINOv2 remained the top performer; to close the gap, EVA-CLIP would need to be pre-trained on a richly annotated, extensive dataset, an expensive undertaking compared with DINOv2, which was pre-trained on a much smaller unannotated dataset. With the same architecture (DeiT III and DINOv2), the EER difference is almost 6%, part of which may be due to differences in the training data. Overall, these results again highlight the important advantage of SSL pre-training for ViTs.

Table 5: Comparison of ConvNet and transformer architectures in Approach 2

Cross-Dataset Detection

This experiment evaluated the detector's ability to generalize to unknown deepfakes. This is a challenging scenario because the training set did not include diffusion-generated images. The classification thresholds were recalibrated using the unseen validation set, and the results are presented in Table 6. The performance of all models decreased; the best performer's EER rose from 11.32% to 27.61%. Overall, Approach 2 consistently outperformed Approach 1. Within Approach 2, EfficientNetV2 generalized better than the supervised pre-trained transformers, while DINOv2 maintained its position as the top performer, reaffirming the advantage of using SSL with ViTs.
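A sketch of this threshold recalibration step, reusing the hypothetical `eer_and_threshold` and `hter` helpers from the metrics section; `val_scores`, `val_labels`, `test_scores`, and `test_labels` are placeholder arrays of detector scores and ground-truth labels, and the choice of HTER as the reported test metric is illustrative.

```python
# Recalibrate the decision threshold on validation scores from the new (diffusion) domain,
# then evaluate on the test set. val_scores / val_labels / test_scores / test_labels are
# placeholder arrays; eer_and_threshold and hter are the helpers sketched above.
val_eer, tau = eer_and_threshold(val_scores, val_labels)   # threshold at the EER point
test_hter = hter(test_scores, test_labels, tau)            # error at the recalibrated threshold
print(f"val EER: {val_eer:.4f}  recalibrated tau: {tau:.4f}  test HTER: {test_hter:.4f}")
```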

Table 6: Performance comparison of various ConvNet and transformer architectures on an unseen test set of images generated or manipulated with diffusion-based methods

Conclusion

In this commentary article, we described two approaches for using SSL pre-trained ViTs, in particular DINO, as feature extractors for deepfake detection. The first approach uses a frozen ViT backbone to extract multi-level features; the second performs partial fine-tuning of the final $k$ blocks.

Across multiple experiments, the fine-tuning approach showed excellent performance and interpretability. These results provide valuable insights for the digital forensics community on using SSL pre-trained ViTs as feature extractors for deepfake detection.

