A New Frontier in Deepfake Detection Using CLIP
3 main points
✔️ Achieves state-of-the-art performance in CLIP-based fake detection
✔️ Introduces CVaR loss and AUC loss as new training objectives
✔️ Employs SAM optimization to improve generalization performance
Robust CLIP-Based Detector for Exposing Diffusion Model-Generated Images
written by Santosh, Li Lin, Irene Amerini, Xin Wang, Shu Hu
(Submitted on 19 Apr 2024)
Comments: Published on arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This study proposes a robust methodology for detecting fake images generated by diffusion models by integrating the multimodal image and language information obtained from the CLIP model. In particular, it makes new attempts to improve the model's generalization performance by introducing a Conditional Value at Risk (CVaR) loss and an Area Under the ROC Curve (AUC) loss.
In addition, parameter optimization using Sharpness-Aware Minimization (SAM) is introduced to further strengthen generalization. As a result of these efforts, the authors' method outperforms conventional CLIP-based methods.
Figure 1 compares the AUC of this study's method with that of the previous CLIP-based methods. The comparison shows that the proposed method performs substantially better than the previous methods.
Background
The development of diffusion models has made it possible to generate extremely sophisticated fake images. At the same time, precisely because of this sophistication, the fake images produced by generative models pose a serious threat to trustworthiness in digital media. The authors point out that such fake images are almost indistinguishable from real photographs and can undermine credibility across political, social, and personal spheres. Establishing a methodology to distinguish these fake images from real photos, and thereby providing a technology that ensures trustworthiness in digital society, is therefore a challenge not only for the AI research community but for society as a whole.
Proposed Method
Figure 2 provides an overview of the methodology proposed in this study. Below is a brief description of each of its key components.
Feature design integrating multimodal information from text and images
The network underlying this research is CLIP. As shown in Figure 2, images and text are input to CLIP to extract features for each modality, which are then concatenated and fed into an MLP for fake detection.
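To make the pipeline concrete, here is a minimal sketch of this feature-fusion stage, assuming a ViT-B/32 CLIP checkpoint from Hugging Face transformers; the class name `FusionMLP` and the hidden width are illustrative, and the exact head architecture in the paper may differ.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

# Assumption: a ViT-B/32 CLIP backbone; the paper's exact variant may differ.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class FusionMLP(nn.Module):
    """Illustrative detector head: concatenated CLIP image+text features -> fake/real logit."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit; sigmoid gives P(fake)
        )

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([img_feat, txt_feat], dim=-1)  # multimodal fusion
        return self.net(fused).squeeze(-1)

@torch.no_grad()
def extract_features(images, captions):
    """Frozen CLIP encoders produce one feature vector per modality."""
    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    img_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = clip.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    return img_feat, txt_feat
```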
Loss function design
In this paper, optimization is performed with the following loss function, which combines the two terms described below:

$\min_{\theta} \; \mathcal{L}_{CVaR}(\theta) + \gamma \cdot \mathcal{L}_{AUC}(\theta)$

Each term in the equation is discussed below. Here, $\gamma$ is a hyperparameter that determines the balance between the two terms.
Conditional Value-at-Risk (CVaR) Loss
CVaR loss is designed so that the model focuses on the most difficult examples in the dataset and is defined by the following equation:

$\mathcal{L}_{CVaR} = \min_{\lambda \in \mathbb{R}} \left\{ \lambda + \frac{1}{\alpha n} \sum_{i=1}^{n} \big[ \ell(\theta; F_i, Y_i) - \lambda \big]_{+} \right\}$

In this equation, $[a]_{+} = \max\{0, a\}$, $\ell$ is the loss function for classification, and $(F_i, Y_i)$ is a pair of features and class label. Furthermore, $n$ is the total number of data points and $\alpha$ is a hyperparameter. The formula takes the minimum over $\lambda$: as $\lambda$ decreases, samples whose loss $\ell$ exceeds the threshold $\lambda$ contribute a non-zero second term, starting with those that increase the loss the most. Conversely, samples $(F_i, Y_i)$ whose loss $\ell$ is smaller than the threshold $\lambda$ are ignored. In this sense, the loss function is designed so that optimization focuses on the samples that increase the loss.
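A minimal PyTorch sketch of this loss, with $\lambda$ treated as a learnable scalar optimized jointly with the model (variable names and the choice of binary cross-entropy as the inner loss $\ell$ are illustrative):

```python
import torch
import torch.nn.functional as F

def cvar_loss(logits, labels, lam, alpha=0.9):
    """CVaR wrapper around a per-sample classification loss.
    lam is a learnable scalar (lambda in the equation); alpha is the CVaR hyperparameter.
    """
    # per-sample loss l(theta; F_i, Y_i); binary cross-entropy is one natural choice
    per_sample = F.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none"
    )
    # lambda + (1 / (alpha * n)) * sum_i [l_i - lambda]_+
    return lam + torch.clamp(per_sample - lam, min=0.0).mean() / alpha

# Usage: let lam participate in the same optimization, realizing the min over lambda.
# lam = torch.nn.Parameter(torch.zeros(()))
# optimizer = torch.optim.Adam(list(model.parameters()) + [lam], lr=1e-4)
```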
AUC Loss
AUC loss, as the name implies, is designed so that optimization directly contributes to improving the AUC. It is defined as follows:

$\mathcal{L}_{AUC} = \frac{1}{n^{+} n^{-}} \sum_{i: Y_i = 1} \; \sum_{j: Y_j = 0} \big[ \eta - \big( s(\theta; F_i) - s(\theta; F_j) \big) \big]_{+}^{p}$

where $n^{+}$ and $n^{-}$ are the numbers of positive and negative samples. In the equation, $\eta \in (0,1]$, $p > 1$, and $s(\theta; F_i)$ denotes the score function. In other words, the loss is designed to widen the margin between positive and negative cases so as to improve the AUC.
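Following the formulation above, a small PyTorch sketch of the pairwise surrogate (the values of $\eta$ and $p$ here are illustrative, not the paper's settings):

```python
import torch

def auc_loss(scores, labels, eta=0.5, p=2):
    """Pairwise surrogate that penalizes positive/negative score pairs
    whose margin s(F+) - s(F-) falls below eta."""
    pos = scores[labels == 1]          # scores of fake (positive) samples
    neg = scores[labels == 0]          # scores of real (negative) samples
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())    # batch has only one class; no pairs to rank
    margin = pos.unsqueeze(1) - neg.unsqueeze(0)   # all n+ x n- pairwise margins
    return torch.clamp(eta - margin, min=0.0).pow(p).mean()
```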
Optimization technique
In this study, Sharpness-Aware Minimization (SAM) is used as the optimization technique. Rather than simply searching for parameters that minimize the loss, SAM searches for parameters whose neighborhood in the loss landscape is flat. As a result, the model is expected to achieve better generalization performance.
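For intuition, here is a compact sketch of one SAM update, not the authors' exact implementation; `rho`, the helper name, and the two-pass structure follow the original SAM paper. `compute_loss` would return, for example, the combined CVaR + $\gamma\cdot$AUC objective above.

```python
import torch

def sam_step(model, compute_loss, base_optimizer, rho=0.05):
    """One Sharpness-Aware Minimization update:
    (1) climb to the approximate worst point within an L2 ball of radius rho,
    (2) take the descent step using the gradient measured there."""
    # gradient at the current weights
    compute_loss().backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm(p=2) for p in params]), p=2)
    scale = rho / (grad_norm + 1e-12)
    eps = []
    with torch.no_grad():
        for p in params:
            e = p.grad * scale   # ascent direction within the rho-ball
            p.add_(e)            # perturb weights toward higher loss
            eps.append(e)
    model.zero_grad()
    # gradient at the perturbed weights drives the actual update
    compute_loss().backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)            # restore the original weights
    base_optimizer.step()
    base_optimizer.zero_grad()
```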
Experimental results
Comparison with Baseline
The dataset used for validation in this study consisted of real images from LAION-400M and four corresponding sets of fake images, generated with Stable Diffusion 1.4, Stable Diffusion 2.1, Stable Diffusion XL, and DeepFloyd IF. Two baselines were used: the first (Traditional 1) trains an MLP with binary cross-entropy loss on features from the CLIP image encoder alone, and the second (Traditional 2) trains an MLP with binary cross-entropy loss on the combined text and image features from CLIP. AUC was used as the evaluation metric.
Table 1 compares the AUC of each baseline with that of the proposed method. The results show that the proposed method outperforms the conventional methods.
Ablation Study
The authors performed an ablation study to determine how much CVaR loss, AUC loss, and SAM each contribute to the improved performance. Table 2 shows the results. In terms of AUC, the largest contribution comes from CVaR loss, followed by SAM and then AUC loss.
Change in landscape of loss function due to SAM
As a supplementary analysis, the authors visualized how SAM changes the landscape of the loss function. Figure 3 shows the loss landscape with and without SAM. The result suggests that introducing SAM does indeed lead to the selection of an optimal solution surrounded by a flat region.
Summary
To establish a new methodology for deepfake detection using CLIP, this study proposed a detection method that integrates features from both text and images. Particularly notable are the adoption of a loss function combining CVaR loss and AUC loss and the introduction of parameter optimization with SAM.
This study focuses on fake images created by diffusion models, so whether the method works equally well on fake images created by GANs remains a point for future discussion. Although limited in scope, the authors' ambitious attempt breaks new ground in fake detection technology, and future developments are expected.