![Detecting Fake Images With CLIP: Image-Language Model For Fake Detection](https://aisholar.s3.ap-northeast-1.amazonaws.com/media/May2024/clipping_the_deception.png)
Detecting Fake Images With CLIP: Image-Language Model For Fake Detection
3 main points
✔️ Fake detection using CLIP, a multimodal model of image and language
✔️ Comparing and investigating optimal transfer learning strategies for CLIP
✔️ Prompt tuning achieves state-of-the-art generalization performance
CLIPping the Deception: Adapting Vision-Language Models for Universal Deepfake Detection
written by Sohail Ahmed Khan, Duc-Tien Dang-Nguyen
(Submitted on 20 Feb 2024)
Comments: Published on arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This study discusses how to detect fake images using CLIP, a multimodal foundation model of image and language. The key novelty is that it establishes a fake image detection method that draws on both image and language information, whereas fake image detection has generally relied on image data alone. In particular, the authors achieve state-of-the-art generalization performance by comparing and examining various transfer learning strategies for applying CLIP to fake detection.
Background
Importance of Fake Image Detection
In recent years, with the remarkable progress of generative models such as generative adversarial networks (GANs) and diffusion models, it has become possible to generate fake images that are difficult even for humans to discern. At the same time, such highly realistic fake images carry risks, since they can be used to fabricate news. For example, a fabricated video of racist remarks by a government official could cause an international incident. Establishing a generic method to detect fake images is therefore an important social issue.
Technical Difficulties in Fake Image Detection
The technical difficulty of fake image detection lies in the diversity of generative models: a detection method must remain robust and generic as generators grow more diverse and complex. This is hard because deep learning, the fundamental technique involved, is essentially interpolative, so it struggles to generalize to regions outside the distribution of the training dataset. This research attempts to overcome this difficulty with the rich expressive power of CLIP, a multimodal foundation model of image and language, and represents a new trend in fake image detection.
Related Research
CLIP (Contrastive Language-Image Pre-training)
CLIP is a multimodal foundation model of image and language, pre-trained on a large dataset of images and their associated text captions. CLIP's rich expressive power has also attracted interest for fake image detection. In fact, visualizing the feature space learned by CLIP shows that real and fake images are well separated (Figure 1).
![Feature_space Figure 1: t-SNE visualization of real (red) and fake (green) images in the feature spaces of several models.](https://aisholar.s3.ap-northeast-1.amazonaws.com/posts/May2024/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88_2024-05-17_10.47.51.png)
Proposed Methodology: Four Transfer Learning Strategies
In this study, the following four transfer learning strategies for applying CLIP to fake detection are organized, compared, and discussed.
![transfer_learning Transfer learning strategies](https://aisholar.s3.ap-northeast-1.amazonaws.com/posts/May2024/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88_2024-05-17_10.58.07.png)
Prompt Tuning
Using a method called Context Optimization (CoOp), this strategy optimizes the prompts fed into CLIP's language encoder: the prompt embeddings themselves are the trainable parameters, while both encoders remain frozen.
Adapter
This strategy adds a lightweight layer on top of the image features and trains only that layer, leaving all parameters of CLIP's language and image encoders unchanged.
Fine Tuning
All parameters of CLIP are re-trained on the fake detection task. This strategy has the largest number of trainable parameters.
Linear Probing
This strategy uses only CLIP's image encoder and classifies each image as real or fake with a single linear layer trained on the frozen output features.
Experimental results
For each transfer learning strategy, the models were trained using only a dataset generated by ProGAN, and generalization performance was tested on datasets obtained from 21 different GAN-based, diffusion-based, and commercial image generators. Table 1 details the 21 datasets prepared.
![validation_dataset Validation datasets](https://aisholar.s3.ap-northeast-1.amazonaws.com/posts/May2024/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88_2024-05-17_11.15.44.png)
Generalization performance
The authors tested the generalization performance of the trained models on the various datasets. Table 2 compares accuracy on each dataset. Compared with previous studies, the results suggest an advantage for this study's approach, which integrates multimodal information from images and language. In particular, Prompt Tuning proves to be the best transfer learning strategy for CLIP. These results suggest that integrating multimodal image and language information is useful for fake detection, which has conventionally relied on image data alone, and they point to a new trend in the field.
On the other hand, accuracy on the Face Swap dataset is lower than on the other datasets, a tendency also seen in previous studies. In other words, while the method is highly accurate when the entire image is generated, as with GANs and diffusion models, it may be less accurate when only part of an image is edited or replaced, as in face swapping, and further investigation is needed.
![accuracy Accuracy comparison](https://aisholar.s3.ap-northeast-1.amazonaws.com/posts/May2024/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88_2024-05-17_11.50.20.png)
Impact of training dataset size
The authors also examine how the size of the training dataset affects performance, assuming a real-world use case where only a limited number of images are available. Table 3 summarizes model performance for several training dataset sizes. From these results, the authors conclude that changing the training dataset size made no significant difference in model performance. This means the strategies considered in this study remain valid for real-world use cases with only limited data.
![dataset_size Dataset size](https://aisholar.s3.ap-northeast-1.amazonaws.com/posts/May2024/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88_2024-05-17_11.42.05.png)
The authors further evaluated the model by training it on a dataset of only 32 images (16 real / 16 fake) from each image category, 640 images in total. This few-shot validation also demonstrated the usefulness of the proposed approach and showed that Prompt Tuning outperformed the other strategies.
![few-shot Few-shot results](https://aisholar.s3.ap-northeast-1.amazonaws.com/posts/May2024/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88_2024-05-17_13.08.45.png)
Impact of image post-processing on performance
When images are shared online in the real world, they are commonly post-processed, and it is well known that post-processing can significantly degrade fake detection performance. Against this background, the authors also examine how detection performance changes when images undergo post-processing. Two operations were considered in the paper: (1) JPEG compression and (2) Gaussian blurring. Figure 3 summarizes each model's robustness to these transformations. Interestingly, Linear Probing shows the most robust performance in this test.
![post_processing Post-processing robustness](https://aisholar.s3.ap-northeast-1.amazonaws.com/posts/May2024/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88_2024-05-17_12.00.17.jpg)
Summary and Conclusion
In this study, the robustness of CLIP-based fake detection was comprehensively tested on a variety of fake image datasets. Four transfer learning strategies for applying CLIP to fake detection were compared and investigated: Fine Tuning, Linear Probing, Prompt Tuning, and Adapter. The experimental results suggest that CLIP's integration of multimodal image and language information is effective for fake detection as well. This points to a new trend in fake detection technology, along with the need for further development of detection techniques for other kinds of fakes, such as face swaps.