
Diffusion Facial Forgery (DiFF), A New Large-scale Dataset For Face Forgery Detection

Face Recognition

3 main points
✔️ Development of the "DiFF" diffusion face forgery dataset: a large-scale dataset containing over 500,000 high-quality diffusion-generated face images was built to advance face forgery detection techniques.
✔️ Design of diverse and accurate prompts: three types of prompts, including textual and visual prompts, were designed to ensure high-quality and diverse image generation.
✔️ Advances in forgery detection techniques: a new method based on edge graphs is proposed and integrated into existing models, significantly improving the accuracy of forgery detection on diffusion-generated faces. A new benchmark for forgery detection is also proposed.

Diffusion Facial Forgery Detection
written by Harry Cheng, Yangyang Guo, Tianyi Wang, Liqiang Nie, Mohan Kankanhalli
(Submitted on 29 Jan 2024)
Comments: The dataset will be released at \url{this https URL}
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

code: https://github.com/xaCheng1996/DiFF

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Conditional Diffusion Models (CDMs) have received a great deal of attention in the field of image generation in recent years. They can produce remarkably faithful images from simple inputs such as natural-language prompts. However, this advance has raised new concerns about security and privacy: individuals with malicious intent can now easily generate large quantities of fake images of arbitrary persons, a situation with potentially serious implications for society.

To address this problem, researchers are building datasets for identifying and analyzing diffusion-generated images. Such datasets contribute to the development of forgery detection techniques by surfacing the subtle clues that distinguish generated images from real ones. However, existing datasets remain limited in size and diversity, especially for facial forgery detection.

To fill this gap, this paper proposes Diffusion Facial Forgery (DiFF), a dataset that sets itself apart from all existing datasets in its size, diversity, and detailed annotation. It is the first comprehensive dataset dedicated to diffusion-generated face forgery. As the table below shows, it contains over 500,000 forged face images, significantly more than any previous facial dataset, allowing researchers to identify and analyze counterfeit images with unprecedented accuracy.

Furthermore, through experiments on DiFF, the paper reveals the limitations of existing forgery detection models when applied to diffusion-generated faces. To overcome these limitations, the paper also proposes a new method based on edge graphs, which can be integrated into existing models to significantly improve forgery detection accuracy.
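The paper's edge-graph method itself is beyond the scope of this article, but the kind of low-level edge cue such a method builds on can be illustrated with a plain NumPy Sobel filter. This is an illustrative sketch only; the actual edge-graph construction and regularizer belong to the paper and are not reproduced here.

```python
import numpy as np

def sobel_edge_map(img: np.ndarray) -> np.ndarray:
    """Gradient-magnitude edge map of a grayscale image via 3x3 Sobel filters.
    Illustrative only: edge cues like these are the kind of signal an
    edge-based forgery detector could start from."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

# A vertical step edge produces a strong response along the boundary.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = sobel_edge_map(img)
```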

As described above, this study is the result of three important efforts: building a high-quality diffusion-generated face dataset, providing a comprehensive benchmark, and developing a new detection method. In this article, we will focus specifically on the construction of a high-quality diffusion-generated face dataset.

What is Diffusion Facial Forgery (DiFF)?

For data collection, images of 1,070 celebrities were carefully selected from existing celebrity datasets (e.g., VoxCeleb2 and CelebA). The selection is gender-balanced and covers a variety of age groups. For each celebrity, approximately 20 images were gathered from online videos and existing datasets, for a total of 23,661 images.

The next step is to generate face images. Previous studies have shown a positive correlation: the higher the quality of the input prompts, the better the quality of the generated images. Based on this, the authors designed a diverse and accurate set of prompts to help generate high-quality images with the Conditional Diffusion Model (CDM). DiFF includes three types of prompts: the original text prompt (P_t_ori), the modified text prompt (P_t_mod), and the visual prompt (P_v). All of these serve as conditions guiding the diffusion model's image generation.

The original text prompts (P_t_ori) are diverse, natural text prompts produced semi-automatically. First, 2,531 high-quality images were curated by selecting clear frontal-face images of each celebrity. These images were converted into text descriptions with a prompt-inversion tool, then reviewed by experts and rewritten to remove unnecessary terms and improve clarity. Through this process, 10,084 refined prompts were created.

The modified text prompts (P_t_mod) are obtained by randomly modifying the main attributes of P_t_ori (gender, hair color, facial expression, etc.) in order to expand prompt diversity. This makes it possible to generate images with selected features altered; for example, "a man with an emotional face" can become "a woman with an emotional face".
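As a rough sketch of how a P_t_mod prompt could be derived from a P_t_ori prompt, the snippet below swaps one recognized attribute word for an alternative. The attribute vocabulary here is hypothetical; the paper does not publish its exact word lists.

```python
import random
import re

# Hypothetical attribute vocabulary; the paper modifies attributes such as
# gender, hair color, and expression, but its exact word lists are not given.
ATTRIBUTES = {
    "gender": ["man", "woman"],
    "hair": ["blond hair", "black hair", "red hair"],
    "expression": ["smiling", "emotional", "surprised"],
}

def modify_prompt(prompt: str, rng: random.Random) -> str:
    """Derive a P_t_mod-style prompt from a P_t_ori-style prompt by
    swapping the first recognized attribute for a random alternative."""
    for values in ATTRIBUTES.values():
        for value in values:
            if re.search(rf"\b{value}\b", prompt):
                alternative = rng.choice([v for v in values if v != value])
                return re.sub(rf"\b{value}\b", alternative, prompt, count=1)
    return prompt  # no known attribute found; return the prompt unchanged

print(modify_prompt("a man with an emotional face", random.Random(0)))
# "a woman with an emotional face"
```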

The visual prompts (P_v) consist of facial features (embeddings, sketches, landmarks, segmentation maps, etc.) extracted from each image. These features condition the diffusion model and are particularly useful for tasks such as face editing, since conditioning on visual cues enables more targeted image generation.
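One way to picture a visual prompt is as a bundle of per-image conditions handed to the diffusion model. The container below is purely illustrative: the actual feature extractors (face encoders, landmark detectors, face parsers) and conditioning interface are not specified in this article.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VisualPrompt:
    """Hypothetical container for the per-image visual conditions (P_v)."""
    embedding: List[float]            # identity embedding from a face encoder
    landmarks: List[Tuple[int, int]]  # (x, y) facial keypoints
    sketch_path: str                  # path to an edge/sketch rendering
    segmentation_path: str            # path to a face-parsing mask

def as_conditions(p: VisualPrompt) -> dict:
    """Bundle the features in the key/value form a conditional
    diffusion pipeline would typically accept."""
    return {
        "embedding": p.embedding,
        "landmarks": p.landmarks,
        "sketch": p.sketch_path,
        "segmentation": p.segmentation_path,
    }
```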

Finally, there is face forgery generation. Face forgery generation techniques can be divided into four main approaches according to the type of input: Text-to-Image (T2I), Image-to-Image (I2I), Face Swapping (FS), and Face Editing (FE).

Text to Image (T2I) takes a specific text prompt (e.g., "men in uniform") and generates images that match the content. This method creates specific visuals from intuitive text-based instructions. Image to Image (I2I) and Face Swapping (FS), on the other hand, use visual input; I2I replicates the features of a specific identity, while FS performs more detailed manipulation by swapping the faces of two different identities. Face Editing (FE) employs a combination of both textual and visual conditions to modify certain facial attributes (e.g., facial expressions and lip movements) while preserving others. This approach allows for more complex editing.

In each category, state-of-the-art (SoTA) methods are employed to increase the diversity of the generated faces. Specifically, text-to-image uses Midjourney, Stable Diffusion XL (SDXL), FreeDoM T, and HPS. Image-to-image uses Low-Rank Adaptation (LoRA), DreamBooth, SDXL Refiner, and FreeDoM I, which recapture and optimize specific facial features. Face swapping uses DiffFace and DCFace to swap faces between different identities. Face editing uses Imagic, Cycle Diffusion (CycleDiff), and Collaborative Diffusion (CoDiff) for finer-grained edits.
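Restating the list above in code form, the 13 methods group by approach as follows (method names as given in the paper):

```python
# The 13 synthesis methods used in DiFF, grouped by generation approach.
DIFF_METHODS = {
    "T2I": ["Midjourney", "SDXL", "FreeDoM T", "HPS"],
    "I2I": ["LoRA", "DreamBooth", "SDXL Refiner", "FreeDoM I"],
    "FS": ["DiffFace", "DCFace"],
    "FE": ["Imagic", "CycleDiff", "CoDiff"],
}

# Sanity check: the counts add up to the 13 methods reported in the paper.
total = sum(len(methods) for methods in DIFF_METHODS.values())
print(total)  # 13
```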

The table below provides detailed statistics on DiFF, which employs 13 different methods to synthesize high-quality results based on 2,500 images and their corresponding 20,000 text prompts and 10,000 visual prompts.

The total number of images generated exceeds 500,000.

Summary

In this paper, the authors develop and release DiFF, a large, high-quality diffusion-generated face forgery dataset, to address the tendency of existing datasets to underestimate the risks associated with face forgery. The dataset contains over 500,000 facial images, each created from a variety of prompts while retaining high fidelity to the original image.

The paper also includes extensive experiments with DiFF and proposes a new benchmark for face forgery detection. In addition, a new edge graph regularization method is developed to improve detection performance. In the future, we plan to extend DiFF to include a variety of methods and conditions, and to explore new DiFF-based challenges such as tracking and retrieval of diffusion-generated images.

In addition, the original facial images in the constructed dataset were obtained from publicly available online videos of celebrities. All prompts were rigorously reviewed to ensure that they do not describe specific biometric information, and the authors state that the generated images were carefully reviewed for conformance with social values. By tightly controlling the acquisition process of the DiFF dataset, the authors aim to minimize the risk of potential misuse. DiFF is available on GitHub here: https://github.com/xaCheng1996/DiFF

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.

Contact Us