IDiff-Face: Evolution Of Face Recognition Technology Using Synthetic Data And Addressing Legal And Ethical Issues
3 main points
✔️ Building the "IDiff-Face" dataset: Proposing a new synthetic dataset "IDiff-Face" to address legal and ethical issues
✔️ Application to face recognition technology: Face recognition using IDiff-Face achieves higher accuracy than conventional synthetic datasets
✔️ Balancing privacy protection and technological evolution: Addresses the challenge of privacy protection in face recognition dataset generation, while providing a new method to facilitate the evolution of face recognition technology
IDiff-Face: Synthetic-based Face Recognition through Fizzy Identity-Conditioned Diffusion Models
written by Fadi Boutros, Jonas Henry Grebe, Arjan Kuijper, Naser Damer
(Submitted on 9 Aug 2023 (v1), last revised 10 Aug 2023 (this version, v2))
Comments: Accepted at ICCV2023
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Since the breakthrough of deep learning, face recognition technology has improved dramatically in accuracy and is now used in many aspects of daily life. A major driver of this progress has been large-scale datasets. However, most of these datasets were collected from the Internet without the consent of the people pictured. This has led to legal and ethical issues, and many datasets are no longer available.
To address these legal and ethical issues, synthetic face images have been attracting attention as substitutes. However, conventional synthetic datasets have problems such as a lack of intra-class diversity for individual faces and poor identity discrimination between different faces. This paper proposes a new method for generating synthetic datasets, called "IDiff-Face," to address these issues, and shows that face recognition using IDiff-Face is more accurate than with existing synthetic datasets, approaching the accuracy of face recognition trained on real face images.
This paper proposes a new method for further advancing face recognition techniques using synthesized datasets, while avoiding legal and ethical issues.
What is "IDiff-Face"?
The figure below is an overview of IDiff-Face. It is divided into two parts: the upper part (Training) and the lower part (Sampling). The upper part visualizes the training process: the Denoising U-Net is conditioned on a context based on features obtained from a pre-trained face recognition model, and the entire training of the Diffusion Model (DM) takes place in the latent space of a pre-trained Autoencoder (AE). In the lower part, the trained Diffusion Model (DM) generates samples conditioned on three types of facial features. By fixing the facial features and varying the added noise, different samples of the same identity can be generated.
IDiff-Face is based on a Denoising Diffusion Probabilistic Model (DDPM) trained in the latent space of a pre-trained autoencoder. At its core is conditioning on the features, or "identity context," obtained from a face recognition model. This conditioning allows IDiff-Face to generate identity-specific face images. From the features of an input face image, it can produce realistic face images of a non-existent person: the technique can generate variations of existing identities as well as images of entirely new synthetic identities.
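The training and sampling logic described above can be sketched as follows. This is an illustrative sketch only, assuming toy stand-ins for the paper's components: the pre-trained autoencoder, the face recognition network, and the Denoising U-Net are replaced here by arrays and placeholder shapes, and `add_noise`, `training_inputs`, and all dimensions are hypothetical names chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (illustrative only): the real model uses a pre-trained
# autoencoder, a pre-trained face recognition network, and a Denoising U-Net.
LATENT_DIM, EMBED_DIM, T = 16, 8, 1000
betas = np.linspace(1e-4, 0.02, T)      # a standard DDPM noise schedule
alphas_bar = np.cumprod(1.0 - betas)

def add_noise(z0, noise, t):
    """Forward diffusion q(z_t | z_0) in the autoencoder's latent space."""
    a = alphas_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * noise

def training_inputs(z0, identity_context):
    """Assemble the inputs of one conditioned DDPM training step (sketch).
    A real Denoising U-Net would predict `noise` from (z_t, t, context)."""
    t = int(rng.integers(0, T))
    noise = rng.normal(size=z0.shape)
    z_t = add_noise(z0, noise, t)
    return (z_t, t, identity_context), noise

z0 = rng.normal(size=LATENT_DIM)        # latent code from the autoencoder
context = rng.normal(size=EMBED_DIM)    # identity embedding from the FR model
(z_t, t, ctx), target = training_inputs(z0, context)
# At sampling time, fixing `context` and re-drawing the initial noise yields
# different images of the same synthetic identity.
```

The key design point is that the diffusion process runs entirely in the autoencoder's latent space, with the identity embedding acting as the conditioning signal at every denoising step.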
In addition, a technique called Contextual Partial Dropout (CPD) is used to increase the diversity of the generated images. It is designed to prevent over-fitting to the identity context and to allow different images to be generated even from the same identity context. By randomly ignoring parts of the context, this process preserves diversity in image generation.
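A minimal sketch of the idea behind CPD, assuming element-wise masking of the identity embedding (the exact granularity of "parts of the context" is an assumption here, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def contextual_partial_dropout(context, p):
    """Zero out a random subset of the identity-context entries with
    probability `p`. Element-wise masking is an assumption of this sketch;
    the mechanism only requires parts of the context to be randomly ignored."""
    mask = (rng.random(context.shape) >= p).astype(context.dtype)
    return context * mask

ctx = np.ones(512)                       # toy identity embedding
dropped = contextual_partial_dropout(ctx, p=0.25)
# Roughly 25% of the entries are zeroed, so the model cannot rely on the
# full identity context and is pushed toward more varied samples.
```

Because the model never sees the complete context reliably during training, it cannot memorize a one-to-one mapping from embedding to image, which is what controls the trade-off between identity separability and intra-class variation.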
The figure below visually compares typical synthetic face image datasets with the IDiff-Face dataset proposed in this paper. The top group (blue) shows synthetic face images used by SOTA synthetic-based face recognition models. The next group (green) presents samples from the IDiff-Face model with different CPD probabilities and different types of synthetic embeddings. The last group (yellow) shows variations of existing identities from the LFW dataset generated by the proposed method. There are four images per identity, with two identities shown for each method.
Synthetic face recognition models such as SynFace and USynthFace use synthetic images generated by DiscoFaceGAN, a GAN based on disentangled representation learning. Because the generated face images are controlled by a set of predefined attributes, they can lack the intra-class diversity present in real-world face images. SFace, on the other hand, is a class-conditional GAN that does not explicitly model these attributes: it is trained to produce synthetic images conditioned on specific class labels. It can produce images with more intra-class variation, but suffers from low identity separability. In contrast, DigiFace-1M images are generated by rendering 3D face models; its identities are artificially defined as a combination of facial geometry, texture, and especially hair style. However, this approach is extremely computationally expensive, as it relies on a sophisticated rendering pipeline to generate large datasets, making it less practical for research purposes.
Here we quantitatively evaluate the differences between the face images generated by the various methods described above. The table below shows the results of the identity-separability evaluation on the synthetic datasets produced by the proposed model. The first two rows show the results for the authentic LFW and CASIA-WebFace datasets. Compared to LFW and CASIA-WebFace, which are composed of real face images, IDiff-Face (CPD 0%) shows similar performance for both Two-Stage and Uniform: the EER on LFW is 0.002, versus 0.003 (Two-Stage) and 0.007 (Uniform) for IDiff-Face (CPD 0%).
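For context, the EER reported in the table can be computed from genuine (same-identity) and imposter (different-identity) similarity scores. The sketch below uses the standard definition of Equal Error Rate on toy data; it is not the authors' evaluation code, and the score distributions are invented for illustration.

```python
import numpy as np

def eer(genuine, imposter):
    """Equal Error Rate: the operating point where the false accept rate
    (imposters accepted) equals the false reject rate (genuine rejected).
    Standard definition; a sketch, not the authors' evaluation code."""
    thresholds = np.sort(np.concatenate([genuine, imposter]))
    best, best_gap = 1.0, np.inf
    for thr in thresholds:
        far = np.mean(imposter >= thr)   # imposter pairs wrongly accepted
        frr = np.mean(genuine < thr)     # genuine pairs wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, best = abs(far - frr), (far + frr) / 2
    return best

rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 2000)    # toy same-identity similarity scores
imposter = rng.normal(0.3, 0.1, 2000)   # toy cross-identity similarity scores
# Well-separated score distributions give a low EER, mirroring the small
# values (0.002-0.007) reported in the table.
```

A lower EER means the synthetic identities are better separated, which is why the near-identical EERs of IDiff-Face (CPD 0%) and real datasets indicate comparable identity separability.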
Performance is also evaluated on synthetic data generated using Uniform embeddings with Contextual Partial Dropout (CPD) probabilities of 0%, 25%, and 50%, as well as on synthetic data generated using Two-Stage embeddings. All training datasets consist of 5,000 identities with 16 samples per identity, for a total of 80,000 samples, evaluated on five benchmarks: LFW, AgeDB-30, CA-LFW, CFP-FP, and CP-LFW.
As the table above shows, face recognition models trained on the IDiff-Face datasets achieve high accuracy even with a small synthetic dataset (80K samples). Among these, models trained on the CPD25 and CPD50 generated datasets achieve very competitive results. CPD is shown to significantly improve face recognition accuracy by increasing the intra-class variability of the generated samples.
The table below also shows the verification accuracy of SOTA synthetic-based face recognition methods on five benchmarks. The first two rows show the results of face recognition models trained on real face image data, included for comparison. The face recognition models for synthetic face image data use ResNet-50. The best verification accuracy among the synthetic-based models is bolded, and the second best is underlined.
Face recognition models trained with IDiff-Face outperform all previous synthetic-based face recognition models, achieving an average accuracy of 88.20%, compared to 83.45% for the best previous result (based on DigiFace-1M). In addition, IDiff-Face improves accuracy in all experimental settings as the size of the training dataset increases. Furthermore, increasing the dataset width (number of identities) yields higher accuracy than increasing the dataset depth (number of images per identity). For example, IDiff-Face with CPD25 (Uniform) achieves an average accuracy of 82.86% with 160K samples (5K identities, 32 images per identity), which improves to 83.87% with 160K samples (10K identities, 16 images per identity).
Training datasets for face recognition require large intra-class variability. Datasets of real face images possess this characteristic and have contributed significantly to the accuracy of face recognition, but privacy concerns have made their use for training increasingly difficult. This study proposes "IDiff-Face" to solve this problem: an identity-conditional generative model based on the Diffusion Model (DM). It also introduces Contextual Partial Dropout (CPD), a simple yet effective mechanism that prevents the model from over-fitting to the identity context and controls the trade-off between identity separability and intra-class variation. Using IDiff-Face, the authors achieve SOTA accuracy on five major face recognition benchmarks, outperforming the leading synthetic-based face recognition methods.