DigiFace-1M, A New Large-scale Dataset For Face Recognition Using Synthetic Facial Images
3 main points
✔️ Eliminates the problems of privacy invasion and lack of consent pointed out by traditional datasets for face recognition
✔️ Largest synthetic dataset for face recognition, with 1.22 million face images of 110,000 identities
✔️ Achieves higher accuracy than SynFace, the previous state-of-the-art model trained on synthetic face images
DigiFace-1M: 1 Million Digital Face Images for Face Recognition
written by Gwangbin Bae, Martin de La Gorce, Tadas Baltrusaitis, Charlie Hewitt, Dong Chen, Julien Valentin, Roberto Cipolla, Jingjing Shen
(Submitted on 5 Oct 2022)
Comments: WACV 2023
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
In the last couple of years, the accuracy of face recognition tasks has improved significantly thanks to the development of large datasets. For a face recognition model to recognize more people, it needs to be trained on face images of more identities (number of people). To achieve higher generality, it also needs to be trained on many varied face images of each person (number of images/person). For example, MS1MV2 consists of 5.8 million face images of 85,000 people (about 68 images/person), and WebFace260M consists of 260 million face images of 4 million people (about 65 images/person).
On the other hand, many problems have been raised with these large datasets. One is privacy. Many traditional large-scale datasets collect celebrity face images from the Internet to increase the number of identities, or use publicly available face images from Flickr and similar sites. However, these images were never provided as training data for face recognition models, and the subjects' consent was not obtained. As a result, such datasets have been criticized for invasion of privacy, and some have restricted access. This trend may well spread in the future.
Another problem is label noise. Because celebrity face images are collected by searching names on the Internet, the search results often contain incorrect face images, which become noise when training face recognition models and degrade performance. Finally, there is data bias. This also stems from collecting celebrity images on the Internet: such images involve special conditions like makeup and lighting compared to ordinary people's face images, and they also skew Caucasian. For example, in the large dataset CASIA-WebFace, 85% of the faces are Caucasian.
As mentioned above, traditional large datasets have many problems. Therefore, a dataset with synthesized face images has attracted much attention.
What is DigiFace-1M?
In this paper, face images are synthesized using the method presented by Wood et al. (more details can be found in this article). A parametric model of face geometry and texture was built from 3D scans of 511 consenting individuals and processed in a graphics pipeline; the face shape, orientation, expression, texture, hair, and accessories (clothing, makeup, glasses, head and face wear) are then varied randomly to synthesize detailed and diverse face images. The following figure shows a sample of the synthesized face images included in the dataset.
In this paper, 1.22 million face images of 110,000 people (about 11 images/person) were generated. Even larger datasets are possible, depending on the cost of generating and storing the images.
Composition of the dataset
The dataset contains 1.22 million face images of 110,000 people (about 11 images/person) and is composed of two subsets.
One subset consists of 720,000 face images of 10,000 people. For each person, 4 sets of accessory combinations are prepared, and for each set, 18 images are synthesized with different camera conditions, face orientations, facial expressions, and environments. In other words, 72 images (= 4 × 18) are synthesized per person. Because each person appears in a wide variety of images, training on this subset can be expected to improve the model's generality.
The other subset consists of 500,000 face images of 100,000 people. For each person, one accessory combination is prepared, and five images are synthesized with different camera conditions, face orientations, facial expressions, and environments. In other words, five images (= 1 × 5) are synthesized per person. Although the number of images per person is small, the large number of identities helps improve the accuracy of distinguishing more people. The following figure shows sample accessory combinations for each person. Clothes, glasses, makeup, face wear (masks, etc.), and headwear (hats, etc.) are applied randomly, and the color and density of hair and beard are also randomized. In the figure below, each row is the same person, shown with 4 accessory combinations.
For each of these accessory combinations, images with different camera conditions, face orientations, expressions, and environments are synthesized, as shown in the figure below: 18 images per combination for the first subset and 5 for the second.
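The arithmetic behind the two subsets can be sketched as follows (the subset sizes are taken from the paper; the helper function is our own illustration):

```python
# Sketch of how the two DigiFace-1M subsets add up to 1.22M images.

def subset_size(num_ids: int, sets_per_id: int, images_per_set: int) -> int:
    """Total images = identities x accessory sets x renders per set."""
    return num_ids * sets_per_id * images_per_set

# Subset 1: 10,000 IDs x 4 accessory sets x 18 renders = 72 images/ID
subset1 = subset_size(10_000, 4, 18)
# Subset 2: 100,000 IDs x 1 accessory set x 5 renders = 5 images/ID
subset2 = subset_size(100_000, 1, 5)

print(subset1)            # 720000
print(subset2)            # 500000
print(subset1 + subset2)  # 1220000
```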
Since the synthetic face images are intended to replace real ones, they should match the conditions of real face images as closely as possible. Real face images have partially occluded faces, distortions, and camera-specific noise, none of which exist in the synthesized images. Therefore, data augmentation is also performed to reduce this domain gap. As shown in the figure below, Flip & Crop, Appearance (noise and blur), and Warping (displacement and distortion) are applied to the raw image.
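A minimal NumPy sketch of the three augmentation families follows. These are our own simplified stand-ins, not the paper's exact operations: a small random translation substitutes for full geometric warping, and a 3×3 box blur substitutes for the paper's blur kernels.

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_and_crop(img: np.ndarray, crop: int) -> np.ndarray:
    """Random horizontal flip, then a random square crop of side `crop`."""
    if rng.random() < 0.5:
        img = img[:, ::-1]
    h, w = img.shape[:2]
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    return img[y:y + crop, x:x + crop]

def appearance(img: np.ndarray, sigma: float = 5.0) -> np.ndarray:
    """Additive Gaussian noise plus a crude 3x3 box blur."""
    noisy = img.astype(float) + rng.normal(0.0, sigma, img.shape)
    padded = np.pad(noisy, 1, mode="edge")
    blurred = sum(
        padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
        for dy in range(3) for dx in range(3)
    ) / 9.0
    return np.clip(blurred, 0, 255).astype(np.uint8)

def warp(img: np.ndarray, max_shift: int = 4) -> np.ndarray:
    """Small random translation as a stand-in for displacement/distortion."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(img, (int(dy), int(dx)), axis=(0, 1))

face = rng.integers(0, 256, size=(112, 112), dtype=np.uint8)  # dummy grayscale image
aug = warp(appearance(flip_and_crop(face, 96)))
print(aug.shape)  # (96, 96)
```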
Impact of data augmentation
The table below shows the impact of data augmentation. Accuracy improves on the dataset with augmentation (red box), and the gains are especially large on the benchmarks with many variations in face orientation (CFP-FP and CPLFW).
Composition of the dataset (number of people vs. number of images/person)
The table below compares accuracy while varying the ratio of the two subsets in the dataset, gradually shifting from a composition with many images per person (images/ID) to one with many identities (IDs). The results show that mixing the two subsets improves accuracy compared to using either one alone.
For a face recognition model to recognize more people, it needs to be trained on face images of more identities (number of people); for higher generality, it needs many varied images of each person (number of images/person). Mixing two subsets with different images/ID ratios is an efficient way to obtain both effects.
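The trade-off the experiment sweeps can be sketched under a fixed image budget. The budget matches DigiFace-1M's total, but the mixing fractions below are illustrative, not the paper's exact grid:

```python
# Trading identities against images/ID under a fixed image budget.

BUDGET = 1_220_000  # total images, as in DigiFace-1M

def mix(frac_wide: float, imgs_wide: int = 72, imgs_deep: int = 5):
    """Split the budget: `frac_wide` of the images go to the
    many-images/ID ("wide") subset, the rest to the many-IDs
    ("deep") subset. Returns the ID count of each subset."""
    wide_imgs = int(BUDGET * frac_wide)
    deep_imgs = BUDGET - wide_imgs
    return wide_imgs // imgs_wide, deep_imgs // imgs_deep

for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    wide_ids, deep_ids = mix(frac)
    print(f"{frac:.2f} -> {wide_ids} IDs @ 72 imgs + {deep_ids} IDs @ 5 imgs")
```

The endpoints correspond to using only one subset; the paper finds intermediate mixes outperform both.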
Comparison with SOTA (SynFace)
We compare performance with SynFace, the state-of-the-art face recognition model trained on synthesized face images. SynFace uses DiscoFaceGAN to synthesize 500,000 face images of 10,000 people (you can read more about it in this article).
The table below (rows 1-3) shows the results: the face recognition model trained on the face images synthesized in this paper (Ours) significantly outperforms SynFace on all datasets. Note that Avg† is the average over LFW, CFP-FP, and CPLFW.
In addition, performance is evaluated with 40,000 real face images added. This is a realistic number that can be collected with the subjects' consent while avoiding label noise and data bias. Rows 4-6 of the table show the results: this setting also significantly outperforms SynFace on all datasets, and comparison with rows 1-3 shows that including real face images improves accuracy. Here, the model is pre-trained on synthetic face images and then fine-tuned on real face images.
Particularly large gains are seen on AgeDB and CALFW, and the paper suggests that these two benchmarks have especially large domain gaps from the synthetic data: the synthesized face images do not capture changes over time well, and the dataset will need to be updated in the future to account for this. The following figure shows how accuracy varies with the number of real face images added, for four training settings: only synthetic face images (black dashed line, Train on SX), only a small number of real face images (red line, Train on Real), a mixture of synthetic and real face images (blue line, Dataset Mixing), and pre-training on synthetic face images followed by fine-tuning on real face images (black solid line, Train on SX & Finetune on Real).
The number of identities in the real face images is varied from 200 to 2,000, with 20 images sampled per identity. The figure shows that pre-training the network on synthetic face images and then fine-tuning on real ones is the most accurate approach on all datasets. When only a small number of real face images is available due to privacy concerns, accuracy can be improved substantially by using the synthetic dataset presented in this paper.
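The pretrain-then-finetune recipe can be illustrated on a toy problem. This is not the paper's network: we use a least-squares "model" pretrained on plentiful synthetic-like data, then fine-tuned with a few gradient steps on a small real-like set whose distribution is slightly shifted, mimicking the domain gap.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n: int, shift: float):
    """Linear-regression data; `shift` moves the true weights to
    mimic a domain gap between synthetic-like and real-like data."""
    X = rng.normal(size=(n, 8))
    w_true = np.arange(1.0, 9.0) + shift
    y = X @ w_true + rng.normal(scale=0.1, size=n)
    return X, y

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# "Pretrain": closed-form least squares on plentiful synthetic-like data.
Xs, ys = make_data(5000, shift=0.0)
w = np.linalg.lstsq(Xs, ys, rcond=None)[0]

# "Finetune": a few gradient steps on scarce, shifted real-like data.
Xr, yr = make_data(40, shift=0.5)
before = mse(w, Xr, yr)
for _ in range(200):
    grad = 2 * Xr.T @ (Xr @ w - yr) / len(yr)
    w -= 0.01 * grad
after = mse(w, Xr, yr)
print(after < before)  # fine-tuning reduces error on the target domain
```

The pretrained weights give a sensible starting point, and a small amount of target-domain data closes the remaining gap, which is the pattern the paper's Train on SX & Finetune on Real curve reflects.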
In this paper, a graphics pipeline is used to build a large dataset of synthetic face images. It achieves much better accuracy than SynFace, the previous SOTA face recognition model also trained on synthetic face images. Note that SynFace relies on DiscoFaceGAN, which is itself trained on many real face images; it therefore still depends on data subject to the privacy violations, lack of consent, label noise, and data bias pointed out in traditional large datasets.
On the other hand, the large dataset proposed in this paper is built from 3D scans of 511 individuals obtained with their consent, so it does not inherit the problems of conventional large-scale datasets. In addition, although not discussed in detail in the paper, attribute information such as facial expression can be controlled during synthesis, making it possible to construct datasets of higher quality than before. We expect that datasets built on this kind of synthetic data will enable the development of high-performance face recognition models that are privacy-conscious, safe, and secure. The large-scale dataset published in this paper can be found here.