Fake It Till You Make It: Facial Analysis Using Only Synthetic Facial Images!
3 main points
✔️ Minimizes the domain gap with real face data by synthesizing highly detailed face images
✔️ Builds face-analysis models using only synthetic face data, matching SOTA accuracy
✔️ Releases the dataset of synthetic face images constructed in this paper
Fake It Till You Make It: Face analysis in the wild using synthetic data alone
written by Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt, Sebastian Dziadzio, Matthew Johnson, Virginia Estellers, Thomas J. Cashman, Jamie Shotton
(Submitted on 30 Sep 2021 (v1), last revised 5 Oct 2021 (this version, v2))
Comments: ICCV 2021.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Several problems have been identified with the datasets used in face recognition. Most large published datasets consist of face images collected from the Internet, for example by searching for an actor's name and collecting the face images that appear in the results. This approach often yields incorrect matches, provides no labels beyond names, and is biased toward certain races, genders, and ages, so considerable effort is needed after collection to check and correct labels and balance the data. In addition, most of the collected face images are gathered without the consent of the individuals pictured. Label noise and privacy are therefore seen as problems, and the use of such datasets is beginning to be restricted in some regions. To further improve face recognition technology, a larger and higher-quality dataset of face images is essential. However, it is not practical for humans to obtain consent from every subject, assign accurate labels, and balance the data by hand.
This is why large datasets of synthetic face images have attracted much attention. Several efforts have been reported so far. However, it is well-known that face recognition models based on synthetic face images are less accurate than those based on real face images due to the domain gap between synthetic face images and real face images.
Until now, it has been widely believed that synthetic face images cannot completely replace real ones. Methods for bridging the domain gap have therefore relied on real face images, for example by mixing some real face images in with the synthetic ones.
The paper presented here, however, attempts to minimize the domain gap and build highly accurate face-analysis models using only synthetic face images, without any real ones. If even a few real face images are required, label noise and privacy issues arise. If face-analysis models can be built entirely from synthetic face images, as this paper attempts, future face-related research will become much easier to conduct, which would greatly benefit the field. This is a very promising direction.
The generation process of synthetic face data
The face data generation process consists of six steps as shown in the figure below. First, the base face data (Template face) is prepared, and then Identity, Expression, Texture, Hair, Clothes, and Environment are randomly applied to it.
The Template face is based on previous studies (Gerig et al., 2017; Li et al., 2017). Next, the Identity data uses 3D scans of 511 people obtained from Infinite Realities and Ten24. The acquired raw data is manually cleaned of noise and hair (Clean), as shown in the figure below. Not only Identity but also Texture is obtained from this data.
In addition, 3D scan data is collected from as wide a range of ages, genders, and races as possible, as shown in the histogram below, to capture the diversity of the real world.
The following is a sample of face data with Identity applied to the Template face.
Next, we have Expression. 2D face images annotated with facial landmarks are used. Since this annotation data alone does not provide enough variety of facial expressions, animations of expression changes are created and diverse expressions are extracted from the animation sequences.
Next is Texture. Since this paper aims to synthesize face images that can replace real ones, the textures are designed to look exactly like real skin even when zoomed in. As mentioned above, 200 sets of high-resolution (8192 x 8192 pixel) textures are collected from the 3D scan data, from which noise and hair are removed (Clean).
For each 3D scan, an Albedo map for skin color and Displacement maps at two granularities (Coarse and Meso) are extracted. As shown in the figure below, the Coarse displacement (+coarse disp.) captures larger irregularities, while the Meso displacement (+meso disp.) captures finer details such as pores. On top of this, makeup effects such as eye shadow, eyeliner, and mascara are applied as needed.
Next is Hair: 512 hairstyles, 162 eyebrows, 142 beards, and 42 sets of eyelashes are provided and applied in random combinations. The figure below shows a sample. Note that each asset was created by an artist who specializes in digital hair.
Finally, Clothes are applied, drawn from 30 different garments. As shown in the figure below, these include formal, casual, and athletic clothing. In addition to Clothes, headwear (36 types), facewear (7 types), and eyewear (11 types) are provided, including helmets, headscarves, face masks, and glasses. Clothing is fitted snugly to the body with appropriate deformation techniques, while glasses are placed on the temples and the bridge of the nose.
The following figure shows sample face images synthesized by randomly applying each of the components introduced so far (Identity, Expression, Texture, Hair, Clothes, and Environment).
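As a rough illustration, the randomized composition described above could be sketched as follows. The asset counts come from the paper, but the function names, data structures, and sampling probabilities are assumptions for illustration only; the authors' actual pipeline composes 3D assets in a renderer, not a dictionary.

```python
import random

# Asset counts reported in the paper; everything else here is illustrative.
ASSETS = {
    "identity": 511,   # 3D face scans
    "texture": 200,    # 8192 x 8192 texture sets
    "hair": 512,
    "eyebrows": 162,
    "beard": 142,
    "eyelashes": 42,
    "clothes": 30,
    "headwear": 36,
    "facewear": 7,
    "eyewear": 11,
}

# Accessories that are only applied to some faces (assumed behavior).
OPTIONAL = ("beard", "headwear", "facewear", "eyewear")

def sample_face_config(rng: random.Random) -> dict:
    """Randomly pick one asset index per component for one synthetic face."""
    config = {name: rng.randrange(count) for name, count in ASSETS.items()}
    for name in OPTIONAL:
        # Drop the accessory about half the time (probability is a guess).
        if rng.random() < 0.5:
            config[name] = None
    return config

# Each call yields one randomized "recipe" that a renderer could realize.
cfg = sample_face_config(random.Random(0))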
Very elaborate face images are synthesized. In the next sections, we evaluate whether these lifelike face images can replace real ones in two popular face-analysis tasks: face parsing and facial landmark detection.
Face Parsing with Synthetic Data
Face Parsing is the task of labeling a face image pixel by pixel. Each pixel is labeled as eye, nose, mouth, etc., and predictions are evaluated by their agreement with the ground-truth labels. The model trained on the synthetic face dataset is evaluated on two well-known benchmarks, Helen and LaPa.
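For reference, the per-class F1 score used in these benchmarks can be sketched as below. This is a generic segmentation F1 computation, not the official Helen/LaPa evaluation scripts, which may differ in details such as region merging and face cropping.

```python
import numpy as np

def parsing_f1(pred: np.ndarray, gt: np.ndarray, label: int) -> float:
    """Pixel-wise F1 score for one semantic class (e.g. 'nose').

    pred, gt: integer label maps of the same shape.
    """
    tp = np.sum((pred == label) & (gt == label))  # true positives
    fp = np.sum((pred == label) & (gt != label))  # false positives
    fn = np.sum((pred != label) & (gt == label))  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

An "Overall" score such as the one reported for Helen would then aggregate the relevant classes (nose, eyebrows, eyes, mouth) rather than average per-class F1 values.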
The table below shows the results (F1 score) of our evaluation against the traditional methods using the Helen dataset. Overall is calculated for the nose, eyebrows, eyes, and mouth together. As can be seen from the table, the results show that Ours (synthetic) is comparable to the conventional SOTA.
The table below shows the evaluation results (F1 score) on the LaPa dataset compared to conventional methods, where L-Eye and R-Eye denote the left and right eyes, and L-Brow and R-Brow the left and right eyebrows. U-Lip, I-Mouth, and L-Lip denote the upper lip, the inner mouth (between the lips), and the lower lip. Here too, the table shows that Ours (synthetic) is comparable to the conventional SOTA.
SOTA and Ours (real) are trained on real face images and tested on real face images, while Ours (synthetic) is trained only on synthetic face images and tested on real face images.
Facial Landmark Detection with Synthetic Data
Face Landmark is a task to take the positions (feature points) of the eyes, nose, mouth, and contour of a face. The table below shows the results of the evaluation of Face Landmark compared to the conventional method using300W dataset(NME=Normalized Mean Error, FR=Failure Rate). The lower the value, the higher the performance. 68 points are taken and evaluated.
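The two metrics can be sketched as follows. Normalizing by the inter-ocular distance (outer eye corners, indices 36 and 45 in the standard 68-point scheme) and a 10% failure threshold are common conventions on 300W; the paper's exact protocol may differ, so treat this as an illustrative sketch.

```python
import numpy as np

def nme(pred: np.ndarray, gt: np.ndarray) -> float:
    """Normalized Mean Error over 68 landmarks for one face.

    pred, gt: (68, 2) arrays of (x, y) landmark coordinates.
    Normalization uses the inter-ocular distance between the
    outer eye corners (points 36 and 45).
    """
    inter_ocular = np.linalg.norm(gt[36] - gt[45])
    per_point = np.linalg.norm(pred - gt, axis=1)  # Euclidean error per point
    return float(per_point.mean() / inter_ocular)

def failure_rate(nmes, threshold: float = 0.10) -> float:
    """Fraction of faces whose NME exceeds the threshold (often 10%)."""
    nmes = np.asarray(nmes, dtype=float)
    return float((nmes > threshold).mean())
```

A perfect prediction gives an NME of 0, and the failure rate simply counts how often a face's NME exceeds the chosen threshold.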
Here, too, we can see from the table that Ours (synthetic) achieves the same level of results as the conventional SOTA. The lower figure shows the prediction results of the network trained on a real face image (top) and a synthetic face image (bottom).
In addition to this, the paper also presents an example of how a face data generation model can be used to easily generate high-density landmark and eye-tracking data that is difficult to obtain with real face data. The ability to generate data that looks exactly like real face images will make it easier to add and generate data that is difficult to collect and label manually.
This paper shows that highly detailed face images can be generated to reduce the domain gap with real face images and to replace them. Although there have been previous attempts to substitute synthetic face images, they have all incorporated some amount of real images. In this study, by contrast, face parsing and facial landmark detection performance comparable to SOTA is achieved using only synthetic face images. This is an important result that shows the potential of using synthetic face images instead of real ones for many other face-related tasks, and easier access to face images could further improve the accuracy of such tasks. Note, however, that validation was limited to face images from the neck up, wrinkles associated with changes in facial expression were not represented, and the validation conditions were more restricted than those for real face images. The dataset is now available; if you are interested, you can check it out here.