What Is FaceController, An Efficient Face Editing Model?
3 main points
✔️ Proposes a simple feed-forward face generation network instead of time-consuming and labor-intensive Reverse Mapping
✔️ By extracting easily obtainable independent attribute information, we can generate high-fidelity face images with only specific attribute information changed.
✔️ Taking Face Swapping as an example, it achieves performance equal to or better than conventional models
FaceController: Controllable Attribute Editing for Face in the Wild
written by Zhiliang Xu, Xiyu Yu, Zhibin Hong, Zhen Zhu, Junyu Han, Jingtuo Liu, Errui Ding, Xiang Bai
(Submitted on 23 Feb 2021)
Comments: Accepted at AAAI 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Editing face images is widely used in the fields of visual effects and e-commerce.
For the face editing task, it is important to have accurate and independently separated face attribute information so that specific attributes can be manipulated accurately. If the accuracy of this attribute information is high, it is possible to change the orientation and expression of the face while maintaining its identity by editing specific attribute information. However, it is still difficult to extract such accurate and independently separated face attribute information.
Various approaches have been studied; one of them, GAN Inversion, maps a face image back into a generator's latent space, but this reverse mapping is time-consuming and labor-intensive. Therefore, in this paper, we propose a simple feed-forward face generation network, FaceController, which is supplied with existing, easily obtainable prior information.
This method avoids the costly learning process of extracting the attribute information of independently separated faces. In addition, the method achieves higher performance in many metrics compared to traditional models and shows superior results in qualitative and visual comparisons.
Architecture of FaceController
The process in FaceController consists of three steps. The first is the extraction of face attribute information, the second is the exchange of face attribute information between the source image and the target image, and the third is the generation of the target image with exchanged face attribute information. The specific architecture is shown in the figure below. This figure is an example of Face Swapping.
First, a 3D Morphable Model (3DMM) is used to clearly separate and extract face attribute information from the source image (Is) and the target image (It). The 3DMM is a widely used method for this purpose: it decomposes face attribute information into shape (S) and texture (T).
This S and T are represented by the PCA basis vectors Ibase, Ebase, and Tbase and their coefficients α, ρ, and δ, as in S = S̄ + Ibase·α + Ebase·ρ and T = T̄ + Tbase·δ, where S̄ and T̄ are the mean shape and texture. Ibase, Ebase, and Tbase represent the ID, expression, and texture bases, respectively. In addition, the lighting condition κ and the pose θ are also defined.
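The 3DMM decomposition above is a linear model. Below is a minimal sketch with toy dimensions; the real bases come from PCA over 3D face scans, and all names and sizes here are illustrative, not the paper's actual model:

```python
import numpy as np

# Toy dimensions: N vertices and small PCA bases (real 3DMMs use on the
# order of tens of components for ID, expression, and texture).
N, n_id, n_exp, n_tex = 5, 3, 2, 3
rng = np.random.default_rng(0)

S_mean = rng.normal(size=(3 * N,))          # mean face shape
T_mean = rng.normal(size=(3 * N,))          # mean face texture
I_base = rng.normal(size=(3 * N, n_id))     # identity (ID) basis
E_base = rng.normal(size=(3 * N, n_exp))    # expression basis
T_base = rng.normal(size=(3 * N, n_tex))    # texture basis

alpha = rng.normal(size=(n_id,))   # ID coefficients (alpha)
rho   = rng.normal(size=(n_exp,))  # expression coefficients (rho)
delta = rng.normal(size=(n_tex,))  # texture coefficients (delta)

# Shape and texture are linear combinations of the bases around the mean.
S = S_mean + I_base @ alpha + E_base @ rho
T = T_mean + T_base @ delta

# Attribute exchange at the 3DMM level: keep alpha from the source face
# and take the expression coefficients from the target face.
rho_t = rng.normal(size=(n_exp,))
S_swapped = S_mean + I_base @ alpha + E_base @ rho_t
print(S.shape, S_swapped.shape)
```

Because the model is linear and the factors are separate coefficient vectors, swapping one attribute (here the expression ρ) leaves the others untouched, which is exactly the independence the method relies on.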
For the source image (Is) and the target image (It), the extracted attribute information, shown in the left part of the figure, is denoted as (αs, ρs, κs, δs, θs) and (αt, ρt, κt, δt, θt), respectively.
However, we found that the IDs and textures extracted by the 3DMM are not sufficient to generate realistic face images of visually convincing quality. We believe this is due to the domain gap between the face rendered from the 3DMM and the corresponding real image; for example, a woman's makeup cannot be fully represented in the 3DMM. For this reason, as shown in the middle of the figure, we introduce two Encoders: an Identity Encoder to complement the ID and a Style Encoder to complement the texture.
The Identity Encoder uses a state-of-the-art pre-trained face recognition model ( Deng et al. 2019a ). The Feature Map just before the last FC layer is used to obtain accurate and high-level identity information. A Spatial Transformation Network ( Jaderberg et al. 2015 ) is applied to input face images to the ID Encoder for accurate positioning.
To support editing of regions of local detail, Style Encoder uses semantic segmentation of face images to obtain a style code for each region. By editing each local region, for example, in the case of makeup, lip color and eye shadow makeup can be edited. You will also be able to adjust the lighting of a specific face image. We use SEAN's Encoder ( Zhu et al. 2020b ) for the Style Encoder in the local region.
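The region-wise style codes can be sketched as average-pooling encoder features under each semantic mask, a simplification of the regional average pooling used in SEAN. The function name and shapes below are illustrative:

```python
import numpy as np

def region_style_codes(features, seg, num_regions):
    """Average-pool a C x H x W feature map inside each semantic region.

    features: (C, H, W) encoder features of the face image
    seg:      (H, W) integer semantic labels (e.g. skin, lips, eyes, hair)
    Returns a (num_regions, C) matrix of per-region style codes.
    """
    C = features.shape[0]
    codes = np.zeros((num_regions, C))
    for r in range(num_regions):
        mask = (seg == r)
        if mask.any():
            # All pixels of region r collapse into one C-dim style vector.
            codes[r] = features[:, mask].mean(axis=1)
    return codes

# Toy example: 4-channel features, 2 regions (say skin=0, lips=1).
feat = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)
seg = np.array([[0, 0], [1, 1]])
codes = region_style_codes(feat, seg, num_regions=2)
print(codes.shape)  # (2, 4)
```

Editing a local attribute such as lip color then amounts to replacing the single row of `codes` belonging to the lip region before decoding, which is why region-level control falls out of this representation naturally.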
As described above, 3DMM, ID Encoder, and Style Encoder are used to separate and extract accurate and detailed attribute information.
The next step is to generate face images with high fidelity by using the feature information obtained so far. Our goal here is to transform/generate natural face images in a specific style for each local region with semantic labels. Therefore, we build a method by applying SPADE, which is good at supporting editing for each local region. Moreover, we need to consider additional attribute information that is not related to the local region information, such as ID obtained earlier.
To support this, we have designed Identity-Style Normalization that integrates ID and style information of each local region into Decoder as shown in the right figure below. By incorporating this information into the IS Block, we can generate images with high fidelity.
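The denormalization idea behind the IS block can be sketched as follows: activations are instance-normalized, then re-modulated with spatially varying scale and shift maps predicted from the region style codes and the ID code. This is a simplified SPADE-style sketch, not the paper's exact block; the maps are taken as given inputs here:

```python
import numpy as np

def identity_style_norm(x, gamma_map, beta_map, eps=1e-5):
    """SPADE-style denormalization (sketch of the IS block idea).

    x:         (C, H, W) decoder activations
    gamma_map: (C, H, W) scale, assumed predicted from style + ID codes
    beta_map:  (C, H, W) shift, assumed predicted likewise
    """
    # Instance-normalize each channel, then re-inject style spatially so
    # that each semantic region receives its own modulation.
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mean) / (std + eps)
    return gamma_map * x_norm + beta_map

x = np.random.default_rng(1).normal(size=(8, 4, 4))
out = identity_style_norm(x, np.ones_like(x), np.zeros_like(x))
print(out.shape)  # (8, 4, 4)
```

Because γ and β vary per pixel, lip pixels and skin pixels can be pushed toward different styles within the same layer, while a globally broadcast ID component influences every region.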
The next step is to train the model. During training, we mainly consider two different learning processes: face reconstruction and unsupervised face generation.
During the learning of face reconstruction, the model tries to reconstruct the face by retrieving attribute information from the same face image. In this case, the source image (Is) and the target image (It) are equal in the diagram of the architecture. This learning ensures the validity of the image generated when Is = It. However, if Is ≠ It there is no guarantee that the model will work properly.
Therefore, we also include training for unsupervised face generation: for unpaired input (Is ≠ It), the model must generate a plausible face image from the acquired feature information. To increase image fidelity, a GAN Loss is applied to both the face reconstruction and unsupervised face generation training processes. The overall loss is defined as a weighted sum of an adversarial loss (Ladv), a perceptual loss (Lper), an identity loss (Lid), a landmark loss (Llm), and a histogram matching loss (Lhm).
Ladv represents the GAN Loss, and Lper represents the Perceptual Loss applied as the Face Reconstruction Loss: feature maps of the generated face image (Ig) and the target image (It) are extracted from a pre-trained VGG and compared, along with a pixel-level reconstruction term.
The remaining Lid, Llm, and Lhm represent the Identity Loss, the Face Landmark Loss, and the Histogram Matching Loss, respectively, which are designed to support unsupervised learning and enhance the separate extraction of attribute information. The purpose of this model is to support free and dynamic control over face attributes. If the goal is to transfer other attributes such as facial expression and orientation from It to Is while preserving the ID, the ID Encoder can be used to encourage the generated face image to keep the same ID as Is; cosine similarity is used to estimate the similarity between the generated image and the source image.
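The cosine-similarity identity constraint can be written as a small sketch, where the embeddings are assumed to come from the pre-trained face recognition model:

```python
import numpy as np

def identity_loss(z_g, z_s):
    """Identity Loss (Lid) sketch: 1 - cosine similarity between the ID
    embedding of the generated image (z_g) and that of the source image
    (z_s). Identical identities give a loss near 0."""
    cos = np.dot(z_g, z_s) / (np.linalg.norm(z_g) * np.linalg.norm(z_s))
    return 1.0 - cos

z = np.array([1.0, 2.0, 3.0])
print(identity_loss(z, z))   # identical embeddings -> ~0.0
print(identity_loss(z, -z))  # opposite embeddings  -> ~2.0
```

Minimizing this loss pulls the generated face's embedding toward the source face's embedding, which is exactly the "keep the same ID as Is" constraint described above.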
Conversely, when transferring attributes such as ID and texture from Is to It while preserving the facial expression and orientation, the Landmark Loss can be used between the target image (It) and the generated face image (Ig) to ensure consistency of expression and face orientation.
In this paper, Landmark Loss is also designed specifically for ID retention. Since we want the generated image (Ig) to retain the same expression and orientation as the target image (It), we make sure that Ig and It have the same landmarks. The facial landmarks include the shape of the eyes, mouth, eyebrows, and nose. These are also related to the ID information.
As shown in figures (a) and (b) below, different people have different facial features and landmarks. Therefore, as shown in figure (c) below, constraining the generated image using It landmarks is not a good way to retain Is ID information; we need to adjust the landmarks to have the same expression and orientation as It, while retaining the same ID information as Is.
To solve this problem, 3D landmarks are extracted from the edited 3DMM, as in the first figure. These landmarks, shown in figure (d) above, have the same expression and orientation as It while keeping exactly the same facial features as Is.
We then match Ig's 3D landmarks with these aligned landmarks to better preserve the ID information. The ID Loss (Lid) can preserve the correct ID, but it still has difficulty preserving texture and color consistency in local regions. To address this, let Ire = HM(It, Ig) be the remapped image obtained by histogram matching between Ig and It; the Histogram Matching Loss (Lhm) then penalizes the difference between Ig and Ire.
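Classic histogram matching, and the resulting loss, can be sketched as below. This assumes an L1 penalty between Ig and the remapped image Ire, which matches the description above but is an assumption about the exact form; a real implementation would also match per color channel:

```python
import numpy as np

def histogram_match(src, ref):
    """Remap src's pixel values so their distribution matches ref's
    (classic histogram matching via a sorted-value lookup)."""
    shape = src.shape
    src_flat, ref_flat = src.ravel(), ref.ravel()
    order = np.argsort(src_flat)
    matched = np.empty_like(src_flat)
    # Assign the k-th smallest src pixel the k-th smallest ref value.
    matched[order] = np.sort(ref_flat)[
        np.linspace(0, ref_flat.size - 1, src_flat.size).astype(int)]
    return matched.reshape(shape)

def histogram_matching_loss(I_g, I_t):
    """Lhm sketch: L1 distance between I_g and I_re = HM(I_t, I_g)."""
    I_re = histogram_match(I_g, I_t)
    return np.abs(I_g - I_re).mean()

rng = np.random.default_rng(2)
I_g = rng.uniform(size=(8, 8))
I_t = rng.uniform(size=(8, 8))
print(histogram_matching_loss(I_g, I_t))
print(histogram_matching_loss(I_g, I_g))  # matching to itself -> 0.0
```

Because Ire keeps Ig's spatial structure but It's color distribution, this loss nudges the generated face's colors toward the target without constraining its geometry.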
FaceController can be applied to a variety of tasks, but here we present results for Face Swapping. We compare FaceController trained on CelebA-HQ, FFHQ, and VGGFace with typical Face Swapping models: DeepFake, FaceSwap, FSGAN, and FaceShifter.
We use FaceForensics++ to evaluate the results. The figure below shows the qualitative results. It can be seen that we are able to generate images that are equal or better than conventional models.
In models that include a blending process for face images, such as DeepFake and FSGAN, we can see that there are traces of the replaced parts that are visible to the human eye. On the other hand, models that do not include the blending process, such as FaceController and FaceShifter, produce realistic face images with less visible traces of the replaced parts.
In addition, when compared to FaceShifter, FaceController supplements the ID and detailed texture information, so the ID information is more clearly reflected and the composite result looks more natural. The table below shows the quantitative results. Again, the results are evaluated with FaceForensics++ and compared to DeepFake, FaceSwap, FSGAN, and FaceShifter.
Here, we use the evaluation metrics used in FaceShifter. We acquire 10 frames from each of 1,000 videos, for a total of 10,000 face images, and evaluate three performance measures.
The first is ID Retrieval (Retr.). This evaluates whether the ID information of the source face is retained after editing the face image. Next is Pose. This evaluates whether the orientation of the face is preserved after editing the face image. The last one is Expression (Exp.). This evaluates whether the expression of the edited target image is retained. It also uses the FID value. This evaluates the fidelity of the edited face.
In the evaluation of Retr., CosFace is applied to extract ID information, and the face with the closest cosine similarity is selected. In the evaluation of Pose and Exp., the orientation of the face is estimated and the expression information is extracted by the expression recognition model, respectively, and the similarity is evaluated by L2 distance.
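The two kinds of metrics can be sketched as follows, assuming embeddings of the kind CosFace or a pose/expression estimator would produce (toy vectors here, not real model outputs):

```python
import numpy as np

def id_retrieval(query_emb, gallery_embs):
    """ID Retrieval sketch: return the index of the gallery face whose
    embedding has the highest cosine similarity to the query (swapped)
    face's embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return int(np.argmax(g @ q))

def l2_distance(a, b):
    """Pose / Expression metric sketch: L2 distance between the estimated
    pose or expression vectors of the generated and target images."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Toy gallery of three source identities.
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(id_retrieval(np.array([0.9, 0.1]), gallery))  # -> 0
print(l2_distance([0.0, 1.0], [3.0, 5.0]))          # -> 5.0
```

A swapped face "passes" the retrieval test when the nearest gallery embedding belongs to the true source identity, while smaller L2 distances indicate better pose and expression preservation.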
The results show that FaceController has the best performance in terms of ID retention and fidelity of generated images compared to existing models. This can be attributed to the good performance of the ID Encoder and Style Encoder. In this paper, to verify the effect of ID Encoder and Style Encoder, we also evaluate the performance when only ID Encoder is introduced and when only Style Encoder is introduced.
The qualitative results are shown in the figure below. When only the ID Encoder is introduced (w/o style), the generated style does not match the target image (Target) in areas such as the lips, as shown in the fourth row of images. This indicates that 3DMM alone does not provide detailed textures for generating faces with high fidelity. In addition, when only Style Encoder is introduced, we can see that the generated image has a difference in ID information from the source image (Source), as shown in the third row.
The quantitative results, shown in the table below, are consistent with the qualitative results.
When only the Style Encoder is introduced, the ID Retrieval score drops greatly. When only the ID Encoder is introduced, ID retention is hardly affected by the disentanglement, and we can also see that the orientation and expression of the face are unaffected.
In this paper, we propose FaceController, a feed-forward face editing model that is more efficient than previous models and successfully generates very high-fidelity face images with only specific attribute information edited. The model disentangles facial attribute information and introduces unsupervised losses to ensure control over various facial attributes. FaceController can be widely applied in face-related applications. On the other hand, further improvements may be needed, since a significant change in face orientation can cause black spots at the edges of the generated image or misalignment of the gaze.