EMOCA Is Now Capable Of Generating More Expressive 3D Face Models From Input Images!

3D 24/08/2022

3 main points
✔️ Proposed EMOCA (Emotion Capture and Animation), a model for generating more expressive 3D face models from face images
✔️ Introduced a new loss function, emotion consistency loss, to accurately recover facial expressions from face images
✔️ Obtained performance comparable to state-of-the-art image-based methods on the emotion recognition task

EMOCA: Emotion Driven Monocular Face Capture and Animation
written by Radek Danecek, Michael J. Black, Timo Bolkart
(Submitted on 24 Apr 2022)
Comments: Conference on Computer Vision and Pattern Recognition (CVPR) 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

As 3D facial avatars are widely used for communication in today's world of animation, games, and VR, it is becoming increasingly important to accurately convey emotions.

However, existing methods for regressing 3D face models from face images fail to capture detailed emotional information in the images, and the generated 3D face models lack emotional expression.

The authors of this paper found that the standard reconstruction metrics used for training (landmark reprojection error, photometric error, and face recognition loss) are insufficient to capture emotion, resulting in a 3D face shape that does not match the emotion in the input image.

The EMOCA (Emotion Capture and Animation) model presented in this paper addresses this problem by introducing a new loss function, the emotion consistency loss, which measures during training the difference in emotional content between the reconstructed 3D face model and the input image. In addition, the estimated 3D face parameters can be used to classify facial expressions, with performance on the emotion recognition task comparable to state-of-the-art image-based methods. Let's take a look at each of these features.

EMOCA: EMOtion Capture and Animation

EMOCA is inspired by the emotion recognition task for facial images, which has made significant progress to date. A state-of-the-art emotion recognition model is trained first and is then used as a teacher when training EMOCA.

Specifically, it learns to convey emotional information to the 3D face model by optimizing the emotion consistency loss described above to match the emotional expression between the input image and the reconstructed 3D face model.

EMOCA is built on top of DECA, a 3D face reconstruction framework that achieves the highest identity shape reconstruction accuracy among existing methods. By adding a branch that can learn facial expressions to the DECA architecture and keeping the other parts fixed, it is possible to train only the facial expression part of EMOCA with emotion-rich image data while maintaining the quality of DECA's facial shape. The structure of EMOCA is shown in the figure below.

The training of this model is divided into two stages: the COARSE STAGE (green box in the figure) and the DETAIL STAGE (yellow box in the figure).

In COARSE STAGE, the input image is passed to an initialized and fixed Coarse shape encoder from DECA and a trainable Expression Encoder from EMOCA.

Then, a textured 3D mesh is reconstructed from the regressed identity, expression, pose, and albedo parameters, using the FLAME shape model and albedo model as decoders. At this point, the emotion consistency loss penalizes the difference between the emotion features of the input image and those of the rendered image.

Finally, in DETAIL STAGE, we fix the Expression Encoder of EMOCA and use the regressed expression parameters as conditions for the Detail Decoder.
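The two-stage schedule described above can be summarized as a small table of which modules are updated in each stage. This is only a sketch in plain Python: the module names are illustrative labels rather than identifiers from the EMOCA code base, and treating the Detail Decoder as the trainable part of the DETAIL STAGE is an assumption, since the article only says that the Expression Encoder is fixed there.

```python
from typing import Dict, List

# Which modules receive gradient updates in each stage, following the text above.
# Module names are illustrative labels, not identifiers from the EMOCA code base.
STAGES: Dict[str, Dict[str, bool]] = {
    "coarse": {
        "coarse_shape_encoder": False,  # initialized from DECA and kept fixed
        "expression_encoder": True,     # EMOCA's added, trainable branch
    },
    "detail": {
        "coarse_shape_encoder": False,
        "expression_encoder": False,    # fixed; its output conditions the detail decoder
        "detail_decoder": True,         # assumption: this is what is trained here
    },
}


def trainable_modules(stage: str) -> List[str]:
    """Return the sorted names of the modules that are updated in the given stage."""
    return sorted(name for name, trainable in STAGES[stage].items() if trainable)


print(trainable_modules("coarse"))  # ['expression_encoder']
print(trainable_modules("detail"))  # ['detail_decoder']
```

In a deep learning framework, the same table would typically decide which parameters are handed to the optimizer in each stage.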

With this structure, the 3D face models that EMOCA generates from a single image significantly outperform existing state-of-the-art methods in the quality of the reconstructed facial expressions while maintaining state-of-the-art accuracy in identity shape reconstruction. In addition, the reconstructed 3D face models can be easily animated.

Emotion consistency loss

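In this model, we optimize the loss function expressed by the following equation.

$$L = \lambda_{emo} L_{emo} + \lambda_{pho} L_{pho} + \lambda_{eye} L_{eye} + \lambda_{mc} L_{mc} + \lambda_{lc} L_{lc} + \lambda_{\psi} L_{\psi}$$

In this equation, $L_{emo}$ is the emotion consistency loss, $L_{pho}$ the photometric loss, $L_{eye}$ the eye closure loss, $L_{mc}$ the mouth closure loss, $L_{lc}$ the lip corner loss, and $L_{\psi}$ the expression regularizer, each weighted by a coefficient $\lambda_x$.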
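As a rough illustration of how the weighted terms combine, the sketch below sums named loss values with their coefficients. The function name, the dictionary keys, and all numeric values are placeholders chosen for this example; they are not the weights or losses reported in the paper.

```python
from typing import Dict


def total_loss(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the named loss terms: sum over x of lambda_x * L_x."""
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items())


# Placeholder loss values for one training step (not numbers from the paper).
terms = {"emo": 0.5, "pho": 0.25, "eye": 0.125, "mc": 0.125, "lc": 0.25, "psi": 1.0}
# Placeholder coefficients lambda_x (also not from the paper).
weights = {"emo": 1.0, "pho": 2.0, "eye": 1.0, "mc": 0.5, "lc": 0.5, "psi": 0.001}

print(total_loss(terms, weights))
```

Keeping the terms in a dictionary makes it easy to set one coefficient to zero and see how a single term affects training.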

The emotion consistency loss is a novel loss function proposed in this paper, which computes the difference between the emotion features of the input image $\epsilon_I$ and those of the rendered image $\epsilon_{Re}$ as follows.

$$L_{emo} = \lVert \epsilon_I - \epsilon_{Re} \rVert_2$$

Optimizing this loss during training allows the reconstructed 3D face model to convey the emotional information of the input image.
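A minimal sketch of this loss in plain Python is shown below, taking the distance between the two feature vectors to be Euclidean. The emotion network is passed in as a callable; `toy_emotion_net` is a made-up stand-in that returns two simple image statistics so the example runs, and it has nothing to do with the emotion recognition model actually used for EMOCA.

```python
import math
from typing import Callable, Sequence


def emotion_consistency_loss(
    emotion_net: Callable[[object], Sequence[float]],
    input_image: object,
    rendered_image: object,
) -> float:
    """Euclidean distance between the emotion features of the two images."""
    eps_input = emotion_net(input_image)
    eps_rendered = emotion_net(rendered_image)
    if len(eps_input) != len(eps_rendered):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(eps_input, eps_rendered)))


def toy_emotion_net(image):
    """Hypothetical stand-in: mean and range of the pixel intensities."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels), max(pixels) - min(pixels)]


input_image = [[0.0, 0.0], [0.0, 0.0]]
rendered_image = [[1.0, 5.0], [1.0, 5.0]]
print(emotion_consistency_loss(toy_emotion_net, input_image, rendered_image))  # 5.0
```

During training, the same network is applied to both images, so the loss goes to zero only when the rendered face produces the same emotion features as the input.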

Experiments

In this paper, the first stage (COARSE STAGE) of EMOCA is trained on AffectNet using the Adam optimizer with a learning rate of 5e-5 for up to 20 epochs, and the second stage (DETAIL STAGE) uses the same settings as DECA. Quantitative and qualitative evaluations were then conducted.

Quantitative evaluation

In this validation, we evaluated the emotion recognition accuracy of EMOCA on the AffectNet and AFEW-VA test datasets and compared it with existing methods.

For each method, we compared the concordance correlation coefficient (CCC), Pearson correlation coefficient (PCC), root mean squared error (RMSE), and sign agreement (SAGR) scores for valence (V), arousal (A), and expression classification (E), as defined in existing studies.
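For readers unfamiliar with these metrics, the sketch below computes the four scores for a list of predicted and ground-truth values such as valence or arousal. It uses the standard textbook definitions with population (divide-by-n) statistics; the function names are chosen for this example, and it is not the evaluation code used in the paper.

```python
import math
from typing import Sequence


def _mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def _cov(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Population covariance; _cov(xs, xs) is the population variance."""
    mx, my = _mean(xs), _mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pcc(pred: Sequence[float], true: Sequence[float]) -> float:
    """Pearson correlation coefficient (undefined if either input is constant)."""
    return _cov(pred, true) / math.sqrt(_cov(pred, pred) * _cov(true, true))


def ccc(pred: Sequence[float], true: Sequence[float]) -> float:
    """Concordance correlation coefficient: also penalizes a shift in the means."""
    mean_gap = _mean(pred) - _mean(true)
    return 2 * _cov(pred, true) / (_cov(pred, pred) + _cov(true, true) + mean_gap ** 2)


def rmse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(_mean([(p - t) ** 2 for p, t in zip(pred, true)]))


def sagr(pred: Sequence[float], true: Sequence[float]) -> float:
    """Sign agreement: fraction of samples whose predicted sign matches the label."""
    def sign(x: float) -> int:
        return (x > 0) - (x < 0)
    return _mean([1.0 if sign(p) == sign(t) else 0.0 for p, t in zip(pred, true)])


true = [1.0, 2.0, 3.0]
pred = [2.0, 3.0, 4.0]  # perfectly correlated, but shifted by one
print(pcc(pred, true), ccc(pred, true), rmse(pred, true), sagr(pred, true))
```

The shifted example shows why CCC is reported alongside PCC: the correlation stays at its maximum while the concordance drops because the predictions are offset from the labels.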

The results are shown in the table below.

Thus, it is demonstrated that EMOCA outperforms the existing 3D face reconstruction methods and is comparable to EmoNet, a state-of-the-art image-based method.

Qualitative evaluation

The figure below compares the reconstruction results of EMOCA with existing methods.

From left to right, the figure shows the input image followed by the 3D face models generated by 3DDFA-V2, MGCNet, Deng et al., DECA, and EMOCA. This comparison confirms that EMOCA captures the emotion of the input image more appropriately than the other methods.

Summary

How was it? In this article, I explained EMOCA (EMOtion Capture and Animation), a method that reconstructs a 3D face model carrying the emotional information of an input image from a single real image.

This paper is the first to focus on the perceptual quality of facial expressions and their emotions in 3D face reconstruction, and it represents a new direction for the community. It is also the first attempt to integrate the fields of 3D face reconstruction and emotion analysis, and it is expected to be used in games, movies, and AR/VR.

However, as the capture and animation of such 3D face models improve, more realistic deepfakes will become possible, and it may become difficult to detect such malicious fakes. Research in this field therefore needs to stay aware of these risks at all times.

The details of the architecture of the model presented here and the generated 3D face model can be found in this paper if you are interested.