Now There's A Technique For Editing The Facial Movements Of Characters In A Video To Match Any Emotion!
3 main points
✔️ Proposed Cross-Reconstructed Emotion Disentanglement to separate audio into emotion-related features and speech content-related features
✔️ Proposed Target-Adaptive Face Synthesis to bridge the gap between estimated landmarks and input video motion
✔️ Enables emotion control in audio-driven video editing, a capability not available in existing methods
Audio-Driven Emotional Video Portraits
written by Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, Feng Xu
(Submitted on 15 Apr 2021 (v1), last revised 20 May 2021 (this version, v2))
Comments: Accepted by CVPR2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Audio-driven talking head generation, a research area that edits a person's face in a video so that it synchronizes with input audio, has produced many methods because of its strong demand in film production and telepresence.
However, most previous studies focus on the correlation between speech content and mouth motion, and no method had been developed that accounts for emotion, an essential feature of human facial expressions.
The authors cite the following challenges in modeling emotional expression from speech:
- Extracting emotion from speech is difficult because emotional information is intricately entangled with other features such as speech content
- Synthesizing face images that reflect the edited emotional information is difficult
To address these challenges, this paper introduces Emotional Video Portraits (EVP), which uses Cross-Reconstructed Emotion Disentanglement to solve the first challenge and Target-Adaptive Face Synthesis to solve the second, making it the first model to achieve emotion control in this field.
Overview of Emotional Video Portraits (EVP)
As shown in the figure below, EVP consists of two main components: Cross-Reconstructed Emotion Disentanglement and Target-Adaptive Face Synthesis.
Pseudo Training Pairs
To control emotion through speech, information about emotion and speech content, which are inherently entangled in a complex way, must be extracted independently from the audio signal.
To separate this information, the authors adopt cross reconstruction from prior work. However, this technique requires pairs of equal-length audio clips containing the same utterance spoken with different emotions, so the authors use an audio-visual dataset of various speakers saying the same content in different emotional states and construct two pseudo training pairs.
Specifically, Mel-Frequency Cepstral Coefficients (MFCC) are used to represent the speech, and Dynamic Time Warping (DTW) aligns the lengths of two clips by stretching and compressing the MFCC feature sequence along the time dimension.
The training pairs thus created are used to train the Cross-Reconstructed Emotion Disentanglement below.
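The alignment step can be sketched with a minimal DTW implementation. This is an illustrative pure-Python sketch over toy scalar features standing in for MFCC frame vectors, not the authors' code; the function names are hypothetical.

```python
def dtw_path(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Dynamic Time Warping: return the minimal-cost frame alignment
    between two feature sequences of possibly different lengths."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # cost[i][j] = minimal cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    # Backtrack to recover the frame-to-frame alignment.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min((cost[i - 1][j - 1], (i - 1, j - 1)),
                   (cost[i - 1][j], (i - 1, j)),
                   (cost[i][j - 1], (i, j - 1)))
        i, j = step[1]
    return path[::-1]

def warp_to(seq_b, path):
    """Stretch/compress seq_b along time so it aligns frame-by-frame
    with seq_a (one output frame per seq_a frame)."""
    out = {}
    for i, j in path:
        out.setdefault(i, seq_b[j])  # keep the first matched frame
    return [out[i] for i in sorted(out)]

# Toy example: the same "utterance" spoken at two different tempos.
neutral = [1, 2, 3]
happy = [1, 1, 2, 3]            # same content, one extra drawn-out frame
aligned = warp_to(happy, dtw_path(neutral, happy))
```

After warping, the two clips have equal length and can serve as a pseudo training pair; a real implementation would apply the same path to full MFCC vectors with a Euclidean frame distance.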
Cross-Reconstructed Emotion Disentanglement
The learning procedure for Cross-Reconstructed Emotion Disentanglement is shown in the figure below.
Speech clip Xi,m contains speech content i and emotion m, while speech clip Xj,n contains speech content j and emotion n. An Emotion Encoder (Ee) and a Content Encoder (Ec) extract the two kinds of information independently from each clip.
If the two factors are completely disentangled, combining the content embedding Ec(Xi,m) obtained from Xi,m with the emotion embedding Ee(Xj,n) obtained from Xj,n allows the audio clip Xi,n to be reconstructed.
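The cross-reconstruction idea can be illustrated with a toy example in which disentanglement is perfect by construction: content and emotion occupy separate halves of each feature vector, so swapping emotion embeddings between two clips yields the unseen combination. This is a conceptual sketch, not the paper's network; all names are illustrative.

```python
# Toy clips: first half = content features, second half = emotion features.
def make_clip(content, emotion):
    return content + emotion

def content_encoder(clip):   # Ec: keep only content-related features
    return clip[:len(clip) // 2]

def emotion_encoder(clip):   # Ee: keep only emotion-related features
    return clip[len(clip) // 2:]

def decoder(content_emb, emotion_emb):
    # Reassemble a clip from the two embeddings.
    return content_emb + emotion_emb

content_i, content_j = [1.0, 2.0], [5.0, 6.0]
emotion_m, emotion_n = [0.1, 0.2], [0.9, 0.8]

x_im = make_clip(content_i, emotion_m)   # content i spoken with emotion m
x_jn = make_clip(content_j, emotion_n)   # content j spoken with emotion n

# Cross reconstruction: content from x_im + emotion from x_jn
# should yield the clip with content i and emotion n.
x_in = decoder(content_encoder(x_im), emotion_encoder(x_jn))
assert x_in == make_clip(content_i, emotion_n)
```

In training, the encoders are neural networks and the disentanglement is enforced by a reconstruction loss between the cross-reconstructed clip and the ground-truth pseudo pair, rather than holding by construction as here.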
Target-Adaptive Face Synthesis
In this paper, we propose a method called Target-Adaptive Face Synthesis to bridge the gap between the facial landmarks predicted from the disentangled audio features and the pose and motion variations of the person in the video.
This method consists of the following three processes:
- Audio-To-Landmark Module for Predicting Landmark Movements from Separated Audio Information
- 3D-Aware Keypoint Alignment for aligning the generated facial landmarks with the facial landmarks of the characters in the video in 3D space.
- Edge-to-Video Translation Network to synthesize the generated landmarks and edge maps of the target frame
Let's look at them one at a time.
1. Audio-To-Landmark Module
The goal of this process is to predict landmark positions and movements from the audio clip with its disentangled emotion information, while keeping the face shape implied by the aligned landmarks, i.e. the identity of the person in the video, unchanged.
Therefore, a multilayer perceptron extracts the landmark identity embedding fa, which is sent to the audio-to-landmark module together with the content embedding and the emotion embedding.
The audio-to-landmark module then uses an LSTM network to predict the landmark displacements ld.
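The data flow of this module can be sketched as follows: per-frame content embeddings are concatenated with the identity and emotion embeddings and fed through a recurrent network. This is a minimal hand-rolled LSTM sketch, not the paper's architecture; all dimensions, weight initializations, and names are illustrative assumptions.

```python
import math
import random

def lstm_cell(x, h, c, W):
    """One LSTM step: x = input vector, h = hidden state, c = cell state,
    W = dict of per-gate weight matrices and biases."""
    def lin(name):
        gate = W[name]
        vec = x + h
        return [sum(wi * vi for wi, vi in zip(row, vec)) + b
                for row, b in zip(gate["w"], gate["b"])]
    sig = lambda v: [1.0 / (1.0 + math.exp(-z)) for z in v]
    tanh = lambda v: [math.tanh(z) for z in v]
    i, f, o = sig(lin("i")), sig(lin("f")), sig(lin("o"))
    g = tanh(lin("g"))
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def init_weights(in_dim, hidden, seed=0):
    rng = random.Random(seed)
    def mat():
        return {"w": [[rng.uniform(-0.1, 0.1) for _ in range(in_dim + hidden)]
                      for _ in range(hidden)],
                "b": [0.0] * hidden}
    return {gate: mat() for gate in "ifog"}

def audio_to_landmarks(frames, identity_emb, emotion_emb, W, hidden):
    """For each audio frame, concatenate [content, identity, emotion]
    embeddings and predict a landmark displacement vector."""
    h, c = [0.0] * hidden, [0.0] * hidden
    out = []
    for content_emb in frames:
        x = content_emb + identity_emb + emotion_emb
        h, c = lstm_cell(x, h, c, W)
        # Read displacements from the hidden state (a real model
        # would add a learned linear projection here).
        out.append(h[:])
    return out
```

A usage sketch: with 2-dim content, identity, and emotion embeddings (input dim 6) and hidden size 4, `audio_to_landmarks([[0.1, 0.2], [0.3, 0.4]], [0.5, 0.5], [1.0, 0.0], init_weights(6, 4), 4)` returns one 4-dim displacement vector per audio frame.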
2. 3D-Aware Keypoint Alignment
To align the head pose, we first perform landmark detection on the video using an existing method, and then fit a parametric 3D face model by solving a nonlinear optimization problem that recovers the 3D parameters from the 2D landmarks.
Then, from the shape and expression parameters, we obtain L3dp, the set of pose-invariant 3D landmarks, as shown in the equation below.
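The equation itself did not survive extraction; it presumably takes the standard 3D morphable model form, consistent with the definitions in the next sentence (here αk and βk denote the recovered shape and expression parameters):

```latex
L_{3d}^{p} = \bar{m} + \sum_{k} \alpha_k \, b_{geo}^{k} + \sum_{k} \beta_k \, b_{exp}^{k}
```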
where m is the mean facial landmark position, and bgeok and bexpk are the shape (geometry) and expression bases computed by principal component analysis of high-quality face scans and blendshapes (an animation technique).
3. Edge-to-Video Translation Network
Given the predicted landmarks and the target frame, the edge maps extracted from the landmarks and the frame are combined to create a guidance map.
Specifically, the Canny edge detection algorithm detects edges in regions other than the face, the original landmark positions are replaced with the predicted landmarks, and adjacent facial landmarks are then connected to create a face sketch.
This makes it possible to generate smooth, realistic frames that match the movements of the people in the video.
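The guidance-map construction can be sketched as follows. This is a simplified illustration: a precomputed binary edge map and a rectangular face box stand in for Canny edge detection and face parsing, and the sketch connects consecutive landmarks with straight segments; all names are hypothetical.

```python
def guidance_map(edge_map, face_box, landmarks):
    """Combine a background edge map with a face sketch: edges inside
    the face region are cleared and replaced by line segments
    connecting adjacent predicted landmarks (given as (x, y))."""
    h, w = len(edge_map), len(edge_map[0])
    x0, y0, x1, y1 = face_box
    out = [row[:] for row in edge_map]
    # 1) Keep only edges outside the face region.
    for y in range(max(0, y0), min(h, y1)):
        for x in range(max(0, x0), min(w, x1)):
            out[y][x] = 0
    # 2) Rasterize the face sketch by connecting adjacent landmarks.
    for (xa, ya), (xb, yb) in zip(landmarks, landmarks[1:]):
        steps = max(abs(xb - xa), abs(yb - ya), 1)
        for t in range(steps + 1):
            x = round(xa + (xb - xa) * t / steps)
            y = round(ya + (yb - ya) * t / steps)
            if 0 <= x < w and 0 <= y < h:
                out[y][x] = 1
    return out
```

The resulting map carries the target frame's background structure plus the emotion-edited face sketch, and is what the edge-to-video translation network conditions on.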
In this paper, we compare our method with the following three existing methods.
- ATVGnet (Chen et al. 2019): an image-based method that synthesizes facial motion based on landmarks and employs an attention mechanism to improve generation quality
- Everybody's Talkin' (Song et al. 2020): a video-based method for video editing with voice by applying 3D face models.
- MEAD (Wang et al. 2020): the first face generation method with emotion control, and the one most relevant to this work
The generated results are shown in the figure below.
From these results, we can see that:
- Chen's and Song's methods do not take emotion into account, so they generate plausible mouth shapes but always with a neutral expression
- Wang's method learns mouth shapes directly from speech signals in which emotion and speech content are entangled, so the emotion conveyed by the predicted mouth shape may not match the facial expression (red box to the left of Wang's row)
- In addition, Wang's method is not sufficiently robust to data with large head movements or changing backgrounds, so it may produce implausible facial expressions (the middle of Wang's row) or altered hairstyles (the red box on the right of Wang's row)
- Compared with these methods, our method can generate emotional facial images with high fidelity.
Thus, it is demonstrated that our method has a very good performance compared with the existing methods.
To quantitatively evaluate this method against the existing ones, two metrics were used to assess facial motion: LD (Landmark Distance: the average Euclidean distance between generated and ground-truth landmarks) and LVD (Landmark Velocity Difference: the average Euclidean distance between the frame-to-frame velocities of generated and ground-truth landmarks).
LD and LVD were applied separately to the mouth and whole-face regions to evaluate how accurately the synthesized videos reproduce lip movements and facial expressions, and the scores of the standard image-quality metrics SSIM, PSNR, and FID were also compared.
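The two motion metrics follow directly from their definitions and can be computed as below. This is a straightforward reading of the definitions, not the authors' evaluation code; landmarks are given as per-frame lists of (x, y) points.

```python
import math

def landmark_distance(pred, real):
    """LD: mean Euclidean distance between corresponding landmarks,
    averaged over all frames and landmark points."""
    total, count = 0.0, 0
    for frame_p, frame_r in zip(pred, real):
        for p, r in zip(frame_p, frame_r):
            total += math.dist(p, r)
            count += 1
    return total / count

def landmark_velocity_difference(pred, real):
    """LVD: the same distance, but between frame-to-frame landmark
    velocities, so it measures how well motion (not position) matches."""
    vel = lambda seq: [
        [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]
    return landmark_distance(vel(pred), vel(real))
```

Restricting `pred` and `real` to the mouth landmarks gives the M columns of the table; using all facial landmarks gives the F columns.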
The results are shown in the table below.
M represents the Mouth region and F represents the Face region.
To quantify the quality of the generated video clips, a user study was conducted comparing our method, the three existing methods, and real videos under the following conditions:
- Three video clips were generated for each of the eight emotion categories and each of the three speakers, for a total of 72 rated videos
- The evaluation had two stages: first, participants rated the audio and video quality of each video on a scale from 1 (worst) to 5 (best)
- Then, after watching the real clip without sound, participants selected an emotion category for the muted generated video, assessing whether appropriate emotional expressions were produced
The results of the 50-participant survey are shown below.
Thus, it was confirmed that our method obtained the highest scores for both the quality of the generated video and its synchronization with the audio, and it also achieved the highest emotion classification accuracy compared with the existing methods.
In this article, we introduced Emotional Video Portraits (EVP), a model that enables video editing with emotion control by means of Cross-Reconstructed Emotion Disentanglement and Target-Adaptive Face Synthesis.
As you can see in the generated video, this method produces very natural facial expressions for the "Happy" and "Angry" conditions, and we expect further progress in this research field based on this paper.
The details of the architecture of the model presented here and the generated video can be found in this paper if you are interested.