
More Realistic Facial 3D Animations Can Be Generated From Audio!



3 main points
✔️ Proposed a new audio-driven facial animation method that separates input into speech and expression signals
✔️ Proposed a new loss function, cross-modality loss, which takes both inputs into account
✔️ Successfully generated more plausible upper face animation compared to existing methods

MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement
written by Alexander Richard, Michael Zollhöfer, Yandong Wen, Fernando de la Torre, Yaser Sheikh
(Submitted on 16 Apr 2021)
Comments: ICCV 2021

Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Audio-driven facial animation is a task that takes speech audio and a 3D mesh of a template face as input and outputs a 3D animation of the whole face moving as the speaker utters the content of the speech. It is a research field that is attracting attention because it can be applied in areas such as computer games and virtual avatars.

Although various approaches have been developed in this field to date, one major unresolved issue in existing research is the difficulty of generating facial movements that are uncorrelated with speech, such as blinking and eyebrow movements.

For example, there is no correlation between what is being said and when the speaker blinks. Because these upper face movements are difficult to learn, previous studies have been unable to generate plausible upper face movements.

The MeshTalk model presented in this paper outputs plausible upper face motion by separating the input into two parts, a speech signal and an expression signal, and succeeds in generating a more realistic 3D animation of the whole face than existing research.

MeshTalk Overview

The network architecture of MeshTalk is shown in the figure below.

This model uses three inputs: the speech signal, which is a sequence of T speech sounds; the template mesh, which is represented by V vertices; and the expression signal, which is a sequence of T face meshes that serves as the basis for animation generation.

Audio encoder

The audio encoder converts the speech signal into a mel-spectrogram, a set of speech features, and generates a new feature vector every 10 ms.

The network consists of one-dimensional convolutions, with skip connections between the layers. (See the figure below)
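As a rough illustration of the two ideas above (one feature vector every 10 ms, and 1D convolutions with skip connections), here is a minimal NumPy sketch. The 16 kHz sample rate, the ReLU activation, the kernel size, and the function names are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def num_feature_frames(num_samples, sample_rate, hop_ms=10.0):
    """Number of feature vectors when one is produced every hop_ms milliseconds."""
    hop = int(sample_rate * hop_ms / 1000)  # samples between two feature vectors
    return num_samples // hop

def conv1d_same(x, w):
    """1D convolution over time (cross-correlation, as in deep learning libraries).

    x: (T, C_in) features, w: (K, C_in, C_out) kernel with odd K.
    Zero padding keeps the output length equal to T.
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], w.shape[2]))
    for t in range(x.shape[0]):
        window = xp[t:t + k]                      # (K, C_in)
        out[t] = np.einsum("kc,kco->o", window, w)
    return out

def residual_conv_block(x, w):
    """Convolution followed by ReLU (assumed), added back to the input (skip connection)."""
    return x + np.maximum(conv1d_same(x, w), 0.0)

# One second of audio at an assumed 16 kHz gives 100 feature vectors.
frames = num_feature_frames(16000, 16000)
features = np.ones((frames, 4))
out = residual_conv_block(features, np.zeros((3, 4, 4)))
```

With an all-zero kernel the block returns its input unchanged, which is the property that makes skip connections easy to stack.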

Expression encoder

The expression encoder flattens a T × V × 3 dimensional input mesh sequence to T × 3V and maps it to 128 dimensions using a fully connected layer.

A single LSTM layer is then used to learn the temporal information along the input mesh sequence, and the final output is a 128-dimensional linear projection of the output of the LSTM layer.
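The data flow of the expression encoder can be sketched as follows: each frame is flattened to a vector of length 3V, mapped to 128 dimensions, passed through one LSTM layer, and linearly projected. This is only a shape-level sketch with random weights; the parameter names, the initialization, and the hand-written LSTM are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_layer(xs, W, U, b):
    """Standard single-layer LSTM. xs: (T, D), W: (D, 4H), U: (H, 4H), b: (4H,)."""
    hidden_size = U.shape[0]
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    outputs = []
    for x in xs:
        z = x @ W + h @ U + b
        i, f, g, o = np.split(z, 4)               # input, forget, cell, output gates
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)                      # (T, H)

def random_params(V, rng, dim=128):
    """Hypothetical random weights, only used to exercise the shapes."""
    s = 0.01
    return {
        "fc_w": rng.normal(scale=s, size=(3 * V, dim)), "fc_b": np.zeros(dim),
        "W": rng.normal(scale=s, size=(dim, 4 * dim)),
        "U": rng.normal(scale=s, size=(dim, 4 * dim)),
        "b": np.zeros(4 * dim),
        "proj_w": rng.normal(scale=s, size=(dim, dim)), "proj_b": np.zeros(dim),
    }

def expression_encoder(meshes, p):
    """meshes: (T, V, 3) -> (T, 128)."""
    T, V, _ = meshes.shape
    flat = meshes.reshape(T, V * 3)                       # (T, 3V)
    feats = flat @ p["fc_w"] + p["fc_b"]                  # (T, 128)
    hidden = lstm_layer(feats, p["W"], p["U"], p["b"])    # (T, 128)
    return hidden @ p["proj_w"] + p["proj_b"]             # (T, 128)

rng = np.random.default_rng(0)
T, V = 6, 10
meshes = rng.normal(size=(T, V, 3))
params = random_params(V, rng)
encoding = expression_encoder(meshes, params)             # shape (6, 128)
```

Because the LSTM processes frames in order, the encoding of a frame depends only on that frame and the ones before it.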

Fusion model

The fusion model concatenates the outputs of the audio encoder and the expression encoder and passes them through three fully connected layers.

The final output size is T × H × C, where T is the sequence length, H is the number of latent classification heads, and C is the number of categories.
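To make the T × H × C output shape concrete, here is a small sketch of the fusion step: the two 128-dimensional encodings are concatenated, passed through three fully connected layers, and reshaped into H heads with C categories each. The ReLU activations, the argmax-based one-hot selection, and the small values of H and C are assumptions for this example; the article does not specify them.

```python
import numpy as np

def fuse(audio_enc, expr_enc, layers, H, C):
    """audio_enc, expr_enc: (T, 128). layers: list of (W, b); the last maps to H * C."""
    x = np.concatenate([audio_enc, expr_enc], axis=-1)    # (T, 256)
    for idx, (W, b) in enumerate(layers):
        x = x @ W + b
        if idx < len(layers) - 1:
            x = np.maximum(x, 0.0)                        # ReLU (assumed)
    return x.reshape(x.shape[0], H, C)                    # (T, H, C) logits

def to_one_hot(logits):
    """Pick one category per head and encode it as a one-hot vector."""
    best = logits.argmax(axis=-1)                         # (T, H)
    return np.eye(logits.shape[-1])[best]                 # (T, H, C)

rng = np.random.default_rng(0)
T, H, C = 5, 4, 8                                         # hypothetical sizes
layers = [
    (rng.normal(size=(256, 128)), np.zeros(128)),
    (rng.normal(size=(128, 128)), np.zeros(128)),
    (rng.normal(size=(128, H * C)), np.zeros(H * C)),
]
logits = fuse(rng.normal(size=(T, 128)), rng.normal(size=(T, 128)), layers, H, C)
latent = to_one_hot(logits)                               # shape (5, 4, 8)
```

Each of the H heads contributes exactly one active category per time step, so the latent embedding is categorical rather than continuous.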


Decoder

The decoder maps the template mesh to 128 dimensions via three fully connected (fc) layers, as shown in the figure below. This encoding is then concatenated with the latent embedding obtained by the fusion model and re-mapped to 128 dimensions by fully connected layers.

Two LSTM layers are used to model the temporal dependencies in the latent embedding, and each fully connected layer has a skip connection.

The final output is projected back into a V × 3 dimensional space to produce the face animation.
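The decoder's shape bookkeeping can be sketched as below. To keep it short, each stage is reduced to a single linear map, and the two LSTM layers and the skip connections are left out, so this is a simplification under stated assumptions rather than the architecture itself; all weight names are hypothetical.

```python
import numpy as np

def decode(template, latent, W_t, W_f, W_out):
    """template: (V, 3), latent: (T, H, C) -> animation of shape (T, V, 3)."""
    T = latent.shape[0]
    V = template.shape[0]
    t_enc = template.reshape(-1) @ W_t                    # template mesh -> (128,)
    t_enc = np.tile(t_enc, (T, 1))                        # same template for every frame
    z = np.concatenate([t_enc, latent.reshape(T, -1)], axis=-1)  # (T, 128 + H*C)
    hidden = np.maximum(z @ W_f, 0.0)                     # re-mapped to (T, 128)
    # The two LSTM layers over time would operate on `hidden` here (omitted).
    return (hidden @ W_out).reshape(T, V, 3)              # back to V x 3 per frame

rng = np.random.default_rng(0)
T, V, H, C = 5, 10, 4, 8                                  # hypothetical sizes
animation = decode(
    rng.normal(size=(V, 3)),
    rng.normal(size=(T, H, C)),
    rng.normal(size=(3 * V, 128)),
    rng.normal(size=(128 + H * C, 128)),
    rng.normal(size=(128, 3 * V)),
)
```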

Cross-modality loss

The main difference between this model and existing research is that it uses two features as input: the speech signal, which is a sequence of speech utterances, and the expression signal, which is a sequence of facial expressions.

However, if we use the existing L2 reconstruction loss function, the speech signal is ignored because the expression signal already contains all the information needed to reconstruct the animation. This can lead to problems with synchronization between speech and lip movements.

To cope with this problem, the paper proposes a new loss function called cross-modality loss.

The speech signal, which is the sequence of input speech sounds, is denoted by a1:T, and the expression signal, which is the sequence of corresponding face meshes, by x1:T. The output 3D animation is then represented as follows.

Here, x̃1:T and ã1:T are facial expression and speech data randomly sampled from the dataset, so the upper equation represents the output given the correct speech input and a random facial expression, while the lower equation represents the output given the correct facial expression and random speech input.
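The equations themselves are not reproduced in this text. Based on the description above, the two outputs can plausibly be written as follows. The symbols D (decoder), E (the encoders together with the fusion model), and h (template mesh) are notation assumed here for illustration.

```latex
\begin{align}
\hat{h}^{(\mathrm{audio})}_{1:T} &= D\big(h,\; E(\tilde{x}_{1:T},\, a_{1:T})\big) \\
\hat{h}^{(\mathrm{expr})}_{1:T}  &= D\big(h,\; E(x_{1:T},\, \tilde{a}_{1:T})\big)
\end{align}
```

The first line uses the correct speech a with a random expression sequence, and the second uses the correct expression x with random speech.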

The cross-modality loss is then defined as

where M(upper) is a mask that assigns higher weights to the upper part of the face and lower weights to the area around the mouth, and M(mouth) is a mask that assigns higher weights to the mouth and lower weights to the upper part of the face.
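The formula is not reproduced in this text. A form consistent with the description is given below, as a hedged reconstruction. Here ĥ(expr) denotes the output given the correct expression and random speech, ĥ(audio) the output given the correct speech and a random expression, and x the ground truth mesh sequence; the per-vertex indexing by t and v is an assumption.

```latex
\mathcal{L}_{\mathrm{xMod}}
  = \sum_{t=1}^{T} \sum_{v=1}^{V}
    M^{(\mathrm{upper})}_{v} \left\| \hat{h}^{(\mathrm{expr})}_{t,v} - x_{t,v} \right\|^{2}
  + M^{(\mathrm{mouth})}_{v} \left\| \hat{h}^{(\mathrm{audio})}_{t,v} - x_{t,v} \right\|^{2}
```

The upper face is therefore supervised through the output that only saw the correct expression, and the mouth through the output that only saw the correct speech.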

This cross-modality loss enables accurate reconstruction of the upper part of the face independent of speech input and accurate reconstruction of the mouth area associated with speech.
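A minimal numerical sketch of such a masked loss is shown below. The function name, the argument names, and the binary masks in the example are assumptions; the article describes the masks as assigning higher and lower weights, so real masks are likely soft rather than 0/1.

```python
import numpy as np

def cross_modality_loss(h_expr, h_audio, target, mask_upper, mask_mouth):
    """Masked squared error over two reconstructions.

    h_expr:  (T, V, 3) output from correct expression + random speech
    h_audio: (T, V, 3) output from correct speech + random expression
    target:  (T, V, 3) ground truth meshes
    mask_upper, mask_mouth: (V,) per-vertex weights
    """
    err_expr = np.sum((h_expr - target) ** 2, axis=-1)    # (T, V)
    err_audio = np.sum((h_audio - target) ** 2, axis=-1)  # (T, V)
    return float(np.sum(mask_upper * err_expr + mask_mouth * err_audio))

# Toy example: 2 frames, 4 vertices; vertices 0-1 are "upper face", 2-3 are "mouth".
target = np.zeros((2, 4, 3))
mask_upper = np.array([1.0, 1.0, 0.0, 0.0])
mask_mouth = np.array([0.0, 0.0, 1.0, 1.0])

h_expr = np.ones((2, 4, 3))      # wrong everywhere
h_audio = np.zeros((2, 4, 3))    # perfect everywhere
loss = cross_modality_loss(h_expr, h_audio, target, mask_upper, mask_mouth)
```

In this toy case only the upper-face errors of `h_expr` are counted: its errors on the mouth vertices are masked out, which is how the loss avoids asking the expression branch to explain lip motion.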

In addition, a loss on eyelid movements is added during training, since blinking is almost entirely unaffected by voice input.

Thus, the final loss to be optimized is L_xMod + L_eyelid.

Perceptual Evaluation

In the paper, a total of 100 participants compared this approach against the existing method VOCA and against ground truth data under the following conditions.

  • Comparisons are made in three regions: full face, lip sync (chin to nose), and upper face (from the nose upward).
  • In each comparison, 400 pairs of short clips spoken by speakers from the test set were evaluated.
  • Participants chose among three options: the competitor (VOCA or ground truth) is better (competitor), both are equally good (equal), or this approach is better (ours).

The results of this survey are shown in the table below.

As shown above, more than half of the participants answered that our approach was superior to the existing VOCA method, and when compared to ground truth data, about half of the participants answered that our approach was equal or superior.

Qualitative Examples

Next, let's take a look at a sample animation generated by this approach.

We can confirm that this approach generates very realistic facial animations with accurate voice and mouth synchronization and upper face motions such as blinking and eyebrow-raising.

It is also noteworthy that the mouth movements match the speech parts of the audio data, but the upper facial movements, such as eyebrow-raising and blinking, are generated separately for each sequence.


How was it? This article introduced MeshTalk, a model that successfully outputs plausible upper face motion by separating the input into two parts: a speech signal and an expression signal.

Although this method can generate more realistic facial animations than existing methods, it has some limitations: because of its high computational cost, it cannot run in real time on low-cost hardware such as laptop CPUs and VR devices, and it is difficult to synchronize with audio. It will be interesting to see how future research progresses.

The details of the MeshTalk architecture and generated videos introduced in this article can be found in this paper if you are interested.
