A 3D Mesh Of A Face Resembling The Speaker Can Be Generated From Speech Alone


3 main points
✔️ Extended the existing dataset Voxceleb to create Voxceleb-3D, a paired voice and face mesh dataset
✔️ Proposed Cross-Modal Perceptionist, a framework for reconstructing 3D face meshes from voice data only
✔️ Enables applications, such as video editing with emotion control, that are not possible with existing methods

Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
written by Cho-Ying Wu, Chin-Cheng Hsu, Ulrich Neumann
(Submitted on 18 Mar 2022)
Comments: Accepted to CVPR 2022

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Since the human voice is shaped by the articulatory structures of the vocal cords, facial muscles, and facial skeleton, a correlation has been shown to exist between a person's voice and the shape of their face. Building on this correlation, prior research has generated plausible face images from a speaker's voice alone.

However, such tasks that generate face images from audio suffer from a known problem: factors that do not correlate with the voice, such as background, hairstyle, and facial texture, cannot be predicted.

The paper presented in this article is based on the hypothesis that a 3D mesh, which is free of the noise factors described above, may allow a speaker's face shape to be predicted more accurately. It is the first work to investigate the correlation between speech and 3D face shape based on this hypothesis.

The two main contributions of this paper are as follows.

  • Created Voxceleb-3D, a new dataset for generating 3D meshes of a speaker's face from speech
  • Proposed Cross-Modal Perceptionist, a framework for reconstructing 3D face mesh from audio data only

Let's take a look at each of them.


The main goal of this paper is to investigate the correlation between speech and the 3D shape of a person's face, which requires a large-scale 3D face dataset.

To address this issue, the paper follows the approach of existing studies and combines Voxceleb (a large dataset of celebrity speech utterances) with VGGFace (a large face image dataset) to build Voxceleb-3D, a new dataset consisting of pairs of speech utterances and 3D face data of the speakers. (The figure below shows sample face meshes.)

Specifically, the intersection of the audio data in Voxceleb and the image data in VGGFace is obtained via existing research, and 3D face data is fitted to the 2D images using the optimization approach employed in the prominent 3D face dataset 300W-LP; the result is Voxceleb-3D.

Details of the voice data, face images, 3DMM parameters, and gender ratios included in Voxceleb-3D are shown in the table below.

Of the 1,225 speakers in the dataset, those whose names begin with A through E form the evaluation set, and the remainder form the training set.
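This name-based split can be sketched as a simple partition by speaker-name initial (the speaker names below are hypothetical, not actual entries from the dataset):

```python
# Partition speakers into evaluation (names starting A-E) and training sets,
# mirroring the split described above. Speaker names are made up.
speakers = ["Alice Adams", "Bob Brown", "Eve Evans", "Frank Ford", "Zoe Zane"]

eval_set = [s for s in speakers if s[0].upper() in "ABCDE"]
train_set = [s for s in speakers if s[0].upper() not in "ABCDE"]
```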

Cross-Modal Perceptionist

Cross-Modal Perceptionist learns a 3D face mesh from speech using 3D Morphable Models (3DMM), an existing parametric 3D face model based on Principal Component Analysis, and analyzes the correlation between speech and 3D face shape using both supervised and unsupervised learning.
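A 3DMM represents a face mesh as a mean shape plus a linear combination of PCA basis vectors weighted by a parameter vector α. The numpy sketch below illustrates this reconstruction; the dimensions and random basis are purely illustrative, not those of the actual model:

```python
import numpy as np

# Illustrative sizes: N mesh vertices, K PCA shape components.
N, K = 1000, 40
rng = np.random.default_rng(0)

mean_shape = rng.standard_normal((3 * N,))     # mean face, flattened (x, y, z per vertex)
shape_basis = rng.standard_normal((3 * N, K))  # PCA basis (learned from face scans in practice)
alpha = rng.standard_normal((K,))              # 3DMM parameters (predicted from speech in CMP)

# Reconstruct the mesh: mean shape plus linear combination of basis vectors.
mesh = mean_shape + shape_basis @ alpha
vertices = mesh.reshape(N, 3)
```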

Supervised Learning with Voice/Mesh Pairs

First, we explain the supervised learning method shown in the figure below.

First, given a pair of speech and 3DMM parameters as input, a mel-spectrogram is computed from the input speech and a speech embedding is extracted from it.

Next, following existing research, the speech encoder Φv is pre-trained on a large-scale speaker recognition task, and the decoder Φdec then learns to estimate the 3DMM parameters α. (The ground-truth parameters α* are used to compute the supervised loss.)
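The supervised objective can be sketched as a regression loss between the decoder's output α and the ground truth α*. The encoder output and decoder below are stand-in random values and a single linear map, not the paper's actual networks, and the L1 loss is one plausible choice (the paper's exact loss formulation may differ):

```python
import numpy as np

rng = np.random.default_rng(42)
D_emb, K = 64, 40  # embedding and 3DMM-parameter dimensions (illustrative)

voice_embedding = rng.standard_normal((D_emb,))  # stand-in for the output of encoder Φv
W_dec = rng.standard_normal((K, D_emb)) * 0.01   # stand-in for decoder Φdec

alpha_pred = W_dec @ voice_embedding  # predicted 3DMM parameters α
alpha_gt = rng.standard_normal((K,))  # ground-truth parameters α*

# Supervised loss: mean absolute error between α and α*.
supervised_loss = np.mean(np.abs(alpha_pred - alpha_gt))
```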

Because acquiring 3D face data poses various problems, such as high cost, privacy restrictions, and the time-consuming fitting of 3DMM parameters to facial landmarks, unsupervised learning is considered a more practical approach.

Therefore, in this paper, we propose a framework for unsupervised learning with knowledge distillation shown in the figure below.

This framework consists of the following two stages:

  1. Synthesis of 2D face images from audio using GAN
  2. 3D modeling from synthesized face images

By distilling knowledge from a well-trained supervised model, the face shape can be learned without ground-truth face scans or 3DMM parameters fitted through optimization.
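The distillation idea behind the two stages above can be sketched as follows: the synthesized-image branch (the "teacher" pipeline of GAN plus image-to-3D model) produces pseudo 3DMM parameters, and the voice-to-mesh "student" is trained to match them. All functions here are hypothetical stand-ins for the real networks:

```python
import numpy as np

rng = np.random.default_rng(7)
K = 40  # 3DMM parameter dimension (illustrative)

def teacher_params_from_voice(voice):
    """Stages 1+2 stand-in: a GAN synthesizes a face image from voice,
    then an image-to-3DMM model yields pseudo-label parameters."""
    return np.tanh(voice[:K])  # placeholder computation

def student_params_from_voice(voice, W):
    """Voice-to-3DMM student network (stand-in: a single linear map)."""
    return W @ voice

voice = rng.standard_normal((64,))
W = rng.standard_normal((K, 64)) * 0.01

alpha_teacher = teacher_params_from_voice(voice)    # pseudo labels
alpha_student = student_params_from_voice(voice, W)

# Distillation loss: the student is trained to match the teacher's pseudo labels.
distill_loss = np.mean((alpha_student - alpha_teacher) ** 2)
```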

In summary, the overall architecture of Cross-Modal Perceptionist is shown in the figure below.

Here, c-ka-sb denotes a convolutional layer with c output channels, kernel size a, and stride b, and d in a linear layer means that it outputs a d-dimensional vector.


The paper applies CMP and existing methods to the Voxceleb-3D dataset described above, and also evaluates the results through user studies.

Comparative verification with existing methods

The comparative validation was conducted using the following metrics and baselines.

Evaluation metrics

The evaluation metric is ARE (Absolute Ratio Error), which is also used in existing methods; facial distances are measured and compared as shown below.

Each metric is computed as a ratio of facial distances, for example ER (ear-to-ear ratio) = AB (distance between the two ears) / EF (distance between the outer corners of the two eyes); these metrics capture the extent to which the generated faces are deformed.
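Under this definition, ARE compares a facial-distance ratio measured on the generated mesh against the same ratio on the ground truth. The sketch below is one plausible reading of the metric; the landmark coordinates are made up for illustration:

```python
import numpy as np

def dist(p, q):
    """Euclidean distance between two 3D landmarks."""
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))

def ratio_ER(left_ear, right_ear, left_eye_outer, right_eye_outer):
    """ER = ear-to-ear distance / outer-eye-to-outer-eye distance."""
    return dist(left_ear, right_ear) / dist(left_eye_outer, right_eye_outer)

def absolute_ratio_error(r_pred, r_gt):
    """ARE: relative deviation of the predicted ratio from the ground truth."""
    return abs(r_pred - r_gt) / r_gt

# Illustrative landmarks on the predicted and ground-truth meshes.
r_pred = ratio_ER((-8, 0, 0), (8, 0, 0), (-4, 2, 1), (4, 2, 1))    # 16 / 8 = 2.0
r_gt = ratio_ER((-7.5, 0, 0), (7.5, 0, 0), (-5, 2, 1), (5, 2, 1))  # 15 / 10 = 1.5

are = absolute_ratio_error(r_pred, r_gt)
```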


For this validation, a baseline that generates a 3D mesh from audio, shown in the figure below, was constructed by directly cascading two separately trained existing models: a GAN-based voice-to-image model and an image-to-3D-mesh model.

The table below shows the results of comparing CMP against the baseline using these metrics.

From these results, we can see that

  • Compared to the baseline of directly cascaded pre-trained existing models, cross-modal learning with CMP gave much better results (about a 20% improvement)
  • These improvements reveal a correlation between voice and face shape, indicating that learning to predict 3D face meshes from voice information is feasible
  • Of all the measures, ER showed the most significant improvement, indicating that face width may be the most effective measure for prediction by voice information

The results demonstrate that cross-modal learning with CMP can generate 3D face meshes from audio information with very high accuracy.

In addition, to test the hypothesis that face width is the most effective index for prediction based on voice information, we conducted a comparative test using various face shapes.

Face meshes from our supervised learning

The figure below shows the four face-shape categories used for the comparison (skinny, wide, regular, and slim) together with the reference images.

As the figure shows, supervised learning with CMP generated face meshes that matched the face shapes of the reference images, supporting the hypothesis drawn from the comparative validation above.


This article introduced Voxceleb-3D, a newly created dataset of paired voice and face-mesh data, and Cross-Modal Perceptionist, a model that generates 3D face meshes from audio data alone.

This model demonstrates not only that high-quality 3D face meshes can be generated from audio data alone, but also that face width is the most effective metric for prediction from audio information.

However, some issues remain: fine facial details such as irregularities and wrinkles are difficult to generate from audio alone, and changes in the voice due to health conditions, such as after smoking or drinking, may affect the quality of the generated meshes.

If you are interested, details of the model architecture and the generated 3D face meshes can be found in the paper.
