A Model That Generates The Listener's Response To A Conversation Is Now Available!

3 main points
✔️ Proposes a motion-audio cross-attention transformer that fuses the speaker's motion and speech modalities
✔️ Introduces a sequence-encoding VQ-VAE that learns discrete latent representations of the listener's movements
✔️ Builds a large dataset of videotaped two-person conversations

Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion
written by Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, Shiry Ginosar
(Submitted on 18 Apr 2022)
Conference on Computer Vision and Pattern Recognition (CVPR) 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In a two-person conversation, coordination between speaker and listener is essential. Existing research has shown that nonverbal feedback from the listener, such as head movement, is more important than the content of the response in maintaining the flow of the conversation.

However, modeling nonverbal feedback in face-to-face conversation is a very difficult task, for two reasons:

  • The speaker is multimodal, communicating both verbally through speech and nonverbally through facial and body movements.
  • The listener's response is non-deterministic.

Both challenges must be addressed in order to model natural dialogue.

The paper introduced here tackles both problems with a data-driven approach, using a novel, large dataset of two-person conversations.

As shown in the figure below, the model extracts speech and facial motion from an input video of a speaker and generates a variety of listener responses synchronized with the speaker.

Two components make this possible: the motion-audio cross-attention transformer, which fuses the speaker's motion and speech modalities, and the sequence-encoding VQ-VAE, which learns discrete latent representations of the listener's movements. Together they predict the corresponding listener response autoregressively from the speaker's multimodal information.

Let's take a closer look at this model and dataset.

Model Overview

The paper aims to model the reciprocal responses of speaker and listener in face-to-face communication. To this end, it sets up an autoregressive task: given the speaker's 3D face model and audio, predict the corresponding listener's facial motion.

As shown in the figure below, the model predicts a distribution over the corresponding listener responses, conditional on multimodal input from the speaker.

To model the speaker's speech and facial motion, the paper proposes a novel motion-audio cross-modal transformer that learns to fuse the two modalities.
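To give a feel for what fusing two modalities with attention means, here is a minimal sketch of scaled dot-product cross-attention in plain Python. It is not the authors' architecture: it has no learned projections, heads, or layers, and the function names and toy feature values are assumptions made for illustration. Each speaker motion frame acts as a query that attends over all audio frames.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    queries: list of motion feature vectors (one per motion frame)
    keys, values: lists of audio feature vectors (one per audio frame)
    Returns one fused vector per query.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# Toy features (hypothetical values, 2 dimensions per frame).
motion = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(motion, audio, audio)
```

Each output row is a weighted average of the audio features, with weights determined by how well each audio frame matches the motion frame. A real model would apply learned linear projections to the queries, keys, and values first.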

In addition, a sequence-encoding VQ-VAE, which extends VQ-VAE to the motion synthesis domain, learns a discrete latent space. This makes it possible to predict a multinomial distribution over the listener's response at the next time step.
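The core idea behind a discrete latent space in any VQ-VAE is vector quantization: a continuous latent vector is replaced by the index of its nearest entry in a codebook. The sketch below shows only that lookup step, under the assumption of a tiny hand-written codebook. In the actual model the codebook and the encoder are learned, and the names used here are hypothetical.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to `vector`
    under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

# Hypothetical 3-entry codebook and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
latents = [[0.9, 1.2], [-0.8, 0.4], [0.1, -0.1]]

indices = [quantize(z, codebook)[0] for z in latents]
print(indices)  # [1, 2, 0]
```

Once motion is expressed as a sequence of such indices, predicting the next step becomes a classification problem over codebook entries, which is what allows the model to output a multinomial distribution.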

The resulting output is a distribution over the listener's responses synchronized to the speaker, from which multiple motions can be sampled.
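Sampling several different responses from a predicted distribution can be sketched as follows. The predictor here, `toy_next_code_probs`, is a hypothetical stand-in that ignores the speaker entirely and simply favors repeating the previous code. In the paper this role is played by the learned model conditioned on the speaker's motion and audio.

```python
import random

def toy_next_code_probs(history, num_codes):
    """Hypothetical stand-in for a learned predictor: returns a probability
    for each codebook index, favoring a repeat of the last sampled code."""
    probs = [1.0] * num_codes
    if history:
        probs[history[-1]] += 2.0
    total = sum(probs)
    return [p / total for p in probs]

def sample_sequence(length, num_codes, rng):
    """Autoregressively sample one sequence of codebook indices."""
    codes = []
    for _ in range(length):
        probs = toy_next_code_probs(codes, num_codes)
        next_code = rng.choices(range(num_codes), weights=probs, k=1)[0]
        codes.append(next_code)
    return codes

# Different random seeds can give different listener code sequences
# for the same input.
seq_a = sample_sequence(8, 4, random.Random(0))
seq_b = sample_sequence(8, 4, random.Random(1))
```

Because each step is drawn from a distribution rather than taken as a single best guess, repeated sampling yields multiple plausible sequences, which a decoder would then turn back into motion.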

Conversational Dataset

With the spread of COVID-19, video interviews have increasingly been conducted on split-screen videoconferencing platforms, where the speaker and the listener each occupy one side of the screen. This format is very convenient for studying face-to-face communication because both parties face the camera directly.

Against this background, the authors extracted 72 hours of facial motion and audio from six YouTube channels and built a training dataset covering a variety of scenes and facial expressions.

For the videos obtained in this way, DECA, an existing facial expression extraction method, is used to recover the head pose and facial expression of a 3D face model from the real-world footage. (A sample video of the dataset can be found here.)

During training, the prediction model is learned using these extracted facial expressions and poses, together with the speaker's audio, as pseudo-ground truth.


To evaluate how well the proposed model translates the speaker's speech and motion into listener motion, the study compares it against a set of baselines.

Comparison with baselines

The paper evaluates the model's predictions along multiple axes, based on the idea that a conversational listener's motion should be (1) realistic, (2) diverse, and (3) synchronized with the speaker's motion.

Specifically, the listener's facial expression and head rotation are evaluated separately using the metrics below.

  • L2: Euclidean distance to the ground-truth expression coefficients / pose values
  • Fréchet Distance (FD): the distribution distance between the generated motion sequences and the ground-truth motion sequences
  • Variation: the variance of motion across the sequence
  • SI: the diversity of predictions
  • Paired FD (P-FD): a measure of synchronization based on the distribution distance over listener-speaker pairs
  • PCC: Pearson Correlation Coefficient, a measure often used in psychology to quantify global synchrony
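Three of the metrics listed above have simple closed forms, sketched here in plain Python on toy one-dimensional signals. This is an illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, and the Fréchet distance is shown only for one-dimensional Gaussians, where it reduces to a formula over means and standard deviations. The general multivariate form requires covariance matrices and a matrix square root.

```python
import math

def l2_distance(pred, truth):
    """Euclidean distance between a predicted and a ground-truth vector."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def frechet_1d(a, b):
    """Squared Frechet distance between 1-D Gaussians fitted to a and b:
    (mean_a - mean_b)^2 + (std_a - std_b)^2."""
    def mean_std(v):
        m = sum(v) / len(v)
        return m, math.sqrt(sum((x - m) ** 2 for x in v) / len(v))
    ma, sa = mean_std(a)
    mb, sb = mean_std(b)
    return (ma - mb) ** 2 + (sa - sb) ** 2

print(l2_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(frechet_1d([0.0, 2.0], [1.0, 3.0]))   # 1.0
```

A PCC close to 1 between a speaker signal and a listener signal indicates that the two move together over the whole sequence, which is why it serves as a measure of global synchrony.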

The proposed model was compared on these metrics against the following baselines:

  • NN motion: a segment retrieval method often used in graphics synthesis. Given a speaker motion, it finds the nearest neighbor in the training set and uses the corresponding listener segment as the prediction
  • NN audio: the same method as above, but using audio embeddings from a pre-trained VGGish
  • Random: returns a 64-frame motion sequence of a randomly selected listener from the training set
  • Median: returns the median facial expression and pose from the training set
  • Mirror: returns a smoothed version of the speaker's motion
  • Delayed Mirror: returns the smoothed speaker motion delayed by 17 frames (about 0.5 seconds)
  • Let's Face It (LFI): a state-of-the-art 3D avatar generation method, retrained on this paper's dataset
  • Random Expression: returns a random expression at each time step
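Several of these baselines are simple enough to sketch directly. The code below illustrates the NN motion, Median, and Delayed Mirror ideas on toy one-dimensional signals. The smoothing filter (a moving average), the padding used for the delay, and all function names are assumptions chosen for illustration, since the article does not specify them.

```python
import statistics

def smooth(signal, window=3):
    """Centered moving average; the window is truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def delayed_mirror(speaker, delay=17, window=3):
    """Smoothed speaker motion shifted later by `delay` frames.
    The first value is held during the delay (an assumed padding choice)."""
    smoothed = smooth(speaker, window)
    padded = [smoothed[0]] * delay + smoothed
    return padded[:len(speaker)]

def median_baseline(train_values):
    """Median of a training-set quantity, returned for every input."""
    return statistics.median(train_values)

def nn_motion(query, train_pairs):
    """Return the listener segment paired with the training speaker
    segment closest to `query` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(train_pairs, key=lambda pair: sq_dist(query, pair[0]))
    return best[1]

# Toy training set of (speaker segment, listener segment) pairs.
train_pairs = [([0.0, 0.0], [5.0, 5.0]), ([1.0, 1.0], [7.0, 8.0])]
print(nn_motion([0.9, 1.1], train_pairs))  # [7.0, 8.0]
```

With `window=1` the smoothing step leaves the signal unchanged, so `delayed_mirror([1, 2, 3, 4], delay=2, window=1)` simply shifts the speaker signal by two frames.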

The verification results are shown in the table below.

The table shows that the proposed method achieves the most balanced performance across the evaluation metrics, indicating that the generated listener motion is realistic, diverse, and synchronized with the speaker.


This article introduced a model of the synchrony between speaker and listener motion in two-person conversation. It combines a motion-audio cross-attention transformer, which handles the multiple modalities of the speaker's input, with a novel sequence-encoding VQ-VAE, which captures the listener's non-deterministic response.

However, because the dataset was built from remotely recorded video, some issues remain, such as the lack of direct eye contact and the delay introduced by the remote connection. Future developments will be worth watching.

If you are interested, details of the dataset and the model architecture introduced in this article can be found in the paper.
