A New Gesture Generation GAN That Takes Into Account Human Emotions!
3 main points
✔️ Proposed a GAN-based model to generate upper body gestures considering human emotional expression while maintaining speaker style
✔️ Introducing MFCC Encoder, Affective Encoder, etc. to learn latent emotion features
✔️ Confirmed gesture generation that achieves state-of-the-art across multiple metrics
Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning
written by Uttaran Bhattacharya, Elizabeth Childs, Nicholas Rewkowski, Dinesh Manocha
(Submitted on 31 Jul 2021)
Comments: ACM Multimedia 2021
Subjects: Multimedia (cs.MM); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
A co-speech gesture is a body expression that accompanies human speech. There are various types, such as beat gestures (rhythmic gestures synchronized with speech), iconic gestures (gestures expressing physical concepts such as "big" or "small" by opening or closing the arms), and metaphoric gestures (gestures expressing abstract concepts such as "love" by placing the hands on the chest).
The generation of such co-speech gestures is an important task for creating attractive characters and virtual agents in modern society, and various models for generating co-speech gestures have been proposed.
However, one problem with existing methods is that they cannot generate gestures that take emotional expression into account, even though humans are known to change their gesture style depending on their emotions (e.g., when angry, they wave their arms faster and move their heads more).
In this paper, the authors introduce a GAN-based model, built around components such as an MFCC Encoder and an Affective Encoder, that can generate upper-body gestures with such emotional expression.
As shown in the figure below, this model consists of a Generator built from four Encoders, and a Discriminator that distinguishes gestures generated by the Generator from real gestures.
The Generator of this method consists of the following four Encoders.
1. MFCC Encoder
MFCCs (Mel-Frequency Cepstral Coefficients) are features based on auditory filters that are commonly used in speech recognition. In this method, the MFCC Encoder is designed to incorporate emotional features obtained from speech, such as intonation captured by the MFCCs, into gesture generation.
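As a rough illustration, the MFCC pipeline (framing, power spectrum, mel filterbank, log, DCT) can be sketched as follows. This is a deliberately minimal NumPy version with illustrative parameters; practical systems typically rely on audio libraries instead:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_mfcc=13):
    """Simplified MFCC: framing -> power spectrum -> mel filterbank -> log -> DCT."""
    # Split the signal into overlapping, windowed frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop+frame_len] for i in range(n_frames)])
    frames = frames * np.hanning(frame_len)
    # Power spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank (minimal construction)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, spec.shape[1]))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m-1], bins[m], bins[m+1]
        fbank[m-1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m-1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(spec @ fbank.T + 1e-10)
    # DCT-II over the mel axis keeps the first n_mfcc coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return log_mel @ dct.T   # shape: (n_frames, n_mfcc)

# Example: one second of a 440 Hz tone at 16 kHz
t = np.arange(16000) / 16000.0
feats = mfcc(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 13)
```

Each row of the result is one frame's MFCC vector; a sequence of such vectors is what the MFCC Encoder consumes.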
2. Text Encoder
The Text Encoder processes the text transcript corresponding to the speech. The method converts word sequences into features using a pretrained FastText word embedding model.
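Conceptually, this step is a per-word embedding lookup. The sketch below uses a tiny random table as a stand-in for the pretrained FastText vectors (the vocabulary and the 300-dimensional size are illustrative assumptions):

```python
import numpy as np

# Toy stand-in for a pretrained FastText table: word -> 300-dim vector.
# In practice one would load real pretrained vectors instead.
rng = np.random.default_rng(0)
vocab = {"i": 0, "was": 1, "really": 2, "bored": 3, "<unk>": 4}
emb = rng.standard_normal((len(vocab), 300))

def encode_text(words):
    """Map a word sequence to a (seq_len, 300) feature matrix."""
    ids = [vocab.get(w.lower(), vocab["<unk>"]) for w in words]
    return emb[ids]

text_feats = encode_text("I was really bored".split())
print(text_feats.shape)  # (4, 300)
```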
3. Speaker Encoder
The Speaker Encoder converts Speaker IDs into one-hot vectors, which are then passed through two fully connected layers.
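A minimal sketch of this encoder might look like the following (the layer sizes and the ReLU activation are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

n_speakers, d_hidden, d_out = 10, 64, 16   # illustrative sizes
rng = np.random.default_rng(0)
W1 = rng.standard_normal((n_speakers, d_hidden)) * 0.1
W2 = rng.standard_normal((d_hidden, d_out)) * 0.1

def encode_speaker(speaker_id):
    """One-hot speaker ID -> two fully connected layers."""
    one_hot = np.zeros(n_speakers)
    one_hot[speaker_id] = 1.0
    h = np.maximum(one_hot @ W1, 0.0)   # FC layer 1 + ReLU (assumed activation)
    return h @ W2                        # FC layer 2

style = encode_speaker(3)
print(style.shape)  # (16,)
```

The resulting vector acts as a learned style embedding for the given speaker.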
4. Affective Encoder
In this method, an encoding mechanism is proposed to convert pose-based emotional expressions into features. Since gestures mostly consist of trunk, arm, and head movements, 10 joints corresponding to these parts are considered. The joints are treated as vertices, the connections running from the trunk out to the limbs as directed edges, and the Encoder is trained along the direction of these edges. For this hierarchical encoding, STGCNs (Spatial-Temporal Graph Convolutional Networks) are used.
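The spatial half of such a graph convolution can be sketched as below: each joint aggregates features from itself and its parent along the edge direction, then projects them. The joint indexing, skeleton layout, and dimensions are hypothetical, and the temporal convolution of a full STGCN is omitted:

```python
import numpy as np

# 10 upper-body joints with a hypothetical indexing:
# trunk -> spine -> neck -> head, and two arms branching from the spine.
edges = [(0, 1), (1, 2), (2, 3),   # trunk to head
         (1, 4), (4, 5), (5, 6),   # left arm
         (1, 7), (7, 8), (8, 9)]   # right arm
n_joints, d_in, d_out = 10, 3, 8

A = np.eye(n_joints)               # self-loops
for parent, child in edges:
    A[child, parent] = 1.0         # information flows along edge direction
A = A / A.sum(axis=1, keepdims=True)   # row-normalize the adjacency

rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out)) * 0.1

def spatial_graph_conv(X):
    """One spatial graph-convolution step: aggregate neighbors, then project."""
    return np.maximum(A @ X @ W, 0.0)

pose = rng.standard_normal((n_joints, d_in))   # 3D joint positions, one frame
conv_out = spatial_graph_conv(pose)
print(conv_out.shape)  # (10, 8)
```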
Finally, the feature sequences obtained from the four Encoders are concatenated and fed to a Bi-GRU (Bidirectional Gated Recurrent Unit); the output is then passed through a fully connected layer and a Leaky ReLU to generate gestures.
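This fusion step can be sketched as follows, assuming illustrative dimensions. Real Bi-GRUs use separate weights per direction; one shared GRU is run both ways here for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_h = 12, 8   # illustrative sizes; the paper's dimensions differ
rng = np.random.default_rng(0)
# One set of GRU weights: update gate z, reset gate r, candidate state h
Wz, Wr, Wh = (rng.standard_normal((d_in + d_h, d_h)) * 0.1 for _ in range(3))

def gru_step(x, h):
    xh = np.concatenate([x, h])
    z = sigmoid(xh @ Wz)                                  # update gate
    r = sigmoid(xh @ Wr)                                  # reset gate
    h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)    # candidate state
    return (1 - z) * h + z * h_tilde

def bi_gru(seq):
    """Run the GRU forward and backward, concatenate per-step states."""
    fwd, bwd = [], []
    h = np.zeros(d_h)
    for x in seq:           # forward pass
        h = gru_step(x, h)
        fwd.append(h)
    h = np.zeros(d_h)
    for x in seq[::-1]:     # backward pass
        h = gru_step(x, h)
        bwd.append(h)
    return np.concatenate([np.stack(fwd), np.stack(bwd[::-1])], axis=1)

# Concatenated encoder features for 5 time steps -> Bi-GRU -> FC + Leaky ReLU
seq = rng.standard_normal((5, d_in))
W_out = rng.standard_normal((2 * d_h, 10)) * 0.1   # 10 = assumed pose dimension
z = bi_gru(seq) @ W_out
poses = np.where(z > 0, z, 0.01 * z)               # Leaky ReLU
print(poses.shape)  # (5, 10)
```

Each output row corresponds to one generated pose frame.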
The Discriminator in our method takes the gestures generated by the Generator and computes the feature sequence using the Affective Encoder.
It then applies a Bi-GRU to this feature sequence, sums the bidirectional outputs through a fully connected layer, and applies a sigmoid function, so that the Discriminator identifies whether a gesture is real (a gesture from the dataset) or fake (a gesture produced by the Generator).
By repeating this sequence of learning in an adversarial manner, gesture generation incorporating emotional expressions is possible.
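The adversarial objective behind this loop is the standard GAN binary cross-entropy, sketched below on hypothetical Discriminator outputs (the paper additionally uses other loss terms not shown here):

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy on sigmoid outputs."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

# Hypothetical Discriminator outputs (probability that a gesture is real)
d_real = np.array([0.9, 0.8, 0.95])   # on gestures from the dataset
d_fake = np.array([0.2, 0.1, 0.3])    # on gestures from the Generator

# Discriminator: push real toward 1 and fake toward 0
d_loss = bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))
# Generator: make the Discriminator call its fakes real
g_loss = bce(d_fake, np.ones_like(d_fake))

print(round(d_loss, 3), round(g_loss, 3))  # 0.355 1.705
```

Minimizing these two losses in alternation is what drives the Generator toward gestures the Discriminator cannot tell apart from real ones.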
In this paper, two experiments were conducted: a comparative verification against existing methods and a user study on the generated gestures.
Comparative verification with existing methods
In this paper, comparative verification was performed under the following conditions.
- For the datasets, the TED Gesture Dataset and the GENEA Challenge 2020 Dataset are used, two standard benchmarks for gesture generation methods
- On the TED Gesture Dataset, the method is compared with existing approaches such as Seq2Seq, Speech to Gestures with Individual Styles (S2G-IS), the Joint Embedding Model (JEM), and Gestures from Trimodal Context (GTC)
- For a fair comparison, each method uses a pre-trained model provided by the authors
Below you can see the gesture generation results for two different samples taken from the TED Gesture Dataset.
The generated gestures are, from top to bottom: the original speaker's gestures, the gestures generated by GTC (the current state-of-the-art gesture generation method), the gestures generated by the proposed model without the MFCC Encoder (ablation study), the gestures generated by the proposed model without the Affective Encoder (ablation study), and the gestures generated by the full proposed model.
From these results, we can see that
- In the absence of MFCC Encoder, it matches the content of the speech but is not able to generate gestures that take into account the emotional features of the speech
- For example, for phrases such as "I was" or "I believe", it can generate a gesture pointing at the speaker, but it cannot generate expressions for emotions such as "bored"
- Without Affective Encoder, the generated gestures only show slight body movements and do not take into account important emotional expressions
- On the other hand, the proposed model in this paper can generate appropriate emotional expressions for speech
- For example, "excited" produces quick arm movements, while "bored" produces a gesture with dropped arms and shoulders
User study on the generated gestures
In this paper, a user study was conducted under the following conditions.
- To determine the extent to which the generated gestures matched the emotional expressions, 24 participants were surveyed
- Each participant was surveyed using gestures corresponding to speeches taken from the TED Gesture Dataset
- The three types of gestures used in the study are the original speaker's gesture, the gesture generated by the proposed model in this paper, and the gesture generated by GTC
- Participants respond to two questions on a 5-point scale from 1 to 5 (1 being the worst and 5 being the best)
The figure below shows the results of the user study for the two questions: (a) how plausible the gesture seems, and (b) how well the gesture matches the emotional expression.
In (a), 15.28% more participants responded with 4 or 5 compared to the gestures generated by GTC, and 3.82% more compared to the original speaker's gestures, indicating that participants judged the gestures generated by this method to be better than the existing method and of comparable quality to the original data.
In (b), 16.32% more participants responded with 4 or 5 compared to the GTC-generated gestures, and 4.86% more compared to the original speaker's gestures, indicating that participants judged the generated gestures to be appropriately in sync with the expressed emotions.
Thus, both the comparative verification against existing methods and the user study on the generated gestures show that the gestures generated by this method sufficiently take the speaker's emotions into account.
What do you think? Generating more human-like gestures is useful in various multimedia applications such as counseling and robotic assistants, and this field is expected to develop further in the future.
On the other hand, there are points to be improved, such as the inability to handle expressions like sarcasm, where the content of the speech and the emotion do not match, and the fact that generation is limited to upper-body gestures. The details of the model architecture and the generated gestures can be found in the paper, so please refer to it if you are interested.