
Gestures Enable Recognition of Emotions Not Observed in the Training Data!

3 main points
✔️ Proposed SC-AAE, a new Zero-Shot Framework based on an adversarial autoencoder
✔️ Proposed FS-GER, an algorithm for extracting feature vectors from 3D motion-captured gestures
✔️ 25-27% performance improvement compared to existing methods

Learning Unseen Emotions from Gestures via Semantically-Conditioned Zero-Shot Perception with Adversarial Autoencoders
written by Abhishek Banerjee, Uttaran Bhattacharya, Aniket Bera
(Submitted on 18 Sep 2020 (v1), last revised 2 Dec 2021 (this version, v2))
Comments: AAAI 2020

Subjects: Computer Vision and Pattern Recognition (cs.CV)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

The research field of AI-based emotion recognition is integral to various domains such as robotics and affective computing, and various methods have been proposed to recognize individuals' emotions from facial expressions, speech, and gestures.

However, one of the major challenges for these machine-learning-based emotion recognition algorithms has been the huge amount of labeled data required to build emotion classification models.

To address this problem, various approaches incorporating zero-shot learning have been proposed and developed.

This paper presents SC-AAE, a new zero-shot framework that significantly outperforms existing methods for emotion recognition from gestures.

SC-AAE Overview

The figure below shows the model outline of SC-AAE.

In our method, a gesture sequence of shape T (time steps) × V (nodes) × 3 (position coordinates) is taken as input, and feature vectors are generated by Fully Supervised Gesture Emotion Recognition (FS-GER), an emotion recognition algorithm.

The framework then learns a mapping between Seen classes (emotion classes used during training) and Unseen classes (emotion classes not used during training) based on the adversarial autoencoder architecture.

Zero-Shot Learning

First, we explain Zero-Shot Learning, which is mentioned many times in this paper.

Zero-Shot Learning is a research field in machine learning that predicts labels that have never appeared in the training data.

For example, when training on images of dogs and cats, general machine learning methods use the dog and cat labels directly, whereas Zero-Shot Learning uses semantic representations of the classes instead of the labels themselves.

Specifically, by converting the dog and cat labels into multi-dimensional feature vectors rather than single numbers, it becomes possible to identify words that are close in meaning, for example to infer that the vector for "horse", which was not observed during training, is closer to "dog" than to "cat". This makes it possible to infer the relevance of data that was not used in training.
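
As a rough illustration of this idea, the toy sketch below uses hypothetical low-dimensional vectors for the dog, cat, and horse labels and compares their Euclidean distances; the values are made up purely for illustration and are not from the paper.

```python
import numpy as np

# A toy illustration with hypothetical 3-dimensional "semantic" vectors;
# real systems use learned embeddings with hundreds of dimensions.
labels = {
    "dog":   np.array([0.9, 0.8, 0.1]),
    "cat":   np.array([0.8, 0.2, 0.7]),
    "horse": np.array([0.95, 0.9, 0.05]),  # never observed during training
}

def euclidean(a, b):
    return np.linalg.norm(labels[a] - labels[b])

# The unseen "horse" vector lies closer to "dog" than to "cat", so its
# relation to the seen classes can still be inferred.
print(euclidean("horse", "dog"))  # ~0.12
print(euclidean("horse", "cat"))  # ~0.97
```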

Using this method, this paper aims to train on the Seen classes (e.g., Relief, Shame, Pride), which consist of emotions observed from the gestures during training, and to detect the Unseen classes (e.g., Joy, Disgust, Neutral), which consist of emotions not observed from the gestures, at validation time.

Fully Supervised Gesture Emotion Recognition(FS-GER)

Next, we describe Fully Supervised Gesture Emotion Recognition (FS-GER), the emotion recognition algorithm used for feature extraction in our method.

The overall diagram of FS-GER is shown below.

The input to this network is a T (time steps) × V (nodes) × 3 (position coordinates) sequence of poses, and since a gesture is a periodic sequence of poses, Spatial Temporal Graph Convolutional Networks (ST-GCN) are used.

Then, Affective Features, which are emotion-related feature vectors extracted from the gestures during preprocessing, are added to the 128-dimensional vector obtained through a 1×1 convolution layer.

Existing research has shown that affective features extracted from gestures are relevant to the problem of emotion recognition, and the Affective Features here consist of the following two types (a computation sketch follows the list):

  • Posture features: extracted from the distances between pairs of joints and the angles and areas formed by three related joints
  • Motion features: consist of the accelerations of the relevant joints during the gesture
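
The sketch below computes simplified versions of these two feature types from a pose sequence of shape (T, V, 3). The joint pairs and triples, and the exact feature definitions used in the paper, are assumptions made here for illustration.

```python
import numpy as np

# Simplified posture features: distances between selected joint pairs plus
# the angle and triangle area formed by selected joint triples (first frame).
def posture_features(pose, joint_pairs, joint_triples):
    feats = []
    for i, j in joint_pairs:
        feats.append(np.linalg.norm(pose[0, i] - pose[0, j]))
    for i, j, k in joint_triples:
        a, b = pose[0, i] - pose[0, j], pose[0, k] - pose[0, j]
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        feats.append(np.arccos(np.clip(cos, -1.0, 1.0)))    # angle at joint j
        feats.append(0.5 * np.linalg.norm(np.cross(a, b)))  # triangle area
    return np.array(feats)

# Simplified motion features: mean acceleration magnitude per joint,
# approximated with second-order finite differences over time.
def motion_features(pose, dt=1.0):
    accel = np.diff(pose, n=2, axis=0) / (dt ** 2)           # (T-2, V, 3)
    return np.linalg.norm(accel, axis=-1).mean(axis=0)       # (V,)

# Example usage with a random (T=30, V=16, 3) sequence and made-up indices
pose = np.random.rand(30, 16, 3)
pf = posture_features(pose, joint_pairs=[(0, 5), (0, 10)], joint_triples=[(0, 1, 2)])
mf = motion_features(pose)
```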

This feature vector is then passed through a Fully Connected layer and a Softmax layer to generate labels for emotion classification (a sketch of the overall FS-GER pipeline is shown below).
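
The following is a minimal sketch of the FS-GER pipeline described above, assuming hypothetical layer sizes and using a plain temporal convolution as a stand-in for the ST-GCN backbone; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class FSGER(nn.Module):
    def __init__(self, affective_dim, num_classes, feat_dim=128):
        super().__init__()
        # Stand-in for ST-GCN: treats the pose sequence as a (3, T, V) tensor
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=(9, 1), padding=(4, 0)),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.conv1x1 = nn.Conv2d(64, feat_dim, kernel_size=1)     # 1x1 conv -> 128-d
        self.proj_affective = nn.Linear(affective_dim, feat_dim)  # match dimensions
        self.classifier = nn.Linear(feat_dim, num_classes)        # FC head

    def forward(self, pose, affective):
        x = pose.permute(0, 3, 1, 2)                 # (batch, T, V, 3) -> (batch, 3, T, V)
        x = self.backbone(x)
        x = self.conv1x1(x).flatten(1)               # (batch, 128)
        x = x + self.proj_affective(affective)       # add the affective features
        return torch.softmax(self.classifier(x), dim=-1)

# Example with a random batch: 8 gestures of 30 frames and 16 joints,
# plus a hypothetical 29-dimensional affective feature vector
model = FSGER(affective_dim=29, num_classes=6)
probs = model(torch.rand(8, 30, 16, 3), torch.rand(8, 29))  # (8, 6)
```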

Language Embedding

In our method, a 300-dimensional feature vector is extracted for each emotion word using the existing word2vec method.

Using this vector representation, it is possible to determine the degree of closeness (i.e., relevance) and distance (i.e., dissimilarity) between all the emotions in the data.

In our method, the set of emotions can be expressed as {e_i}, where each e_i ∈ ℝ^300 is the word2vec representation of an emotion word, and the relatedness of two particular emotions is given by the Euclidean distance between their vectors.
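
As an illustration, the sketch below loads a publicly available 300-dimensional word2vec model via gensim and measures Euclidean distances between emotion words; the exact embedding setup used in the paper may differ, and the assumption here is that these words are in the model's vocabulary.

```python
import numpy as np
import gensim.downloader as api

# Publicly available 300-dimensional Google News word2vec model
w2v = api.load("word2vec-google-news-300")

emotions = ["joy", "disgust", "pride", "relief", "shame", "neutral"]
vectors = {e: w2v[e] for e in emotions}   # each vector is in R^300

# Euclidean distance between two emotion embeddings
def distance(a, b):
    return np.linalg.norm(vectors[a] - vectors[b])

print(distance("joy", "pride"), distance("joy", "disgust"))
```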

The feature vectors obtained from FS-GER and the Language Embedding are then passed to separate Discriminators and used for training (a sketch of this adversarial setup is shown below).
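
The sketch below illustrates the adversarial alignment idea between gesture features and emotion word embeddings in the spirit of an adversarial autoencoder. Note that the paper uses separate discriminators for the two branches; this condensed single-discriminator version, together with all layer sizes and losses, is an assumption for illustration only and is not the authors' SC-AAE architecture.

```python
import torch
import torch.nn as nn

feat_dim, sem_dim, latent_dim = 128, 300, 64

enc_gesture = nn.Linear(feat_dim, latent_dim)   # gesture-feature branch
enc_language = nn.Linear(sem_dim, latent_dim)   # language-embedding branch
dec_gesture = nn.Linear(latent_dim, feat_dim)   # reconstruction decoder
disc = nn.Sequential(nn.Linear(latent_dim, 1), nn.Sigmoid())

bce, mse = nn.BCELoss(), nn.MSELoss()

def encoder_step(gesture_feat, word_vec):
    # Encoders learn to reconstruct the gesture features, stay close to the
    # matching emotion's embedding, and fool the discriminator.
    z_g, z_l = enc_gesture(gesture_feat), enc_language(word_vec)
    recon = mse(dec_gesture(z_g), gesture_feat)
    sem = mse(z_g, z_l)
    p = disc(z_g)
    fool = bce(p, torch.ones_like(p))
    return recon + sem + fool

def discriminator_step(gesture_feat, word_vec):
    # Discriminator learns to separate language codes ("real") from
    # gesture codes ("fake").
    z_g = enc_gesture(gesture_feat).detach()
    z_l = enc_language(word_vec).detach()
    p_real, p_fake = disc(z_l), disc(z_g)
    return bce(p_real, torch.ones_like(p_real)) + bce(p_fake, torch.zeros_like(p_fake))
```

In a typical zero-shot setup of this kind, a gesture whose emotion was never seen during training can then be classified by comparing its latent code with the embeddings of the Unseen emotion words.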

Performance of FS-GER

In this paper, to verify the performance of FS-GER, we compared it with existing methods for emotion recognition under the following conditions.

  • The network is trained from scratch with all body joints in the data as input
  • The Emotional Body Expressions Database (EBEDB) is used as the dataset
    • EBEDB consists of 3D motion captures of natural body gestures recorded while actors narrate certain lines
  • From the 11 emotion classes in the dataset, 6 Seen classes and 5 Unseen classes are constructed and classified

The classification accuracy of each method was as follows.

From the table, it is confirmed that our method outperforms the existing methods in classification accuracy by 7-18%.

Evaluation of our Zero-Shot Framework

Next, we compared SC-AAE, the Zero-Shot Framework of our method, with the existing methods.

The Harmonic Mean, the evaluation metric used for validation, is the harmonic mean of the classification accuracies of the Seen and Unseen classes (see the short example below).
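
For reference, the Harmonic Mean can be computed as follows, shown here with hypothetical accuracy values rather than results from the paper.

```python
# Harmonic mean of seen/unseen classification accuracies
def harmonic_mean(seen_acc, unseen_acc):
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

print(harmonic_mean(0.80, 0.40))  # ~0.533: penalizes imbalance between the two
```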

In addition, the following problems were reported for the existing methods:

  • CADA-VAE (Schonfeld et al. 2019) cannot create key features of the Unseen class when classifying emotions
  • In f-CLSWGAN (Xian et al. 2018), the GAN was conditioned for image classification, but mode collapse was significant
  • CVAE-ZSL (Mishra et al. 2018), built for the action recognition task, cannot generate robust features for emotion recognition

The above problems did not occur in SC-AAE, and the validity of this method was confirmed by comparison with existing studies.

Summary

How was it? In this article, I introduced SC-AAE, a new zero-shot framework for an emotion recognition model that uses gestures as input.

Although the effectiveness of this method was confirmed by comparison with existing studies, the following issues remain.

  • The word2vec used in the model is a general language embedding model and is not specific to emotion recognition, so it cannot capture all aspects of psychological and emotional diversity
  • More emotional modalities, such as speech and eye movements, need to be incorporated for more robust classification

It will be interesting to see whether a method that can solve these problems and further improve classification accuracy will emerge in the future. The details of the architecture of the model introduced in this article can be found in this paper, so please refer to it if you are interested.
