Innovative Speech Emotion Recognition: Exploring Gender Information Integration And Advanced Pooling Methods Using WavLM Large
3 main points
✔️ Investigates different pooling methods and the incorporation of gender and textual information to improve the accuracy of speech emotion recognition.
✔️ Proposes a method to improve the accuracy of emotion classification using gender labels and text annotations.
✔️ Experiments with the MSP Podcast corpus showed that standard deviation pooling performed best.
Adapting WavLM for Speech Emotion Recognition
written by Daria Diatlova, Anton Udalov, Vitalii Shutov, Egor Spirin
(Submitted on 7 May 2024)
Comments: Published on arXiv.
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
SER (Speech Emotion Recognition) is in increasing demand in a variety of fields such as customer service, medicine, and virtual assistants, where automatically detecting a speaker's emotional state from speech can be used to measure customer satisfaction or to monitor mental health. This research explores approaches for emotion recognition from speech using Self-Supervised Learning (SSL) models, in particular large-scale pre-trained models, which can extract useful features from unlabeled data and learn effectively even when labeled data is scarce.
Related Research
WavLM Overview
WavLM is a self-supervised speech processing model with a transformer-based architecture. The model is pre-trained on a large corpus of speech and performs well on tasks such as speech denoising and masked speech prediction. In particular, WavLM is able to capture fine-grained features in speech, which allows for more accurate identification of emotional nuances.
What is Self-Supervised Learning (SSL)?
SSL is the process by which models learn features using unlabeled data. This method can leverage large amounts of unlabeled data to learn powerful representations for use in downstream tasks. In the context of speech emotion recognition, the SSL model serves as prior knowledge for extracting emotional features from speech data, enhancing learning with limited labeled data.
Proposed Method
This research applies several new methods to WavLM to improve the accuracy of speech emotion recognition. These include temporal dimensional pooling, integration of gender information, and utilization of textual data.
Time Dimension Pooling
Standard deviation pooling and attention pooling were introduced to capture the temporal characteristics of speech data. These techniques aim to highlight features of speech that are important for emotion recognition. Standard deviation pooling computes the deviation from the mean and captures emotional intensity and variability. Attention pooling allows the model to focus on important time frames and provides a better understanding of context in emotion identification.
Use of Gender Information
Gender is known to influence emotional expression, and incorporating this information into the model is expected to result in more accurate emotion recognition. The use of gender information provides additional cues to the model to identify different expressions of emotion in the same utterance.
Integration of Text Information
The textual content of an utterance is another important factor that helps in understanding emotion. In this study, the textual information corresponding to an utterance is encoded using the Sentence Transformer, and the resulting textual embeddings are combined with speech features to increase the contextual depth of emotion recognition.
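One simple way to combine the two modalities is late fusion by concatenation. The sketch below assumes the speech and text embeddings are already computed; the dimensions and the plain-concatenation scheme are illustrative assumptions, not necessarily the paper's fusion method:

```python
import numpy as np

def fuse_speech_text(speech_vec, text_vec):
    """Concatenate a pooled speech embedding with a sentence embedding.

    speech_vec: e.g. the pooled WavLM utterance vector
    text_vec:   e.g. a Sentence Transformer embedding of the transcript
    The joint vector would then be fed to the emotion classifier head."""
    return np.concatenate([speech_vec, text_vec])
```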
Experiment
Experimental Setup
The experiment was conducted using the MSP Podcast corpus, which is divided into training, development, and test sets. The dataset contains 90,522 utterances, each assigned an emotion label. The development set was used to evaluate the performance of each model.
Impact of Pooling Methods
In experiments comparing standard deviation pooling and attention pooling, standard deviation pooling achieved the highest macro F1 score (see Figure 1). This demonstrates the effectiveness of the variability captured by standard deviation pooling for detecting the subtle nuances of emotion.
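For reference, the macro F1 score averages the per-class F1 values, so rare emotion classes weigh as much as frequent ones. A minimal sketch (not tied to any particular library):

```python
import numpy as np

def f1_macro(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))
```

Because every class contributes equally to the average, a model that ignores minority emotions is penalized, which makes macro F1 a common choice for imbalanced SER datasets.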
Impact of Gender Information
In the experiment with the addition of gender information, we integrated gender information in two ways, "summing" and "multiplying," and observed that both improved performance (see Table 2). This suggests that, since gender is closely related to emotional expression, taking gender information into account allows the model to identify emotions more accurately.
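The two integration schemes can be sketched as follows. The embedding table, its dimension, and the random initialization are illustrative assumptions; in the actual model the gender embedding would be learned jointly with the rest of the network:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8  # illustrative; would match the speech feature dimension
gender_table = rng.normal(size=(2, EMB_DIM))  # learned embeddings in practice

def integrate_gender(speech_vec, gender_id, mode="sum"):
    """Inject a gender embedding into the pooled speech features."""
    g = gender_table[gender_id]
    if mode == "sum":
        return speech_vec + g   # additive conditioning
    if mode == "multiply":
        return speech_vec * g   # elementwise gating
    raise ValueError(f"unknown mode: {mode}")
```

Summing shifts the speech representation by a gender-specific offset, while multiplying rescales each feature dimension; the article reports that both variants improved performance.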
Discussion
In this study, we investigated different fine-tuning methods for speech emotion recognition using the WavLM Large model. The experiments provide a clearer picture of how pooling methods and the integration of additional information affect the performance of the SER model. These findings contribute to the evolution of emotion recognition techniques, but they come with several important caveats.
Emotional Complexity and Model Adaptability
Emotions have very complex and multi-layered characteristics, making it difficult to capture all of them with a single feature or method. The high performance of standard deviation pooling may be due to its ability to capture minute changes in emotion by focusing on the most volatile parts of the emotional expression. However, this approach may not be optimal for all emotions and contexts, and must be adjusted for each scenario.
Methods of Integrating Information and Their Effects
It is clear that the way in which gender information is integrated has a significant impact on the performance of the model. This result suggests that how additional information is integrated into the model is important. Integrating not only gender, but other personal identifiers (e.g., age, region, etc.) as well, may enable even more accurate emotion recognition. However, careful consideration is needed because this information does not necessarily contribute to accurate emotion recognition and in some cases may cause the model to be biased.
Model Versatility and Specialization
Whether the high performance in the development set can be reproduced in a variety of real-world scenarios is another important question. Because of the gap that exists between laboratory conditions and real-world conditions, further validation is needed to increase the versatility of the model. It may also be beneficial to develop emotion recognition models specific to particular cultures and languages, considering global applications.
Conclusion
Through fine-tuning of the WavLM Large model, it was confirmed that standard deviation pooling and the integration of gender information, in particular, contribute to improved performance in speech emotion recognition. However, the integration of textual information did not have the expected effect, and further improvements in this direction are needed. Future research should explore the development of more emotion-sensitive text encoders and more effective methods of integrating textual information.