
Social-MAE: A Social AI That Uses Self-Supervised Learning to Decipher Emotions, Laughter, and Personality
3 main points
✔️ Social-MAE is a self-supervised multimodal model that integrates face and voice processing
✔️ Pre-trained with VoxCeleb2 and applied to emotion recognition, laughter detection, and personality estimation
✔️ Experiments show higher accuracy than existing methods and demonstrate its effectiveness in understanding social behavior
Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice
written by Hugo Bohy, Minh Tran, Kevin El Haddad, Thierry Dutoit, Mohammad Soleymani
(Submitted on 24 Aug 2025)
Comments: 5 pages, 3 figures, IEEE FG 2024 conference
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
Human social behavior and emotional expressions are transmitted through multiple modalities such as facial expressions and voice.
Accurately understanding real social situations therefore requires multimodal processing that combines both.
However, large-scale labeled data on emotions and social behaviors is difficult to collect, and conventional methods rely on transfer learning from models trained on general-purpose data or on training with small-scale datasets.
Social-MAE proposed in this study is a Transformer-based multimodal self-supervised learning model based on a masked autoencoder (MAE) that simultaneously processes faces and speech.
It was pre-trained on the large social interaction dataset VoxCeleb2 and subsequently applied to downstream tasks such as emotion recognition, laughter detection, and apparent personality estimation.
As a result, Social-MAE achieves state-of-the-art performance, demonstrating the effectiveness of self-supervised learning for integrating multimodal information.
Proposed Methodology
Social-MAE is a multimodal autoencoder model built by extending the existing CAV-MAE architecture.
The distinctive feature of this model is that it uses eight frames instead of a single frame as video input, which enables it to capture temporal changes in facial expressions with high accuracy.
In this architecture, audio and video are each processed by a dedicated Transformer-based encoder and then integrated by a joint encoder.
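To make this layout concrete, the following is a minimal PyTorch sketch of the two-stream-plus-joint-encoder structure; the module names, depths, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """A stack of standard Transformer encoder blocks for one token stream (illustrative)."""
    def __init__(self, dim=768, depth=11, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):                       # tokens: (batch, num_patches, dim)
        return self.blocks(tokens)

class SocialMAESketch(nn.Module):
    """Two modality-specific encoders followed by a shallow joint encoder (hypothetical layout)."""
    def __init__(self, dim=768):
        super().__init__()
        self.audio_encoder = ModalityEncoder(dim)    # audio spectrogram patch tokens
        self.video_encoder = ModalityEncoder(dim)    # patch tokens drawn from 8 frames, not 1
        self.joint_encoder = ModalityEncoder(dim, depth=1)

    def forward(self, audio_tokens, video_tokens):
        a = self.audio_encoder(audio_tokens)
        v = self.video_encoder(video_tokens)
        joint = self.joint_encoder(torch.cat([a, v], dim=1))  # fuse along the token axis
        return a, v, joint
```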
In addition to the MAE mechanism, which masks part of the input and reconstructs the missing parts, the model combines contrastive learning to align feature representations between audio and video.
This allows for the extraction of common information across modalities, while also retaining representations specific to each modality.
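The combined objective can be illustrated as a reconstruction loss computed on the masked patches plus an InfoNCE-style contrastive loss that treats matched audio/video clips in a batch as positives. The weighting, temperature, and tensor layout below are assumptions for illustration, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def mae_contrastive_loss(recon_a, target_a, recon_v, target_v,
                         a_pooled, v_pooled, mask_a, mask_v,
                         temperature=0.05, lam=0.01):
    """Illustrative MAE reconstruction + audio-video contrastive objective (placeholder weights)."""
    # Reconstruction: mean squared error, averaged over the masked patches only.
    rec = (((recon_a - target_a) ** 2).mean(-1) * mask_a).sum() / mask_a.sum() \
        + (((recon_v - target_v) ** 2).mean(-1) * mask_v).sum() / mask_v.sum()

    # Contrastive alignment: matched audio/video clips in the batch are positives.
    a = F.normalize(a_pooled, dim=-1)
    v = F.normalize(v_pooled, dim=-1)
    logits = a @ v.t() / temperature                 # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    con = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    return rec + lam * con
```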
A large audio/video dataset, VoxCeleb2, was used for training, and representations were acquired from unlabeled data through self-supervised learning.
This approach yields pre-trained representations tailored to social behavior recognition and provides the flexibility to adapt to a variety of downstream tasks with only small amounts of labeled data.
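As a rough illustration of this downstream adaptation, a small task head can be attached on top of the pre-trained joint encoder and fine-tuned with the limited labels; the pooling strategy and head below are hypothetical, not the authors' fine-tuning recipe.

```python
import torch.nn as nn

class DownstreamHead(nn.Module):
    """Hypothetical fine-tuning wrapper: mean-pool the joint-encoder tokens
    and attach a small task head (e.g. 6 outputs for emotion classes)."""
    def __init__(self, backbone, dim=768, num_outputs=6):
        super().__init__()
        self.backbone = backbone                 # pre-trained Social-MAE-style encoders
        self.head = nn.Linear(dim, num_outputs)

    def forward(self, audio_tokens, video_tokens):
        _, _, joint = self.backbone(audio_tokens, video_tokens)
        return self.head(joint.mean(dim=1))      # pooled representation -> logits or regression targets
```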
Experiments
To confirm the effectiveness of the proposed method, the authors compared its performance with CAV-MAE and existing baselines on three downstream tasks.
First, for emotion recognition, the CREMA-D dataset was used for six-class emotion classification, covering anger, joy, sadness, and other emotions.
As a result, Social-MAE achieved an F1 score of 0.837, showing better accuracy than existing models.
Next, for apparent personality estimation on the ChaLearn First Impressions dataset, the model regressed the Big Five traits, such as extraversion and agreeableness.
Social-MAE achieved an average accuracy of 90.3%, comparable to conventional methods despite being trained for only a small number of epochs.
In addition, it achieved an F1 score of 0.776 for the detection of laughter and smiles on the NDC-ME dataset, significantly outperforming conventional CNN-based methods.
These results confirm that the introduction of self-supervised pre-training and multiple frame processing dramatically improves the performance of social behavior understanding.