Comprehensive Evaluation of Generalized Emotion Recognition (GER) Using GPT-4V
3 main points
✔️ First study to quantitatively evaluate GPT-4V's performance in emotion recognition
✔️ Quantitative evaluation of GPT-4V on five tasks: visual emotion analysis, micro-expression recognition, facial emotion recognition, dynamic facial emotion recognition, and multimodal emotion recognition
✔️ GPT-4V excels at visual emotion analysis, even outperforming supervised results, but its performance degrades on micro-expression recognition, which requires specialized expertise
GPT-4V with Emotion: A Zero-shot Benchmark for Generalized Emotion Recognition
written by Zheng Lian, Licai Sun, Haiyang Sun, Kang Chen, Zhuofan Wen, Hao Gu, Bin Liu, Jianhua Tao
(Submitted on 7 Dec 2023 (v1))
Comments: Published on arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Emotion recognition has received much attention from researchers because emotions play an important role in human-computer interaction. Current research in emotion recognition focuses on two main aspects: the first is to identify the emotions elicited by stimuli, i.e., to predict how viewers will feel after viewing them; the second is to analyze human emotions expressed in images and videos. In this paper, these tasks are collectively referred to as Generalized Emotion Recognition (GER).
Emotions are conveyed through a variety of modalities, including text, audio, and video. Among them, visual information (color, brightness, facial expressions, human behavior, etc.) contains rich emotion-related content and plays an important role in the generalized emotion recognition task. To improve visual understanding, researchers have proposed various algorithms and made remarkable progress. With the development of deep learning, current research in generalized emotion recognition has shifted from hand-crafted feature design to deep neural networks.
Recently, GPT-4V has shown impressive visual understanding abilities across a variety of tasks. This raises the question of the extent to which GPT-4V can solve the generalized emotion recognition problem, and what research directions should be pursued now that GPT-4V is available.
In September 2023, GPT-4V was integrated into ChatGPT, and user reports investigating its visual capabilities followed. However, these reports generally provide only qualitative insight into GPT-4V because of the limited number of samples per task. OpenAI released an API in November 2023, but it was initially limited to 100 requests per day, so it remained difficult to evaluate GPT-4V against state-of-the-art systems on benchmark datasets. Recently, OpenAI has increased the daily request limit, allowing a more comprehensive evaluation.
This paper provides quantitative evaluation results for GPT-4V on the generalized emotion recognition task, covering visual emotion analysis, micro-expression recognition, facial emotion recognition, dynamic facial emotion recognition, and multimodal emotion recognition.
The figure below shows the overall results of GPT-4V, which outperforms random guessing but still falls short of supervised systems. To shed light on why, we comprehensively analyze GPT-4V's multifaceted performance, including its multimodal fusion ability, temporal modeling ability, color space robustness, and prediction consistency.
This paper is intended to offer suggestions to subsequent researchers and to clarify which tasks GPT-4V can effectively address and which still require further exploration.
Experiment Summary
In this paper, we have conducted a comprehensive evaluation across five tasks on 19 datasets. The table below provides statistics for each dataset.
The figure below shows a sample from each dataset. The datasets are diverse: some were collected in natural environments (e.g., AffectNet) and others in controlled laboratory environments (e.g., CASME and CK+); some use grayscale images (CK+) while others use RGB images (CASME and AffectNet).
Regarding the five tasks, the first, visual emotion analysis, aims to identify the emotions induced by images. Four datasets are used: Twitter I, Twitter II, ArtPhoto, and Abstract. Twitter I and Twitter II are collected from social websites, with Twitter I labeled by Amazon Mechanical Turk workers; ArtPhoto contains art photos from photo-sharing websites; and Abstract consists of abstract paintings evaluated by peers. These datasets are reclassified into two classes, positive and negative, and results are reported for this negative/positive classification task.
Five benchmark datasets are used for facial emotion recognition: CK+, FERPlus, SFEW 2.0, RAF-DB, and AffectNet. SFEW 2.0, RAF-DB, and AffectNet include RGB images. Specifically, CK+ contains 593 video sequences from 123 subjects, and the last three frames of each sequence were extracted to construct the dataset.
FERPlus is an extension of FER2013 in which each sample is relabeled by 10 annotators; SFEW 2.0 extracts keyframes from movie clips and includes diverse head poses, occlusions, and lighting conditions; RAF-DB contains thousands of samples of basic and compound facial expressions; and AffectNet has 8 labels, each containing 500 samples.
Micro-expression recognition aims to identify subtle changes in the human face. The evaluation uses apex frames and concentrates on the main emotion labels: CASME includes 195 samples across eight categories, with the evaluation focused on four main labels (tense, disgust, repression, and surprise); CASME II includes 247 samples collected from 26 subjects and focuses on five main labels (happiness, surprise, disgust, repression, and others); SAMM includes 159 samples, with the evaluation limited to labels that have 10 or more samples (anger, contempt, happiness, surprise, and others).
Dynamic facial emotion recognition focuses on more challenging image sequences. Four benchmark datasets (FERV39k, RAVDESS, eNTERFACE05, and DFEW) are used in this task. The first three use the official training/validation/testing split, and performance is evaluated on the official test set; DFEW has five folds containing 11,697 samples, and only results for fold 1 (fd1) are reported to reduce evaluation costs.
In addition, multimodal emotion recognition aims to integrate different modalities such as audio, video, and text to identify emotions. Three benchmark datasets (CH-SIMS, CMU-MOSI, and MER-MULTI) are used in this task: CH-SIMS and CMU-MOSI provide a sentiment intensity score for each sample, and the evaluation concentrates on the negative/positive classification task; MER-MULTI is a subset of the MER2023 dataset and provides discrete and dimensional labels for each sample, and this paper focuses on its discrete emotion recognition performance.
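As a concrete illustration of how a continuous intensity score can be turned into the negative/positive labels used in this evaluation, here is a minimal sketch. The zero threshold is a common convention in the CMU-MOSI/CH-SIMS literature and is an assumption here, not a detail taken from the paper.

```python
# Hypothetical sketch: map a continuous sentiment intensity score (as provided by
# CH-SIMS / CMU-MOSI) to the coarse negative/positive label used in the binary evaluation.
# The zero threshold is an assumed convention, not a detail stated in the paper.

def to_binary_label(score: float) -> str:
    """Return 'positive' for scores above zero, otherwise 'negative'."""
    return "positive" if score > 0 else "negative"

scores = [-1.8, -0.2, 0.6, 2.4]                      # example intensity scores
print([to_binary_label(s) for s in scores])          # ['negative', 'negative', 'positive', 'positive']
```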
GPT-4V Call Strategy
This paper evaluates the performance of the latest GPT-4V API, gpt-4-vision-preview. The generalized emotion recognition task involves a variety of modalities, including image, text, video, and audio, but the current GPT-4V version supports only image and text input. To process video data, each video is sampled and converted into multiple images (a sketch of this sampling is shown below). For audio data, we attempted to convert it into a mel spectrogram, but GPT-4V could not generate an adequate response to this input. Therefore, in this paper we focus the evaluation on images, text, and video, and we propose a batch-wise call strategy and a recursive call strategy to accommodate the API request limits and reduce the number of rejections caused by security checks.
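Since video must be reduced to a handful of images before it can be passed to GPT-4V, a uniform sampling step is needed. Below is a minimal sketch of such a step, assuming OpenCV is available; the helper name `sample_frames` and the default of three frames are illustrative choices, not the authors' code.

```python
# Minimal sketch of uniform frame sampling for video input (assumes OpenCV: `pip install opencv-python`).
# The paper samples frames evenly from each video; the exact code here is an illustrative assumption.
import cv2

def sample_frames(video_path: str, num_frames: int = 3):
    """Return `num_frames` frames taken at evenly spaced positions in the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        return []
    # Evenly spaced frame indices across the whole clip.
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```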
The current GPT-4V API has three request limits: tokens per minute (TPM), requests per minute (RPM), and requests per day (RPD). These limits impose additional requirements on prompt design.
To address the RPM and RPD constraints, we follow prior research and employ batch-wise input: multiple samples are fed into GPT-4V at once, and the model is asked to generate a response for each sample (a sketch of such a batch call is shown below). However, a large batch size may cause the total number of tokens to exceed the TPM limit. It also increases the difficulty of the task and may lead to erroneous output; for example, a batch of 30 samples may receive only 28 predictions. Therefore, we set the batch size to 20 for image-level input and 6 for video-level input, so that the TPM, RPM, and RPD limits are all satisfied simultaneously.
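Below is a minimal sketch of what a batch-wise call might look like, assuming the OpenAI Python SDK and base64-encoded images; the prompt wording and the helper names `encode_image` and `classify_batch` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a batch-wise GPT-4V call (assumes the OpenAI Python SDK: `pip install openai`).
# Several images are packed into a single request and the model is asked to label each of them,
# which keeps the number of requests within the RPM/RPD limits.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    """Base64-encode an image so it can be sent inline as a data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def classify_batch(image_paths, candidate_labels):
    """Send one batch of images and ask for one emotion label per image."""
    content = [{
        "type": "text",
        "text": (f"For each of the {len(image_paths)} images below, choose exactly one label "
                 f"from {candidate_labels} and answer with one label per line. "
                 "Please ignore the identity of any person shown."),
    }]
    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"},
        })
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=300,
    )
    return response.choices[0].message.content
```

In practice the reply still has to be parsed back into one label per sample, which is exactly where overly large batches start to produce missing predictions.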
The prompts for each task are shown in the table below.
Also, during the evaluation, the GER tasks tend to trigger GPT-4V's security checks. This mainly affects visual emotion analysis and human emotion recognition: the former involves violent images, while in the latter human identity is treated as sensitive information.
To reduce these errors, we instruct GPT-4V in the prompt to ignore the speaker's identity. However, security errors can still occur, and they occur at random: even though all images are human-centric, some pass the security check while others fail, and a sample that fails the check on the first attempt may pass on a retry. By calling GPT-4V multiple times on the same batch, we reduce the number of rejected cases.
Also, a batch input that fails the security check may pass if it is split into smaller portions. Therefore, for a batch that consistently fails, we split it into two smaller mini-batches and feed these mini-batches into GPT-4V, repeating this operation until no further splitting is possible. We call this strategy the "recursive call strategy"; the algorithm is as follows.
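A minimal Python sketch of this recursive call strategy is shown below, reusing the `classify_batch` helper sketched earlier; the retry count and the use of a generic exception to detect rejections are assumptions rather than the paper's exact algorithm.

```python
# Hypothetical sketch of the recursive call strategy: retry a failed batch a few times,
# then split it in half and recurse until single samples remain.

def recursive_call(image_paths, candidate_labels, max_retries: int = 2):
    """Return a list of (batch, raw GPT-4V reply) pairs; the reply is None if a single sample keeps failing."""
    for _ in range(max_retries):
        try:
            # Reuse the batch-wise helper sketched earlier.
            return [(image_paths, classify_batch(image_paths, candidate_labels))]
        except Exception:
            continue  # treat safety refusals / API errors as a failed attempt and retry
    if len(image_paths) == 1:
        return [(image_paths, None)]  # cannot split any further; give up on this sample
    mid = len(image_paths) // 2
    return (recursive_call(image_paths[:mid], candidate_labels, max_retries)
            + recursive_call(image_paths[mid:], candidate_labels, max_retries))
```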
Experimental Results
First, we report the performance of different methods on the five generalized emotion recognition tasks. Two heuristic baselines are included: random guessing and majority guessing. In random guessing, labels are randomly selected from the candidate categories; in majority guessing, the most frequent label is always predicted. For both baselines, 10 runs were conducted and the average results are reported.
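For reference, the two heuristic baselines are easy to reproduce; the following sketch uses made-up labels purely for illustration and is not the paper's evaluation code.

```python
# Illustrative sketch of the two heuristic baselines (made-up labels, not the paper's evaluation code).
import random
from collections import Counter

def random_guess_accuracy(true_labels, candidate_labels, runs: int = 10) -> float:
    """Average accuracy of picking a label uniformly at random, over several runs."""
    accs = []
    for _ in range(runs):
        preds = [random.choice(candidate_labels) for _ in true_labels]
        accs.append(sum(p == t for p, t in zip(preds, true_labels)) / len(true_labels))
    return sum(accs) / runs

def majority_guess_accuracy(true_labels) -> float:
    """Accuracy of always predicting the most frequent label."""
    majority = Counter(true_labels).most_common(1)[0][0]
    return sum(t == majority for t in true_labels) / len(true_labels)

true_labels = ["positive", "negative", "positive", "positive", "negative"]
print(random_guess_accuracy(true_labels, ["positive", "negative"]))
print(majority_guess_accuracy(true_labels))
```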
The table below shows the results of visual emotion analysis: GPT-4V outperforms supervised systems on most datasets. This superior performance stems from GPT-4V's powerful image-content understanding and reasoning capabilities, which allow it to accurately infer the emotional state an image evokes.
The table below shows the results for micro-expression recognition, where GPT-4V performed worse than the heuristic baseline. This suggests that GPT-4V is designed for emotions that can be recognized by the general public and is not suitable for tasks that require expertise.
Dynamic facial emotion recognition and multimodal emotion recognition identify emotions in video, but GPT-4V does not support video input, so frames are sampled evenly from each video and input sequentially; sampling at most three frames per video keeps the GPT-4V call cost down. The table below shows the results of facial emotion recognition.
While performance differences still exist between GPT-4V and supervised systems, it is noteworthy that GPT-4V significantly outperforms the heuristic baseline. These results demonstrate the potential of GPT-4V in emotion recognition. The table below shows the results for dynamic facial emotion recognition.
The table below shows the results for multimodal emotion recognition, where GPT-4V performs well on CMU-MOSI but relatively poorly on MER-MULTI. This difference arises because acoustic information is more important in MER-MULTI than in CMU-MOSI; since GPT-4V does not support speech input, information is lost on MER-MULTI, limiting its performance.
In addition, the multimodal understanding capability of GPT-4V is evaluated. Among all tasks, only multimodal emotion recognition provides multimodal information, so the experiments are conducted on this task. The table below reports unimodal and multimodal results: for CH-SIMS and MER-MULTI, multimodal results outperform unimodal results, indicating GPT-4V's ability to integrate multimodal information. However, on CMU-MOSI the multimodal results show a slight decrease compared to the unimodal results. This is because CMU-MOSI relies primarily on lexical information to convey emotion, and adding visual information can introduce interference.
Summary
In this paper, we provided an overall evaluation of GPT-4V on the generalized emotion recognition task: GPT-4V has very strong visual understanding capabilities and outperformed supervised systems in visual emotion analysis. However, its performance was worse on micro-expression recognition, which requires expert knowledge.
The paper also examines GPT-4V's temporal modeling and multimodal fusion capabilities, as well as its robustness to changes in color space. In addition, prediction consistency and the stability of the security checks are evaluated, and error cases are visualized to reveal the limitations of GPT-4V's emotion understanding.
The paper also serves as a zero-shot benchmark and provides guidance for future research on emotion recognition and multimodal large language models. Future work is expected to expand the scope of the evaluation to more emotion-related tasks and datasets.