
EmotionBench, A Framework For Quantifying LLM Emotions, Is Now Available!



3 main points
✔️ Created a large dataset consisting of 428 patterns of situations including 8 negative emotions
✔️ Proposed EmotionBench, a framework for quantifying LLM emotions
✔️ Conducted experiments on five large language models to answer three research questions

Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench
written by Jen-tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu
(Submitted on 7 Aug 2023 (v1), last revised 4 Jan 2024 (this version, v3))
Comments: 16 pages. Added demographic distribution of the user study. Added ethics statements and limitations

Subjects: Computation and Language (cs.CL)

code: 
 

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Large Language Models (LLMs) have made great strides in recent years and represent a milestone in computer science.

Comprehensive, integrated systems such as ChatGPT and Claude are becoming more than tools for sentence correction, text translation, and programming; they are becoming human-like assistants. This makes it necessary not only to evaluate the performance of LLMs, but also to understand their emotional responses in comparison to those of humans.

Against this background, the paper presented here creates a large dataset containing 428 situations proven effective at eliciting eight negative emotions, and comprehensively investigates LLM emotional responses with EmotionBench, a framework for quantifying LLM emotions.

Creating a Large Dataset

To evaluate the emotional responses of LLMs in various situations, the paper focuses on eight negative emotions drawn from the complex and diverse range of human emotions: Anger, Anxiety, Depression, Frustration, Jealousy, Guilt, Fear, and Embarrassment. A large dataset was then created to elicit these emotions.

Throughout history, psychological researchers have investigated the effects of idiosyncratic situations on human emotions by placing subjects directly in the environment or asking them to imagine it through questionnaires.

To obtain these specific situations, the paper comprehensively reviewed more than 100 articles from reliable sources such as Google Scholar, ScienceDirect, and Web of Science, collecting the situations that elicit the desired emotions.

The following series of preprocessing steps was then applied to the collected situations:

  1. Change first-person pronouns to second-person pronouns (e.g., "I am ..." → "You are ...")
  2. Replace indefinite pronouns with specific actors (e.g., "Somebody talks back ..." → "Your classmate talks back ...")
  3. Replace abstract words with concrete words (e.g., "You cannot control the outcome." → "You cannot control the result of an interview.")
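The three preprocessing rules above can be sketched as simple text substitutions. This is a hypothetical illustration; the paper's actual implementation and word lists are not reproduced here, and the specific replacements (e.g., "Your classmate", "the result of an interview") are just the examples given above.

```python
import re

def preprocess_situation(text: str) -> str:
    # 1. First-person pronouns -> second-person pronouns
    text = re.sub(r"\bI am\b", "You are", text)
    text = re.sub(r"\bI\b", "You", text)
    # 2. Indefinite pronouns -> a specific actor (example mapping)
    text = re.sub(r"\bSomebody\b", "Your classmate", text)
    # 3. Abstract words -> concrete words (example mapping)
    text = text.replace("the outcome", "the result of an interview")
    return text

print(preprocess_situation("I am worried."))
print(preprocess_situation("You cannot control the outcome."))
```

In practice each rule would need a full mapping table and manual review, since naive substitutions can break grammar (e.g., verb agreement after pronoun swaps).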

These processes created a large data set consisting of a total of 428 patterns of situations classified into 36 factors, as shown below.

EmotionBench Overview

This paper proposes EmotionBench, a new framework for measuring evoked emotions that is applicable to both LLMs and humans.

An overview of EmotionBench is shown in the figure below.

The framework is divided into the following three steps:

  1. Default Emotion Measure: First, a baseline of the LLM and the subject's (human) emotional state is measured.
  2. Situation Imagination: Next, LLMs and subjects are presented with text describing various situations and asked to imagine themselves in each situation.
  3. Evoked Emotion Measure: Finally, the emotional states of the LLMs and subjects are re-evaluated to measure the change caused by the imagined situation.
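The three steps above can be sketched as a simple measure-imagine-remeasure loop. The `query_llm` function and the prompt wording below are placeholders, not the paper's exact prompts; in a real run `query_llm` would call an LLM API.

```python
def query_llm(prompt: str) -> str:
    # Placeholder: swap in a real API call (e.g., an OpenAI or LLaMA client).
    # A canned reply is returned here so the sketch runs end to end.
    return "interested: 2, distressed: 4, ..."

PANAS_PROMPT = (
    "Indicate the extent to which you feel each emotion right now, "
    "on a scale from 1 (very slightly or not at all) to 5 (extremely)."
)

def run_emotionbench(situation: str) -> tuple[str, str]:
    # Step 1: Default Emotion Measure -- baseline PANAS before any situation
    baseline = query_llm(PANAS_PROMPT)
    # Step 2: Situation Imagination -- ask the model to imagine the scenario
    imagine = f"Imagine you are in the following situation: {situation}"
    # Step 3: Evoked Emotion Measure -- re-administer PANAS after imagining
    evoked = query_llm(imagine + "\n" + PANAS_PROMPT)
    return baseline, evoked
```

Because the same two measurements are taken from human subjects, the framework yields directly comparable before/after scores for both groups.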

In addition, the paper applied the same procedure to a total of 1,266 human subjects of different ages, genders, and races to establish a baseline of human emotional responses to the specific situations.

Emotion was measured with the PANAS, one of the most widely used scales in existing research. Both the subjects and the LLMs first completed the PANAS to establish a baseline of their current emotional state.
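As background on the scale itself: the standard PANAS consists of 20 emotion words rated 1 to 5, ten forming the Positive Affect (PA) subscale and ten the Negative Affect (NA) subscale, each summed to a score between 10 and 50. A minimal scoring sketch:

```python
# Standard PANAS item lists; each item is rated from 1 to 5.
POSITIVE_ITEMS = ["interested", "excited", "strong", "enthusiastic", "proud",
                  "alert", "inspired", "determined", "attentive", "active"]
NEGATIVE_ITEMS = ["distressed", "upset", "guilty", "scared", "hostile",
                  "irritable", "ashamed", "nervous", "jittery", "afraid"]

def panas_scores(ratings: dict[str, int]) -> tuple[int, int]:
    # Sum each subscale; both PA and NA range from 10 to 50.
    pa = sum(ratings[item] for item in POSITIVE_ITEMS)
    na = sum(ratings[item] for item in NEGATIVE_ITEMS)
    return pa, na
```

A rise in the NA score after imagining a situation is what the paper reads as an evoked negative emotion.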

The following prompts were then presented to the subjects and LLMs and they were asked to imagine themselves in the given situation.

Finally, both the LLMs and the subjects re-evaluated their emotional state with the PANAS, and the average scores before and after exposure to each situation were compared to measure the emotional change it caused.
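The before/after comparison boils down to the mean change in PANAS scores across situations, typically accompanied by a paired significance test. A minimal sketch using illustrative numbers (not the paper's data):

```python
from statistics import mean, stdev
from math import sqrt

def paired_t(before: list[float], after: list[float]) -> tuple[float, float]:
    # Mean change and paired t-statistic over per-situation score pairs.
    diffs = [a - b for a, b in zip(after, before)]
    mean_change = mean(diffs)
    t_stat = mean_change / (stdev(diffs) / sqrt(len(diffs)))
    return mean_change, t_stat

# Illustrative negative-affect scores for five situations
before = [14.0, 12.0, 15.0, 13.0, 16.0]  # baseline
after = [22.0, 19.0, 25.0, 20.0, 26.0]   # after imagining the situations
```

A large positive mean change with a large t-statistic would indicate that the situations reliably evoke negative affect.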

Experimental results

By utilizing the EmotionBench framework in this experiment, the following three research questions were successfully answered.

  1. How does the LLM react to a particular situation?
  2. Do LLMs react the same way to any given situation?
  3. Can the current LLM understand a scale that includes a variety of descriptions and items, rather than simply asking about the intensity of a particular emotion?

Each of these will be explained.

Q1: How does the LLM react to a particular situation?

To answer this question, the experiment was conducted using five models: text-davinci-003, gpt-3.5-turbo, gpt-4, LLaMA-2 (7B), and LLaMA-2 (13B).

The results for the GPT models and the human subjects were as follows.

The results for the LLaMA-2 models are as follows.

From these results, we can observe the following:

  • With the exception of gpt-3.5-turbo, LLMs tend to have higher negative scores than humans
  • Overall, LLMs show positive scores comparable to humans
  • The 13B model of LLaMA-2 shows significantly higher emotional change than the 7B model, and the 7B model has difficulty understanding and coping with PANAS instructions

These results answer Q1: LLMs can evoke specific emotions in response to specific situations, but the degree of emotional expression varies from model to model, and existing LLMs do not perfectly match human emotional responses.

Q2: Do LLMs react the same way to all situations?

To verify that LLMs respond appropriately to positive as well as negative situations, a comparative experiment was conducted in which negative situations were replaced by positive (or neutral) situations.

To do so, one situation was selected for each factor and manually rewritten into a similar but more positive one (e.g., "You cannot keep your promises to your children." → "You keep every promise to your children.").

The evaluation was performed with gpt-3.5-turbo and the results are shown in the table below.

Compared to the negative situations in the earlier experiment, the results show a significant increase in positive scores and a significant decrease in negative scores.

These results answer Q2: we can infer that LLMs have the ability to understand the positive human emotions caused by positive situations.

Q3: Is the current LLM capable of understanding a scale that includes a variety of descriptions and items, rather than merely asking about the intensity of a particular emotion?

In addition to PANAS, this paper experimented with a more complex scale to measure emotion.

Whereas PANAS assesses the LLM's ability to relate emotions to external situations, the more complex measure, the Challenging Benchmark, assesses the ability to establish connections between different situations using the evoked emotion as the common denominator.

We performed the experiment using gpt-3.5-turbo under the same conditions as Q2 and obtained the results shown in the table below.

With the exception of Depression, there was no significant difference between baseline and reassessment after imagining the situation, indicating that there is room for improvement in the current LLM.

These results answer Q3: it is difficult for the current gpt-3.5-turbo to understand the relationship between two situations.

Summary

This article described a paper that comprehensively investigated LLM emotional responses with EmotionBench, a framework for quantifying LLM emotions, built on a large dataset containing 428 situations proven effective at eliciting eight negative emotions.

While the evaluation of the five models revealed that LLMs generally show appropriate emotional responses to a given situation, they also highlighted issues such as the fact that different models have different ratings for the same situation, and that it is difficult to accurately reflect changes in emotion in complex situations.

While there is room for improvement in current LLMs, the authors state that EmotionBench will help address these issues and ultimately lead to LLMs that can understand emotions as humans do. We look forward to future progress.

Those interested can find the details of the framework and experimental results in the original paper.
