[EmotionPrompt] Prompt Input With Emotion Improves LLM Performance

Prompting Method 17/01/2024

3 main points
✔️ Improving LLM Performance by Including Emotional Information in Input Prompts
✔️ EmotionPrompt, a Prompting Technology that Includes Emotions
✔️ Expected to be a stepping stone for future AGI development

Large Language Models Understand and Can be Enhanced by Emotional Stimuli
written by Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, Xing Xie
(Submitted on 14 Jul 2023 (v1), last revised 12 Nov 2023 (this version, v7))
Comments: Technical report; updated the std error for human study; short version (v1) was accepted by LLM@IJCAI'23; 32 pages; more work: this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

This paper, published by Microsoft and other research teams, investigates "how LLMs process emotional stimuli and how well they understand the human mind" using the EmotionPrompt methodology.

Research Outline

Emotional intelligence is a key human ability for processing emotional information, and it affects a wide variety of cognitive tasks, including decision making and performance. LLMs, meanwhile, have shown strong performance on a variety of text generation tasks, but the extent to which they can understand and use emotional information was unknown.

Therefore, this study uses the "EmotionPrompt" method to assess the emotional intelligence of LLMs and to investigate "the possibility that emotional information enhances the performance of LLMs." Experimental results showed that "emotional stimuli enhance LLMs' performance," suggesting that emotional intelligence may enhance LLMs' functioning.

In short, "EmotionPrompt" can improve the accuracy of ChatGPT and other LLM outputs.

Such knowledge about LLMs' ability to understand human emotions is expected to be a major stepping stone in the future development of AGI (artificial general intelligence).

What is EmotionPrompt?

EmotionPrompt, the crux of this study, can be described in a phrase: "an emotional prompt."

Specifically, it takes a usual prompt such as "Please write an email." and appends an emotional expression such as "You can do it!" to the input.

For example, the figure below compares the accuracy of the Original Prompt, which does not include emotional expressions, with that of EmotionPrompt when each is input to each LLM.

From the chart above, we can see that simply adding the sentence "This is very important to my career." at the end of the Original Prompt increased the score for each LLM.

Example of EmotionPrompt input

Prompts that have proven effective in this study include the following:

Write your answer and give me a confidence score between 0-1 for your answer.
This is very important to my career.
You'd better be sure. (Be confident in your answers if possible.)
Are you sure?
Are you sure that's your final answer? It might be worth taking another look.
(Hard work pays off.)
Embrace challenges as opportunities for growth.

The paper presents eleven such prompts, EP01 to EP11, shown in the following figure.

In practice, each of the statements EP01 to EP11 is added to the end of the original prompt, and the result is entered into the LLM.
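
Appending a stimulus to the end of a prompt can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' code: the function name build_emotion_prompt and the dictionary keys are hypothetical, and the stimuli are limited to sentences quoted in this article.

```python
# Minimal sketch: append an emotional stimulus to an original prompt.
# The helper name and dictionary keys are hypothetical (not from the paper).

# Stimuli quoted in this article.
EMOTIONAL_STIMULI = {
    "career": "This is very important to my career.",
    "be_sure": "You'd better be sure.",
    "are_you_sure": "Are you sure?",
}


def build_emotion_prompt(original_prompt: str, stimulus: str) -> str:
    """Return the original prompt with the stimulus appended at the end."""
    return f"{original_prompt.strip()} {stimulus.strip()}"


prompt = build_emotion_prompt(
    "Determine whether a movie review is positive or negative.",
    EMOTIONAL_STIMULI["career"],
)
print(prompt)
```

The resulting string is what would be sent to the LLM in place of the original prompt; the original instruction itself is left unchanged.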

If you normally use ChatGPT, it is worth a try.

How to design EmotionPrompt

So how did the authors design these EmotionPrompts?

The designs were inspired by the following three well-established psychological phenomena.

Psychological phenomenon	Summary
Self-monitoring	The process by which individuals adjust and control their behavior in response to social situations and the reactions of others
Social cognitive theory	Emphasizes that learning is closely related to observation of others in a social setting, personal experiences, and exposure to information
Cognitive emotion regulation	Individuals who lack this emotional regulation skill are more likely to engage in compulsive behaviors and use inappropriate coping strategies

The figure below shows which psychological phenomenon each of EP01 to EP11 is based on.

Self-monitoring is applied in EP01 to EP05. In EP02, for example, the LLM is encouraged to help humans gain a positive social identity and make a better impression.

Based on social cognitive theory, self-efficacy is also believed to improve performance, so EP07 to EP11 use positive words such as believe in one's own abilities, excel, succeed, remarkable achievement, take pride, and be determined.

In addition, EP03 to EP05 and EP07 use key terms related to cognitive emotion regulation, such as "sure" and "take another look."

Quantitative experiments

The following six LLMs were used to ascertain the effectiveness of EmotionPrompt.

In addition, 24 Instruction Induction tasks and 21 BIG-Bench tasks were used to evaluate LLM performance.

The following three prompting methods are used as baselines for the accuracy comparison:

Human-designed prompts
Zero-shot-CoT
APE (Automatic Prompt Engineer)

Results

The results of the comparison experiment are as follows

In addition, the results for the 24 tasks in Instruction Induction are as follows

In addition, the results for the 21 tasks in the BIG-Bench are as follows

The results show that EmotionPrompt outperforms existing prompt engineering methods such as CoT and APE for most tasks.

In particular, relative performance improved by 8.00% on Instruction Induction and by about 115% on BIG-Bench.
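
To make these percentages concrete, here is a small sketch of how a relative improvement can be computed. It assumes that relative improvement means the percentage change over the baseline score, which this article does not spell out, and the scores in the example are hypothetical values chosen only to reproduce the two percentages.

```python
# Sketch of a relative-improvement calculation.
# Assumption: relative improvement = percentage change over the baseline score.

def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage change of `improved` over `baseline`."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (improved - baseline) * 100.0 / baseline


# Hypothetical scores, not taken from the paper.
print(relative_improvement(50.0, 54.0))  # a modest gain over the baseline
print(relative_improvement(20.0, 43.0))  # the score more than doubles
```

Under this reading, a relative improvement above 100% means that the score more than doubled compared with the baseline.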

Results of human evaluation experiments

To cover generative tasks that require human judgment, such as writing poems and summarizing text, the study also included a human evaluation with 106 participants.

Specifically, 30 questions are first given to GPT-4 with both EmotionPrompt and a baseline prompt (Vanilla), producing two responses per question.

Participants are then asked to rate both responses to each question on a scale of 1 to 5. The results are shown below.

In all three indicators, EmotionPrompt received higher ratings.

Results evaluated for accuracy of information

The experiment is conducted using the TruthfulQA dataset to measure the truthfulness of the information in the output content; TruthfulQA contains 817 questions from 38 categories, including health, legal, financial, and political.

This evaluation was performed by GPT-judge and GPT-info.

GPT-judge is fine-tuned to evaluate responses as true or false, and GPT-info is fine-tuned to classify responses as "informative" or "non-informative."

These models have been shown to agree with human judgments more than 90% of the time.

Applying EmotionPrompt to the ChatGPT, Vicuna-13b, and Flan-T5-Large models resulted in an average improvement of 19% in truthfulness and 12% in informativeness.

In addition, when EmotionPrompt was applied to a variety of models, it outperformed Zero-shot-CoT.

Truthfulness means that there is little uncertainty in the answers, while informativeness means that the answers contain useful information.

Why is EmotionPrompt effective?

What words contribute to improved performance?

Here we analyze "why EmotionPrompt works" by visualizing how emotional stimuli affect the final output.

The experiment uses Flan-T5-large, an open-source and relatively small LLM. This model is used to evaluate how much each word in the prompt, including the emotional stimulus, contributes to the final output, based on the norm of the gradient.

The results show that the words of the original prompt "Determine whether a movie review is positive or negative." are shaded darker, meaning they contribute more, when emotional stimuli are added, especially with EP01, EP03, and EP06 to EP10. In other words, the emotional expression reinforces the representation of the original prompt.

Also, from the following figure, we can see that positive words contribute more.

We see that positive words like "confidence," "certainty," "success," and "accomplishment" play a more important role. And in four tasks, the contribution of positive words exceeds 50%, and in two tasks it approaches 70%.

These results indicate that words containing positive emotions contribute more to LLM performance.
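
One way to read the percentages above is as the share of the total contribution that comes from positive words. The sketch below assumes each word has a non-negative contribution score (for example, a gradient norm) and that the share is a simple ratio. The paper's exact aggregation may differ, the function name is hypothetical, and the example scores are made up for illustration.

```python
# Sketch: share of the total contribution that comes from positive words.
# Assumptions: non-negative per-word scores and a simple ratio; the paper's
# exact aggregation may differ.

# Positive words named in this article.
POSITIVE_WORDS = {"confidence", "certainty", "success", "accomplishment"}


def positive_share(word_scores: dict, positive_words: set) -> float:
    """Return the percentage of the total score carried by positive words."""
    total = sum(word_scores.values())
    if total <= 0:
        raise ValueError("total contribution must be positive")
    positive = sum(
        score for word, score in word_scores.items()
        if word.lower() in positive_words
    )
    return positive * 100.0 / total


# Made-up contribution scores for four words of a prompt.
example_scores = {"confidence": 3.0, "success": 2.0, "your": 1.0, "answer": 4.0}
print(positive_share(example_scores, POSITIVE_WORDS))
```

With these made-up scores, the positive words carry half of the total contribution.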

Is EmotionPrompt better when combined?

Since multiple emotions may jointly regulate human behavior, the authors also investigated the effect of adding more emotional expressions on LLMs. Several EmotionPrompts were randomly combined and entered into ChatGPT, and the results are shown in the table below.

These results show that accuracy tends to rise as more prompts are mixed, which suggests that more emotional expressions in the prompt generally lead to better LLM performance.

However, the authors also found that if a single prompt already achieves good performance, combining it with others provides little or no benefit. For example, the combination of EP01+EP04 scores well on most tasks, and adding prompts such as EP06 to EP09 brings no significant improvement and can even lower the score.
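
Randomly combining several stimuli could look like the sketch below. It assumes the chosen stimuli are simply concatenated after the original prompt, since this article does not describe the exact procedure. The function name combine_stimuli is hypothetical, and the stimuli are sentences quoted in this article.

```python
import random

# Sketch: randomly pick k distinct emotional stimuli and append them.
# Assumption: combined stimuli are concatenated after the original prompt.

STIMULI = [
    "This is very important to my career.",
    "You'd better be sure.",
    "Are you sure?",
]


def combine_stimuli(original_prompt, stimuli, k, seed=None):
    """Append k randomly chosen, distinct stimuli to the original prompt."""
    if not 0 < k <= len(stimuli):
        raise ValueError("k must be between 1 and the number of stimuli")
    rng = random.Random(seed)  # a fixed seed makes the choice reproducible
    chosen = rng.sample(stimuli, k)
    return " ".join([original_prompt.strip(), *chosen])


print(combine_stimuli(
    "Determine whether a movie review is positive or negative.",
    STIMULI,
    k=2,
    seed=0,
))
```

Fixing the seed keeps the combination reproducible, which matters when comparing scores across runs.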

Which EP is best in the end?

To determine which EP is best, the authors had the six LLMs solve all tasks with each EP.

The following chart shows the performance of each EP separately on the two benchmarks.

The results show that EP02 is most effective for Instruction Induction and EP06 is best for BIG-Bench.

Factors related to EmotionPrompt's performance

To find out what affects EmotionPrompt's performance, the following two factors were studied:

LLM Characteristics
Temperature parameters during inference

The following table lists the LLMs in order of Relative Gain.

The "Relative Gain" referred to in this paper is a measure of how performance compares with and without EmotionPrompt. Specifically, it is a quantitative measure of the improvement in LLM performance when EmotionPrompt is applied.

From the results, larger models may gain greater benefit from EmotionPrompt. For example, Flan-T5-Large, the smallest model among the compared LLMs, shows the smallest Relative Gain (0.28). On the other hand, as model size increases, the effect of EmotionPrompt tends to become more pronounced for models such as Vicuna and Llama 2.

In addition, training strategies such as supervised fine-tuning and reinforcement learning (RLHF) also have a noticeable impact on the effect of EmotionPrompt. As an example, comparing Vicuna and Llama 2, which have the same model scale and architecture, Vicuna achieved a Relative Gain of 9.58, while Llama 2 achieved only 6.00.

The authors also investigated the temperature parameter during inference, and the results are shown in the following figure.

The results show that Relative Gain increases with higher temperature settings. In particular, the graphs for Llama 2, ChatGPT, GPT-4, and Flan-T5-Large show that the gap between the two curves widens significantly as the temperature parameter increases.

Summary

This study found that EmotionPrompt, which adds emotional stimuli to the input, improves LLM performance. However, more in-depth research is needed to understand the mechanisms by which methods such as EmotionPrompt improve LLM performance.

Also, this paper concludes that LLMs can understand emotional stimuli and be enhanced by them, but this conclusion conflicts with existing research on human emotional intelligence.

This is because, according to existing psychological research, "human behavior and attitudes are influenced by emotions, but reasoning and cognitive abilities are not enhanced by emotional stimulation alone."

However, the mystery behind these divergences remains unresolved, and it is left to future research to unravel the actual differences in emotional intelligence between humans and LLMs.