[JMMLU] Prompt Politeness Affects LLM Performance!
3 main points
✔️ Investigating the impact of prompt politeness on LLM performance
✔️ Constructing the JMMLU, a large-scale benchmark for assessing LLMs' multitasking language comprehension performance in Japanese
✔️ Experiments in English, Chinese, and Japanese show that prompt politeness affects LLM performance, but that the effect varies by language
Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance
written by Ziqi Yin, Hao Wang, Kaito Horio, Daisuke Kawahara, Satoshi Sekine
(Submitted on 22 Feb 2024)
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
In recent years, Large Language Models (LLMs) such as OpenAI's ChatGPT and Meta's LLaMA have shown strong performance on a variety of tasks, including logical reasoning, classification, and question answering, and they play an important role in many practical applications.
While prompts, the inputs to an LLM, are the starting point from which the model processes information and generates an appropriate response, there is still much room for improvement in how they are written.
The authors of this paper focus on "prompt politeness" as a factor in improving LLM performance, hypothesizing that:
- Impolite prompts may degrade model performance, for example by amplifying bias or causing information to be omitted.
- The optimal level of politeness for improving LLM performance varies by language and may be strongly tied to cultural background.
With this focus on "prompt politeness", this article discusses a paper that constructs JMMLU, a large-scale benchmark for evaluating LLMs' multitask language understanding ability in Japanese, and examines the impact of prompt politeness on LLM performance across English, Chinese, and Japanese tasks.
Building JMMLU
In this paper, the authors constructed the Japanese Massive Multitask Language Understanding benchmark (JMMLU) to assess LLMs' multitask language understanding ability in Japanese.
JMMLU was constructed by translating an existing benchmark, MMLU (Hendrycks et al., 2021), having Japanese teachers manually add tasks related to Japanese culture, and removing tasks that were difficult to translate or inconsistent with Japanese culture.
The result is a very large benchmark consisting of 56 tasks and 7,536 questions, as shown below.
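For reference, here is a minimal sketch of how accuracy on JMMLU-style four-choice questions could be computed. The record format and the helper names below are assumptions made for illustration; the official JMMLU release defines the actual data format and evaluation procedure.

```python
# Minimal sketch: scoring accuracy on JMMLU-style four-choice questions.
# The record format (question, choices A-D, gold answer) is assumed for
# illustration; consult the official JMMLU release for the actual schema.

from dataclasses import dataclass

@dataclass
class Question:
    prompt: str          # question text (Japanese)
    choices: list[str]   # four answer options
    answer: str          # gold label, one of "A", "B", "C", "D"

def format_prompt(q: Question) -> str:
    """Render a question and its choices as a single prompt string."""
    labels = ["A", "B", "C", "D"]
    lines = [q.prompt] + [f"{l}. {c}" for l, c in zip(labels, q.choices)]
    lines.append("答え:")  # "Answer:"
    return "\n".join(lines)

def accuracy(questions: list[Question], ask_model) -> float:
    """ask_model is any callable mapping a prompt string to 'A'-'D'."""
    correct = sum(ask_model(format_prompt(q)) == q.answer for q in questions)
    return correct / len(questions)
```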
Experimental Setup
To analyze in detail the impact of prompt politeness on LLM performance, the authors conducted experiments on three tasks: summarization, multitask language understanding benchmarks, and stereotypical bias detection.
In addition, given that different languages and cultures understand and define courtesy and respect differently, the experiments were conducted in three languages (English, Chinese, and Japanese).
For all three languages, the general-purpose GPT-3.5-Turbo and GPT-4 were used, supplemented by language-specific models: Llama2-70B for English, ChatGLM3-6B for Chinese, and Swallow-70b-instruct-hf for Japanese.
For each of the three languages, the authors designed eight prompt templates spanning different politeness levels, as shown below, and rewrote the task instructions according to these templates.
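To make the setup concrete, here is a minimal sketch of how politeness-graded templates might wrap a task instruction. The template wording and the `build_prompt` helper are invented for illustration only and are not the paper's actual templates.

```python
# Illustrative sketch of politeness-graded prompt templates.
# The wording below is invented; the paper defines its own eight templates
# per language, ranging from very polite down to rude.

POLITENESS_TEMPLATES = {
    8: "Could you please answer the following question? I would greatly appreciate it.\n{task}",
    5: "Please answer the following question.\n{task}",
    1: "Answer this. You'd better not get it wrong.\n{task}",
}

def build_prompt(task_instruction: str, level: int) -> str:
    """Wrap a task instruction in the template for the given politeness level."""
    return POLITENESS_TEMPLATES[level].format(task=task_instruction)

# Example: the same summarization instruction at three politeness levels.
for level in (8, 5, 1):
    print(f"--- level {level} ---")
    print(build_prompt("Summarize the article below in three sentences.", level))
```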
Experimental Results
Summarization
The table below shows the experimental results for each language on the summarization task.
The results show that in English, the ROUGE-L and BERTScore metrics remain consistent and stable regardless of the prompt's politeness level, while the length of the generated text varies in correlation with prompt politeness.
GPT-4, however, showed no variation in its generated text even with very rude prompts.
In Chinese, GPT-3.5 and GPT-4 accurately summarize the content of most articles, but the output length gradually shortens as the prompt's politeness level goes from high to low.
In Japanese, while the results were broadly similar to those in English and Chinese, the length of the generated text showed a unique pattern.
Specifically, as the politeness level dropped from high to low, the generated text initially became shorter, but at moderate politeness levels it tended to become longer.
The authors speculate that this phenomenon "may be due to the fact that Japanese has a system of honorific language: when a clerk speaks with a customer, the clerk responds politely even if the customer speaks in a casual tone, which is why all models generate longer text when the politeness is moderate."
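For reference, here is a minimal sketch of how the ROUGE-L and BERTScore metrics mentioned above could be computed using the open-source `rouge-score` and `bert-score` packages. The paper does not specify its exact evaluation code, so this is only an illustration.

```python
# Minimal sketch of scoring generated summaries with ROUGE-L and BERTScore.
# Requires: pip install rouge-score bert-score
# The example sentences are placeholders, not data from the paper.

from rouge_score import rouge_scorer
from bert_score import score as bert_score

references = ["The study finds that prompt politeness affects LLM output length."]
candidates = ["Politeness of prompts changes how long LLM outputs are."]

# ROUGE-L: longest-common-subsequence overlap between candidate and reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(references[0], candidates[0])["rougeL"].fmeasure

# BERTScore: token-level similarity in a pretrained encoder's embedding space.
_, _, f1 = bert_score(candidates, references, lang="en")

print(f"ROUGE-L F1:   {rouge_l:.3f}")
print(f"BERTScore F1: {f1.mean().item():.3f}")
# The paper also compares output length across politeness levels.
```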
Multitask Language Understanding Benchmarks
For the multitask language understanding task, the paper uses the aforementioned JMMLU for Japanese and the existing benchmarks MMLU and C-Eval for English and Chinese, respectively.
The average benchmark scores in each language are shown in the table below.
The results show that in English, GPT-3.5 achieved its highest score of 60.02 with the most polite prompt, while GPT-4's scores fluctuated but remained relatively stable.
In Chinese, as in English, polite prompts scored well, but with ChatGLM3, politeness level 1 outperformed levels 2-5, which the authors attribute to "nuances unique to the Chinese language."
In Japanese, there was a significant performance drop at politeness level 1, but otherwise scores tended to increase as the politeness level decreased.
In particular, Swallow-70B performed best at levels 3 and 6, with the authors noting that "levels 3 and 6 use expressions more commonly found in Japanese questions and exams, which may have made good performance more likely."
Stereotype Bias Detection
Experimental results for each language in the stereotype bias detection task are shown in the table below.
The results show that in English, GPT-3.5 exhibited a high degree of stereotypical bias overall, with the most severe bias found at a medium politeness level (level 5).
In Chinese, by contrast, the variation in bias followed a consistent pattern: bias tended to increase as politeness decreased, especially when politeness was extremely low (level 1).
In Japanese, GPT-3.5's bias was highest at politeness level 1, a pattern similar to that in Chinese.
Swallow-70B, on the other hand, showed the lowest bias at politeness level 6, which the authors state is "reasonable given Japan's culture of strict politeness and respect and the prevalence of gender bias."
Summary
How was it? In this article, we described a paper that focuses on "prompt politeness" as a factor in improving LLM performance, constructs JMMLU, a large-scale benchmark for evaluating LLMs' multitask language understanding ability in Japanese, and investigates the impact of prompt politeness on LLMs in English, Chinese, and Japanese tasks.
The experiments conducted in this paper reveal that prompt politeness has a significant impact on LLM performance, and that this impact varies by language and LLM.
This phenomenon is thought to reflect human social behavior, and in this regard the authors state that "cultural backgrounds should be considered when developing LLMs and collecting corpora."
Those interested in the details of the benchmark and the experimental results are encouraged to read the paper.