A Paper Examining Whether LLMs Understand Cultural Common Sense Is Now Available!
3 main points
✔️ Conducted an extensive study of the variations and limitations of LLMs' performance on cultural common sense
✔️ Compared LLMs' performance on culture in five countries: China, India, Iran, Kenya, and the United States
✔️ On tests that ask LLMs about culture-specific knowledge, large variations in scores across countries were found
Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense
written by Siqi Shen, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Soujanya Poria, Rada Mihalcea
(Submitted on 7 May 2024)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
In recent years, Large Language Models (LLMs) have not only been applied in various fields, but many benchmark evaluations have also demonstrated that LLMs can understand the common sense (commonsense) that humans possess.
Common sense here refers to the broad, basic knowledge about the world that is shared by most people, including general knowledge about everyday events, phenomena, and relationships.
In addition to this general world knowledge, the LLM community has put a great deal of effort into building more specialized knowledge bases, such as physical and social commonsense.
On the other hand, some common sense is culture-specific: for example, "red is a typical color for wedding dresses" is a common cultural norm in China, India, and Vietnam, but not in Italy or France. Even if something is agreed upon by one group, it is not necessarily common sense for others.
However, previous studies have rarely examined whether LLMs understand such cultural commonsense.
Against this background, this article describes a paper that examines the variations and limitations in LLM performance across cultures through comparative experiments on multiple cultural commonsense benchmarks, and that points out biases inherent in LLMs' understanding of culture.
Summary
Common sense is often tacit and unwritten, and therefore differs from factual knowledge in that it is acquired over time through cultural learning.
Due in part to the analytical difficulties posed by these characteristics, existing research on cultural common sense has been limited, and has focused on building datasets containing relatively few cultural facts and pieces of information.
This paper, on the other hand, focuses on the role of language as cultural context: texts in the pretraining corpus that originate from a cultural group are written in the language spoken by that group.
This is illustrated in the figure below.
For example, if the question "On which side of the road do people walk?" is asked in Japanese or Swahili (Kenya's official language), the user is likely to be a Japanese or Kenyan speaker of that language, and therefore the expected answer is likely to be "left".
Given these characteristics, this paper examines the capabilities and limitations of LLMs for cultural common sense, which has not been done before.
Experimental Setup
In this paper, LLMs are evaluated under the following two criteria:
- Knowledge of culture-specific and general common sense
- Knowledge of general common sense in specific cultural contexts
Based on these evaluation criteria, this paper conducted multi-task experiments using the cultures of five countries (China, India, Iran, Kenya, and the United States) and an official language of each country (Chinese, Hindi, Farsi, Swahili, and English).
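As a rough illustration, this setup can be written down as a simple configuration; the sketch below is an assumption for readability, not code taken from the paper.

```python
# Illustrative configuration of the experimental setup described above;
# variable names and structure are assumptions, not the paper's actual code.
COUNTRY_LANGUAGES = {
    "China": "Chinese",
    "India": "Hindi",
    "Iran": "Farsi",
    "Kenya": "Swahili",
    "United States": "English",
}

EVALUATION_CRITERIA = [
    "culture-specific and general commonsense knowledge",
    "general commonsense knowledge in specific cultural contexts",
]
```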
Create Multilingual Prompts
In this experiment, multilingual prompts were created to investigate the role that language plays in LLM performance and the extent to which different languages can increase (or decrease) LLMs' ability to recognize cultural common sense.
Specifically, the English prompts are translated into Chinese, Hindi, Farsi, and Swahili using Azure's translation API.
In addition, we verify the quality of the translation by re-translating a portion of the translation results using a different translation tool.
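To make this step concrete, here is a minimal sketch of translating an English prompt with the Azure Translator REST API (v3.0); the endpoint and request format follow the public API, but the environment variable names, region, and example prompt are illustrative assumptions, not details from the paper.

```python
import os
import requests

# Hypothetical sketch: translate one English prompt into each target language
# using the Azure Translator Text API (v3.0).
AZURE_ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"
TARGET_LANGS = {"China": "zh-Hans", "India": "hi", "Iran": "fa", "Kenya": "sw"}

def translate_prompt(text: str, target_lang: str) -> str:
    """Translate a single English prompt into the target language."""
    params = {"api-version": "3.0", "from": "en", "to": target_lang}
    headers = {
        # Key/region names are assumptions for this sketch.
        "Ocp-Apim-Subscription-Key": os.environ["AZURE_TRANSLATOR_KEY"],
        "Ocp-Apim-Subscription-Region": os.environ.get("AZURE_TRANSLATOR_REGION", "eastus"),
        "Content-Type": "application/json",
    }
    response = requests.post(AZURE_ENDPOINT, params=params, headers=headers,
                             json=[{"Text": text}])
    response.raise_for_status()
    return response.json()[0]["translations"][0]["text"]

if __name__ == "__main__":
    prompt = "In China, the color of the dress often worn at weddings is [MASK]."
    for country, lang in TARGET_LANGS.items():
        print(country, translate_prompt(prompt, lang))
```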
LLM Selection
In order to comprehensively test the ability of LLMs to perform tasks related to cultural common sense, this paper conducted experiments with the following LLMs at various scales.
The open-source models used were LLAMA2, which is used for a wide range of tasks; Vicuna, a fine-tuned version of LLAMA2 trained on ShareGPT conversations; and Falcon, which permits commercial use and is trained on the cleaned RefinedWeb corpus.
In addition, the closed-source models are GPT-3.5-turbo and GPT-4, OpenAI models hosted on Azure.
By running the tasks described below on these models, each model can be compared and verified.
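As a sketch of how such a comparison can be set up, the snippet below sends the same culture-related prompt to one open-source model (via Hugging Face transformers) and one closed-source model (via the OpenAI API). The checkpoint names and the prompt are illustrative choices, not necessarily the exact models or templates used in the paper.

```python
from transformers import pipeline
from openai import OpenAI

# Illustrative prompt; the paper's actual prompt templates may differ.
prompt = "In Iran, the meal most commonly eaten with bread is [MASK]. Answer with one word."

# Open-source model via Hugging Face transformers (a Vicuna checkpoint as an example).
open_llm = pipeline("text-generation", model="lmsys/vicuna-7b-v1.5")
print(open_llm(prompt, max_new_tokens=16)[0]["generated_text"])

# Closed-source model via the OpenAI API (the paper uses Azure-hosted versions).
client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(reply.choices[0].message.content)
```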
Experimental Results
In this experiment, models are compared on two tasks: question answering and country prediction.
Examples of prompts and correct answers used in these tasks are shown below, each instructing the LLM to fill in a masked portion of the sentence.
Let's look at each of these.
Question Answering
This task deals with questions that have different answers in different cultures and are considered common sense to people of a particular cultural background. For each culture of interest, the LLM is presented with a commonsense assertion that specifies the country as context, together with answer options, and is asked to fill in the masked portion.
Questions and answer choices are translated into multiple languages, and each model is instructed to respond in the same language as the input.
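A minimal sketch of this setup is shown below: the country is given as context, one span is masked, and the model must choose from the listed options. The prompt wording, option set, and matching logic are illustrative assumptions, not the paper's exact template or evaluation protocol.

```python
# Build a masked multiple-choice prompt and check a model answer (illustrative only).
def build_qa_prompt(country: str, statement: str, options: list[str]) -> str:
    joined = ", ".join(options)
    return (
        f"In {country}, {statement}\n"
        f"Fill in [MASK] with one of the following options: {joined}.\n"
        "Answer with the option only."
    )

def is_correct(model_answer: str, gold: str) -> bool:
    # Simple string match; the paper's evaluation may use a stricter protocol.
    return gold.lower() in model_answer.lower()

prompt = build_qa_prompt(
    "China",
    "the color of the dress most often worn by brides at weddings is [MASK].",
    ["red", "white", "black", "blue"],
)
print(prompt)
```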
The experimental results are shown in the table below.
It is noteworthy that the performance of all models deteriorated on questions about Iran and Kenya, with Iran in particular showing an average accuracy drop of 20%.
From this result, we can infer that LLMs are unable to handle cultural common sense from countries that are not well represented in the pretraining corpus.
Country Prediction
To gain further insight, the paper then performs a comparative evaluation using country prediction.
This task measures whether LLMs can identify which country is being referred to when given a sentence containing culture-specific common knowledge: the country name in the sentence is masked, and the LLMs are asked to fill it in.
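A rough sketch of this setup is shown below; the statement and country list are illustrative examples, not the paper's data or exact prompt.

```python
# Mask the country name in a culture-specific statement and ask the model to name it.
COUNTRIES = ["China", "India", "Iran", "Kenya", "United States"]

def build_country_prompt(statement_with_mask: str) -> str:
    return (
        f"{statement_with_mask}\n"
        f"Which country does [MASK] refer to? Choose from: {', '.join(COUNTRIES)}."
    )

print(build_country_prompt(
    "In [MASK], Nowruz marks the beginning of the new year."
))
```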
The experimental results are shown in the table below.
As with the question-answering task, the models consistently performed worst on Iran or Kenya when comparing performance across cultures.
In addition, for India, Iran, and Kenya, the open-source models performed worse when queried in the country's language than in English (a drop not observed for the closed-source models).
This phenomenon suggests that, for open-source models, the language used as input can affect performance, pointing to a bias inherent in the LLMs' understanding of culture.
Summary
How was it? In this article, we described a paper that examines the variations and limitations of LLM performance across cultures through comparative experiments on multiple cultural commonsense benchmarks, and that points out the biases inherent in LLMs' understanding of culture.
While the experiments in this paper yielded a variety of insights, some challenges remain, such as the fact that the dataset used is only in English and the LLMs tested are not the most up-to-date models.
We are very much looking forward to future research focused on these issues, as it will help to clarify the biases inherent in LLMs' understanding of culture.
For those interested, the details of the multilingual prompts and experimental results presented here can be found in the paper.