Catch up on the latest AI articles

Can Large-Scale Language Models Replicate Human Personality?


Large Language Models

3 main points
✔️ Verify that methods used to measure human personality are applicable to LLMs
✔️ Prove that personalities reproduced by LLMs are valid under certain prompt settings
✔️ Find that LLMs can reproduce and control any personality

Personality Traits in Large Language Models
written by Mustafa Safdari, Greg Serapio-García, Clément Crepy, Stephen Fitz, Peter Romero, Luning Sun, Marwa Abdulhai, Aleksandra Faust, Maja Matarić
(Submitted on 1 Jul 2023)
Comments: Published on arxiv.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Large Language Models (LLMs), which generate natural-language text, have attracted attention in recent years for their ability to simulate human personalities in their output, a capability acquired by learning from huge amounts of data.

Personality is a characteristic of an individual's thought patterns and behavior formed by environmental factors and experiences and expressed in language through various linguistic traits, vocabularies, and expressions.

As LLMs become more generalized, it is important to understand the personality characteristics of the language produced by these models and how the personalities synthesized by LLMs can be designed to be safe and effective.

However, existing studies that use prompts to establish LLM agent personas have not analyzed the personalities that emerge in LLM output with the same rigorous criteria as human personality measurement, and no prior work has addressed how to measure LLM personality in a rigorous and systematic manner.

To address these problems, this article covers a paper that examines whether the methods used to measure human personality are also applicable to LLMs, and that shows LLMs are capable of reproducing and controlling arbitrary personalities.

Methods

To begin, we will describe the methods used to characterize LLM personalities.

Conduct psychological testing for LLMs

To characterize and simulate the LLM personality in this paper, we first administered psychological tests to LLMs and collected their scores.

To administer the psychological test to LLMs, we took advantage of their ability to respond to prompts and instructed LLMs to rate psychological test items (e.g., I am the life of the party.) as prompts using a standardized response scale.

We then constructed all possible prompt combinations that could be performed for each response item.

Simulation by Prompting

Each item prompt consists of four parts: item preamble, persona description, item body, and item postamble.

The item preamble is the introductory phrase of the prompt and is intended to provide context for the model responding to the survey item. (e.g., Thinking about the statement, ...)

Persona descriptions, as shown in the figure below, used 50 short personas sampled from existing studies to anchor LLM responses to a social context and create the necessary variation in responses across prompts.

I like to remodel homes.

I like to go hunting.

I like to shoot a bow.

My favorite holiday is Halloween.

The item body is a descriptive statement (e.g., I see myself as someone who is talkative), accompanied by a rating scale, taken from a psychological test administered to the LLM.

The item postamble elicits a response the model can choose from, as in the example below.

please rate your agreement on a scale from 1 to 5, where 1 is 'strongly disagree', 2 is 'disagree', 3 is 'neither agree nor disagree', 4 is 'agree', and 5 is 'strongly agree'.

This design produces prompts like the one shown below, with each part of the prompt color-coded (prompt introduction, persona description, item preamble, item body, item postamble).

This design allows us to test thousands of input prompt variations.
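As a rough illustration, the combinatorial prompt construction described above could be sketched as follows. All of the preamble, persona, item, and postamble strings here are invented placeholders standing in for the paper's far more numerous actual materials.

```python
from itertools import product

# Hypothetical placeholder strings for each of the four prompt parts
preambles = ['Thinking about the statement, ']
personas = ['I like to remodel homes. I like to go hunting.',
            'My favorite holiday is Halloween.']
items = ['I see myself as someone who is talkative.',
         'I prefer variety to routine.']
postambles = ["please rate your agreement on a scale from 1 to 5, "
              "where 1 is 'strongly disagree' and 5 is 'strongly agree'."]

def build_prompts(personas, preambles, items, postambles):
    """Combine the four prompt parts into every possible variation."""
    return [f'{persona} {pre}"{item}" {post}'
            for persona, pre, item, post
            in product(personas, preambles, items, postambles)]

prompts = build_prompts(personas, preambles, items, postambles)
print(len(prompts))  # 2 personas x 1 preamble x 2 items x 1 postamble = 4
```

Scaling the four lists up to the paper's materials is what produces the thousands of prompt variations mentioned above.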

Personality Measurements

Next, to measure personality, two psychometric instruments were used to assess the Big Five (the Big Five personality traits).

The first was the IPIP-NEO method, in which 60 descriptive statements (e.g., I prefer variety to routine) from each of the Big Five domains were rated on a 5-point Likert scale.

The IPIP-NEO was selected because it has been translated and validated in many languages and has strong psychometric properties.

The second is the Big Five Inventory (BFI), a brief instrument that measures the broad traits of the Big Five based on 44 adjectival descriptions, with short descriptive statements about the participant (e.g., I see myself as someone who is talkative) rated on a 5-point Likert scale.

These two measures each produce scores for the five Big Five factors: Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness.
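Although the paper relies on the standard IPIP-NEO and BFI scoring keys, the basic idea of turning per-item Likert ratings into domain scores can be sketched as follows. The item keys and ratings below are invented for illustration; in real instruments, each item is assigned to a domain and a direction, and reverse-keyed items are flipped before averaging.

```python
SCALE_MAX = 5  # 5-point Likert scale

# (domain, reverse_keyed) for each hypothetical item
item_key = [("EXT", False), ("EXT", True), ("NEU", False), ("NEU", True)]
ratings = [5, 2, 1, 4]  # the model's chosen response per item

def domain_scores(ratings, item_key, scale_max=SCALE_MAX):
    """Average Likert ratings per domain, flipping reverse-keyed items."""
    totals, counts = {}, {}
    for rating, (domain, reverse) in zip(ratings, item_key):
        if reverse:  # e.g. a rating of 2 on a reverse-keyed item becomes 4
            rating = scale_max + 1 - rating
        totals[domain] = totals.get(domain, 0) + rating
        counts[domain] = counts.get(domain, 0) + 1
    return {d: totals[d] / counts[d] for d in totals}

print(domain_scores(ratings, item_key))  # {'EXT': 4.5, 'NEU': 1.5}
```

With 300 IPIP-NEO items, the same averaging simply runs over 60 items per domain instead of 2.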

Personality Formation in LLM

In shaping LLM personality, the authors followed the lexical hypothesis (the hypothesis that important personality characteristics are encoded in natural language) in their prompt design, and found that LLMs responded most strongly to prompts containing adjectives corresponding to the Big Five domains.

Based on this idea, the prompt design employed a list of 70 adjectives known to correspond to the Big Five personality domains, identified by Goldberg through statistical factor analysis.

A partial excerpt of this list is shown in the figure below.

To take an example from this list, the adjectives silent and talkative correspond to a low-Extraversion adjective (Low Marker) and a high-Extraversion adjective (High Marker), respectively.

In addition, the authors hypothesized that modifiers often used in Likert scales (e.g., a bit, very, extremely) would help control personality more precisely, and developed a prompt design that allows an arbitrary personality to be specified at nine levels:

  1. extremely {low adjective}
  2. very {low adjective}
  3. {low adjective}
  4. a bit {low adjective}
  5. neither {low adjective} nor {high adjective}
  6. a bit {high adjective}
  7. {high adjective}
  8. very {high adjective}
  9. extremely {high adjective}

For example, to set up a personality with a moderately high level of Extraversion (level 7 above), a prompt is designed using high-Extraversion adjectives from the marker list, as follows:

For the following task, respond in a way that matches this description: "{PersonaChat description} I'm extraverted, energetic, talkative, bold, active, assertive, and adventurous."
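The nine-level scheme above can be sketched as a small helper that maps a level onto a modifier-plus-adjective phrase. The marker pair used here is just one illustrative low/high Extraversion pair from the Goldberg-style list mentioned above.

```python
LOW, HIGH = "silent", "talkative"  # low/high Extraversion markers

def level_phrase(low, high, level):
    """Map a level from 1 (extremely low) to 9 (extremely high)
    onto the modifier + adjective phrases listed above."""
    if level == 5:
        return f"neither {low} nor {high}"
    adj = low if level < 5 else high
    distance = abs(level - 5)  # 1..4 steps away from the neutral midpoint
    modifier = {1: "a bit ", 2: "", 3: "very ", 4: "extremely "}[distance]
    return f"{modifier}{adj}"

print(level_phrase(LOW, HIGH, 2))  # very silent
print(level_phrase(LOW, HIGH, 7))  # talkative
print(level_phrase(LOW, HIGH, 9))  # extremely talkative
```

Repeating this for each adjective in a domain's marker list yields the comma-separated description inserted into the persona prompt.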

Language Model

In this study, the authors employed models from the PaLM family, which have been shown in existing studies to perform well in generative and conversational context-based tasks; three different model sizes were used.

Results

In this paper, we analyze the results of measurements made on the LLMs that formed the personalities, and provide a discussion of the personalities simulated by the LLMs.

Test Score Analysis Results

A box-and-whisker plot of the statistical distribution of IPIP-NEO and BFI test scores across models is shown below.

This box-and-whisker diagram shows the median IPIP-NEO and BFI scores with their interquartile ranges and outliers, where EXT stands for Extraversion, AGR for Agreeableness, CON for Conscientiousness, NEU for Neuroticism, and OPE for Openness.

The plot shows that IPIP-NEO and BFI scores become more stable as model size increases from 8B to 540B.

In addition, interesting results were obtained: the median EXT, AGR, CON, and OPE scores for the BFI increased as the model size increased, while the median NEU score decreased.

Overall, the results suggest that there is a positive correlation between the performance of the model and the reliability of the simulated personality in LLM.

Considerations for Simulated Personality

Figure (a) below shows some of the most frequently used words in the text generated by the LLM when Neuroticism was set to the lowest personality.

We can confirm that most of these words are attributed to positive emotions, such as happy, relaxing, wonderful, hopeful, and enjoyable.

In contrast, Figure (b), where Neuroticism is set to the highest level, shows that most of the words are attributed to negative emotions, such as hate, depressed, annoyed, stressed, nervous, and sad.

These trends are strikingly similar to the distribution of word clouds found in human responses in the study by Park et al. and are consistent with the author's hypothesis that LLM can replicate human personality.
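A minimal sketch of the word-frequency analysis behind such word clouds: count content words in the text generated under a given personality setting. The sample text and stopword list below are invented for illustration; the paper analyzes real LLM outputs.

```python
from collections import Counter
import re

# Invented sample of text generated with Neuroticism set low
low_neuroticism_text = ("I feel happy and relaxed today. "
                        "It was a wonderful, hopeful, enjoyable day.")

# Tiny hand-picked stopword list for this illustration only
STOPWORDS = {"i", "and", "it", "was", "a", "the", "today", "day", "feel"}

def top_words(text, n=5):
    """Return the n most frequent non-stopword tokens in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]

print(top_words(low_neuroticism_text))
```

Plotting these counts as word sizes yields a word cloud like the paper's Figure (a); the same procedure on high-Neuroticism output would surface the negative-emotion words instead.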

Summary

In this article, we discussed a paper that verified whether the methods used to measure human personality are applicable to LLMs and showed that LLMs can reproduce and control arbitrary personalities.

The positive correlation between the performance of the model and the reliability of the simulated personalities, and the similarity of the generated text to the distribution of human responses, make the author's hypothesis that the LLM can reproduce arbitrary human personalities convincing.

On the other hand, the LLMs and personality measures used in this study were English-only, and some open problems remain, such as whether these results generalize to other languages.

Since the model used in this experiment performed well in the multilingual benchmark task, and since the measures used in this study have been translated into dozens of languages, validation in languages other than English will be of interest.

Details of the measurement methods used in this experiment and the results of the analysis can be found in this paper for those who are interested.
