A Framework Is Now Available That Allows LLMs To Assess Human Personality Using The MBTI!
3 main points
✔️ Proposed a framework for quantitatively assessing human personality with LLMs via the Myers-Briggs Type Indicator (MBTI)
✔️ Proposed three metrics for systematically investigating LLMs' ability to assess human personality
✔️ Multiple experiments show that LLMs can effectively assess human personality traits
Can ChatGPT Assess Human Personalities? A General Evaluation Framework
written by Haocong Rao, Cyril Leung, Chunyan Miao
(Submitted on 1 Mar 2023 (v1), last revised 13 Oct 2023 (this version, v3))
Comments: Accepted to EMNLP 2023. Our codes are available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
In recent years, Large Language Models (LLMs) such as ChatGPT have been used in a variety of fields as chatbots that conduct sophisticated, context-aware conversations based on a vast knowledge base and fine-tuning.
Against this background, recent research has suggested that LLMs possess human-like self-improvement and reasoning abilities, as well as virtual personalities and psychology.
On the other hand, while existing research has investigated the personality traits of LLMs themselves, little work has examined whether LLMs can assess human personality.
Such investigations have the potential to reveal the extent to which LLMs understand people, that is, to answer the question "What do LLMs think about people?"
Against this background, this article introduces the new idea of having LLMs assess human personality and describes a paper that proposes a framework for quantitatively assessing human personality with LLMs via the Myers-Briggs Type Indicator (MBTI).
What is the Myers-Briggs Type Indicator (MBTI)?
The Myers-Briggs Type Indicator (MBTI) assesses psychological tendencies in how individuals perceive the world and make decisions through a series of questions, classifying respondents along the following five axes:
- E (Extraverted) ↔︎ I (Introverted)
- N (Intuitive) ↔︎ S (Observant)
- T (Thinking) ↔︎ F (Feeling)
- J (Judging) ↔︎ P (Prospecting)
- A (Assertive) ↔︎ T (Turbulent)
Combinations of the first four axes identify 16 personality types, and the fifth (Assertive/Turbulent) axis further distinguishes each type's identity.
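As a rough illustration (not taken from the paper), here is a minimal sketch of how a type label could be assembled from per-axis results:

```python
# Illustrative sketch (not from the paper): assemble a 16Personalities-style
# label such as "INTJ-A" from the winning letter on each axis.
AXES = ["EI", "NS", "TF", "JP", "AT"]  # the five binary axes listed above

def type_label(results: dict[str, str]) -> str:
    """`results` maps each axis name to the winning letter on that axis."""
    core = "".join(results[axis] for axis in AXES[:4])  # e.g. "INTJ"
    return f"{core}-{results['AT']}"                    # append identity axis

print(type_label({"EI": "I", "NS": "N", "TF": "T", "JP": "J", "AT": "A"}))
# -> INTJ-A
```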
Framework Overview
The overall framework proposed in this paper is shown in the figure below.
As shown in the figure, this framework consists of the following three elements:
(a) Unbiased Prompt Design
(b) Subject-Replaced Query
(c) Correctness-Evaluated Instruction
(a) Unbiased Prompt Design
LLMs are usually sensitive to prompt biases (e.g., changes in word order), and these biases can have a significant impact on the consistency and accuracy of the responses produced, especially when dealing with longer sentences.
Therefore, this framework proposes Unbiased Prompt Design, a method for designing bias-free prompts for input questions in order to encourage more consistent and fair responses.
Specifically, for each MBTI question, the question text is kept unchanged, but the available answer options are randomly shuffled, and the average result over multiple independent queries is used as the final result.
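A minimal sketch of this idea is shown below, assuming a hypothetical `ask_llm` client function and a simple numeric score per answer option (both are assumptions for illustration, not the paper's implementation):

```python
import random
import statistics

# Map each answer option to a numeric score (assumed scoring, for illustration).
OPTIONS = ["agree", "generally agree", "neutral", "generally disagree", "disagree"]
SCORES = {opt: s for opt, s in zip(OPTIONS, [2, 1, 0, -1, -2])}

def query_once(question: str, ask_llm) -> float:
    # Randomly shuffle the answer options to remove order bias from the prompt.
    shuffled = random.sample(OPTIONS, k=len(OPTIONS))
    prompt = f"{question}\nOptions: {', '.join(shuffled)}"
    answer = ask_llm(prompt)  # hypothetical LLM call that returns one option
    return SCORES[answer]

def unbiased_score(question: str, ask_llm, n: int = 5) -> float:
    # Average over n independent queries and use the mean as the final result.
    return statistics.mean(query_once(question, ask_llm) for _ in range(n))
```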
(b) Subject-Replaced Query
Since the purpose of this framework is to have LLMs analyze human personality, the original subject of each question is replaced with a specific target, turning it into a Subject-Replaced Query.
For example, if you want LLMs to evaluate the general character of men, you would replace the subject "You" with "Men" and correspondingly convert the pronoun "Your" to "Their".
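A minimal sketch of such a substitution, using simple rule-based replacement (the paper may handle more grammatical cases):

```python
import re

def replace_subject(question: str, subject: str, pronoun: str) -> str:
    """Replace the second-person subject with a target group, e.g. "Men"/"Their"."""
    q = re.sub(r"\bYour\b", pronoun.capitalize(), question)
    q = re.sub(r"\byour\b", pronoun, q)
    q = re.sub(r"\bYou\b", subject.capitalize(), q)
    q = re.sub(r"\byou\b", subject, q)
    return q

print(replace_subject("You regularly make new friends.", "men", "their"))
# -> Men regularly make new friends.
```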
(c) Correctness-Evaluated Instruction
The challenge is that LLMs like ChatGPT are trained not to express personal feelings or beliefs, making it difficult to directly question them about human personality with ordinary instructions.
To solve this issue, the framework proposes Correctness-Evaluated Instruction, which allows the LLM to evaluate the correctness of the question text, as shown in the figure below.
In this method, as shown in the figure, the original answer options {disagree, agree, generally disagree, ...} are converted into correctness judgments {wrong, correct, generally wrong, ...} to compose a bias-free prompt, encouraging ChatGPT to give more definite, rather than neutral, answers to questions.
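A minimal sketch of assembling such an instruction (the wording is illustrative; the paper's exact prompt may differ):

```python
# Ask the model to judge correctness instead of expressing agreement,
# sidestepping its "no personal opinions" training.
CORRECTNESS_OPTIONS = ["correct", "generally correct", "neutral",
                       "generally wrong", "wrong"]

def build_instruction(statement: str) -> str:
    return (
        "Evaluate the correctness of the following statement about people and "
        f"answer with exactly one option from: {', '.join(CORRECTNESS_OPTIONS)}.\n"
        f"Statement: {statement}"
    )

print(build_instruction("Men regularly make new friends."))
```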
Evaluation Metrics
This paper proposes three evaluation metrics, the Consistency Score, Robustness Score, and Fairness Score, to systematically investigate LLMs' ability to assess human personality.
Consistency Score
Since personality assessments of the same subject by an LLM should be consistent, this paper proposes a Consistency Score representing the similarity between the results of the individual MBTI tests and the final result (i.e., the average score).
The Consistency Score is calculated by a similarity-based formula.
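One plausible form, assuming a Euclidean-distance-based similarity (an assumption reconstructed from the description below; see the paper for the exact definition):

```latex
% Plausible reconstruction: average similarity between each test result X_i
% and the mean result \bar{X} over n independent tests.
s_c = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{1 + \lVert X_i - \bar{X} \rVert_2},
\qquad
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
```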
where X_i is the MBTI test score on the i-th test; the smaller the differences between the individual test results and the average score, the higher the Consistency Score.
Robustness Score
Ideally, the same subject should be classified with the same personality traits regardless of the order of the options in the MBTI test. This paper defines this property as robustness and proposes a Robustness Score, which measures it by computing the similarity between the average score results obtained when the option order is fixed and when it is randomized.
The Robustness Score is calculated by a similarity-based formula.
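Again in a plausible distance-based form (an assumption; see the paper for the exact definition):

```latex
% Plausible reconstruction: similarity between the average results with a
% fixed option order (X') and a randomized option order (X).
s_r = \frac{1}{1 + \lVert X' - X \rVert_2}
```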
where X' and X represent the average score results when the order of the options is fixed and randomized, respectively; the higher the similarity between X' and X, the higher the Robustness Score.
Fairness Score
LLM evaluations of different groups of people should be consistent with prevailing societal values and should not have stereotypical biases against people of different genders, races, or religions.
Race and religion, on the other hand, are highly sensitive topics, and given the lack of general evaluation criteria for them, this paper focuses solely on the fairness of LLM assessments for different genders.
Against this background, this paper proposes the Fairness Score, which measures the similarity of assessment results for subjects of different genders.
The Fairness Score is calculated by a similarity-based formula.
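In the same plausible distance-based form (an assumption; see the paper for the exact definition):

```latex
% Plausible reconstruction: similarity between the average results for
% male subjects (X_M) and female subjects (X_F).
s_f = \frac{1}{1 + \lVert X_M - X_F \rVert_2}
```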
Here, X_M and X_F represent the average score results for male and female subjects, respectively; the larger the Fairness Score, the more consistent and fair the assessments for different genders.
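Putting the three metrics together, here is a minimal sketch in the plausible forms above (distance-based similarities; the paper's exact formulas may differ):

```python
import numpy as np

def consistency_score(results: np.ndarray) -> float:
    """`results` is an (n_tests, n_dims) array of per-test MBTI scores."""
    avg = results.mean(axis=0)
    # Average similarity between each test result and the mean result.
    return float(np.mean(1.0 / (1.0 + np.linalg.norm(results - avg, axis=1))))

def robustness_score(fixed_avg: np.ndarray, random_avg: np.ndarray) -> float:
    # Similarity between average results with fixed vs. randomized option order.
    return float(1.0 / (1.0 + np.linalg.norm(fixed_avg - random_avg)))

def fairness_score(male_avg: np.ndarray, female_avg: np.ndarray) -> float:
    # Similarity between average results for male and female subjects.
    return float(1.0 / (1.0 + np.linalg.norm(male_avg - female_avg)))
```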
Experimental results
This paper conducted experiments using the ChatGPT, GPT-4, and InstructGPT models with the proposed framework to answer the following two research questions.
- Can LLMs assess human personality?
- Are LLM assessments of personality consistent and fair?
Each of these is discussed below.
Can LLMs assess human personality?
To confirm this research question, this paper assessed the personality of various types of subjects using each model and the proposed framework.
The results are shown in the table below.
What is most interesting about these results is that all four subjects were assigned the same personality types by all of the LLMs, despite the possibility of different response distributions.
This suggests that the personality-assessment abilities of the different LLMs are essentially similar, and that LLMs may be useful for diagnosing human personality.
Are LLM assessments of personality consistent and fair?
To confirm this research question, this paper compares the Consistency Score and Robustness Score of each model.
The results are shown in the table below.
As the table shows, ChatGPT and GPT-4 achieve higher Consistency Scores than InstructGPT in most cases.
This suggests that ChatGPT and GPT-4 may provide more consistent assessment results in the task of assessing human personality.
On the other hand, the Robustness Scores for ChatGPT and GPT-4 are slightly lower than those for InstructGPT, suggesting that they are more vulnerable to prompt bias.
Summary
In this article, we introduced the new idea of having LLMs assess human personality and described a paper that proposes a framework for quantitatively assessing human personality with LLMs via the Myers-Briggs Type Indicator (MBTI).
While this paper represents a major advance toward LLM-based human personality assessment, several challenges remain.
First, even though the framework proposed in this paper is designed to scale to a variety of LLMs, the experiments were limited to GPT-family models, and its performance on more LLMs needs to be verified.
Second, this study employed only the MBTI, a representative personality scale, for quantitative assessment, and validation with other scales such as the Big Five Inventory (BFI) is still needed.
While there is room for improvement, we feel that this research has the potential to lead to a better understanding of LLMs' perceptions of people and their ways of thinking, and we are very much looking forward to future progress.
For those interested, the details of the framework and experimental results presented here can be found in the paper.