
OpenToM, A Benchmark For Evaluating Whether An LLM Has A "Theory of Mind," Is Now Available!

Datasets

3 main points
✔️ Propose OpenToM, a new benchmark for assessing the ability of generative agents to reason about psychological states
✔️ The task formulation allows more fine-grained questions to be asked
✔️ Large-scale validation verifies whether LLMs have a "theory of mind"

OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models
written by Hainiu Xu, Runcong Zhao, Lixing Zhu, Jinhua Du, Yulan He
(Submitted on 8 Feb 2024 (v1), last revised 14 Feb 2024 (this version, v2))
Comments: Published on arXiv.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In recent years, numerous experiments have been conducted on the hypothesis that Large Language Models (LLMs), such as ChatGPT, may possess ToM (Theory of Mind): the ability to recognize that others perceive the world differently and to reason about those differences.

However, existing benchmarks for evaluating N-ToM (Neural Theory-of-Mind), the ability of LLMs to perform ToM reasoning, suffer from several shortcomings:

  • The characters have no personality traits.
  • The characters' actions have no motivation (e.g., why does Sam want to move an object?).
  • There are few questions about the characters' psychological states.

Against this background, this article describes a paper that builds OpenToM, a new benchmark for evaluating the ability of LLMs to reason about characters' mental states in both the physical and psychological worlds, and that verifies whether LLMs have a "theory of mind" through large-scale evaluation.

OpenToM Pipeline

A typical OpenToM story is built around two characters, an object of interest, and several locations and containers, with one of the two characters taking the role of the mover, who performs an action, and the other taking the role of the observer, who witnesses it.

The flow of a task carried out by the mover and the observer is shown in the figure below.

Here, Amy is the mover and Sam is the observer, and the task is to move the rubber duck from the basket to the backpack.

As noted at the bottom of the figure, each OpenToM task is accompanied by three question types: Loc, which asks about the location of an object; MultiHop, which requires reasoning skills and social common sense; and Attitude, which asks about the characters' attitudes.
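
To make the setup concrete, here is a minimal sketch of how one such example could be represented; the field names and question wording are illustrative assumptions, not the dataset's actual schema.

```python
# Minimal, illustrative sketch of one OpenToM-style example.
# Field names and question wording are hypothetical, not the dataset's real schema.
example = {
    "narrative": (
        "Amy and Sam see a rubber duck in the basket. "
        "Later, Amy moves the rubber duck from the basket into her backpack."
    ),
    "characters": {"mover": "Amy", "observer": "Sam"},
    "entity": "rubber duck",
    "questions": [
        {"type": "Loc",  "text": "From Sam's perspective, where is the rubber duck?"},
        {"type": "MHop", "text": "From Sam's perspective, how would the accessibility of the rubber duck change?"},
        {"type": "Att",  "text": "What would be Sam's attitude toward Amy's action, assuming he observed it?"},
    ],
}
```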

Next, these questions will be discussed in detail.

Location(Loc)

The Loc question asks about the characters' perception of the object's location.

OpenToM also offers two types of location questions, Loc_coarse and Loc_fine: Loc_coarse asks whether the object is in its initial location, while Loc_fine asks for the object's explicit location.
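
Continuing the rubber-duck example, the two variants might be phrased roughly as follows (the wording is illustrative, not quoted from the dataset):

```python
# Illustrative Loc question pair for the rubber-duck story; the wording is hypothetical.
loc_coarse = "From Sam's perspective, is the rubber duck in its initial location?"  # answered yes / no
loc_fine   = "From Sam's perspective, where is the rubber duck now?"                # answered with a location, e.g., "basket" or "backpack"
```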

Multi-Hop (MHop)

The MHop questions require multi-hop reasoning and social common sense.

As an example, consider the question from the earlier figure: "From Sam's perspective, how would the accessibility of the rubber duck change?"

To answer this question, the model must first infer what Sam knows about what happened to the rubber duck. (Here, the duck has been moved from the basket to the backpack.)

In addition, the model must grasp the social norm that, once the duck is in Amy's backpack, others should not take things out of it without permission.

If, through this chain of reasoning, the model answers "less accessible," the answer is correct.

Attitude(Att)

Att questions test the LLMs' ability to interpret the psychological states of the characters.

Specifically, the model is asked to infer the observer's attitude toward the mover's action, with questions such as "What would be Sam's attitude toward Amy's action, assuming he observed it?"

Task Formulation

One difference between the OpenToM proposed in this paper and existing benchmarks is that its questions cover the characters' mental states with respect to both the physical world (e.g., the location of an object) and the psychological world (e.g., a character's attitude toward a particular action).

In OpenToM, all questions are formulated as binary or ternary classification tasks. With the complete story N_comp, the answer set A, a character c, and a question q_c about that character, each OpenToM task can be formulated as follows.
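
Based on the quantities just introduced, the formulation can be sketched roughly as follows (this is a reconstruction; the paper's exact notation may differ):

```latex
% Rough reconstruction of the OpenToM task formulation; notation may differ from the paper.
a^{*} = \operatorname*{arg\,max}_{a \in A} \; P\big(a \mid N_{\mathrm{comp}},\; \mathbb{1}_{\mathrm{expl}}(c),\; q_{c}\big)
```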

Here, 1_expl is an indicator function that returns 1 if the character-centric narrative is explicitly provided and 0 otherwise.

Experiments

In this paper, we conducted experiments using six representative models: Llama2-7B, Llama2-13B, Llama2-70B, Mixtral-8x7B-Instruct, GPT-3.5-Turbo, and GPT-4-Turbo.

Given that all OpenToM questions are formulated as binary or ternary classification tasks and the labels are not uniformly distributed, we used the F1 score to evaluate model performance.
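
As a rough sketch of this scoring step, assuming the model answers and gold labels have already been collected into two lists (this is not the authors' actual evaluation script, and the macro averaging is an assumption here):

```python
# Rough sketch: scoring a batch of OpenToM answers with macro-averaged F1.
# `gold` and `predicted` are assumed to be collected elsewhere; not the authors' script.
from sklearn.metrics import f1_score

gold      = ["less accessible", "more accessible", "equally accessible", "less accessible"]
predicted = ["less accessible", "less accessible", "equally accessible", "less accessible"]

# Macro-averaging weights each class equally, which matters because the labels
# in OpenToM are not uniformly distributed.
macro_f1 = f1_score(gold, predicted, average="macro")
print(f"Macro F1: {macro_f1:.3f}")
```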

The table below shows the results of the evaluation of each model in OpenToM by F1 score.

The table shows that, overall, GPT-4-Turbo significantly outperforms the other models on the Loc_coarse, MHop, and Att questions.

On the other hand, it is also very interesting that, while GPT-4-Turbo leads the other models in most question genres, it loses out to them on the Loc_fine questions.

GPT-4-Turbo significantly outperformed the other models on the MHop questions, indicating that it is capable of inferences requiring social common sense, whereas the lower MHop scores of the other models mean this point cannot be sufficiently verified for them.

Therefore, in this paper, additional experiments were conducted using the Self-Ask prompt shown in the figure below.

The Self-Ask prompt is a prompting technique that explicitly presents a series of follow-up questions to the LLM and encourages it to derive the final answer by answering them in turn.
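
As a minimal sketch of what such a prompt could look like for an Att question (the follow-up questions and answer options below are illustrative assumptions, not the paper's actual prompt):

```python
# Illustrative Self-Ask-style prompt for an Att (attitude) question.
# The follow-up questions and answer options are hypothetical, not the paper's actual prompt.
narrative = "..."  # the OpenToM story would be inserted here

self_ask_prompt = f"""
Narrative: {narrative}

Question: What would be Sam's attitude toward Amy's action, assuming he observed it?

Follow-up: What action did Amy perform?
Intermediate answer:
Follow-up: How does this action affect Sam?
Intermediate answer:
Follow-up: Given that effect, how would Sam feel about Amy's action?
Intermediate answer:

Final answer (positive / neutral / negative):
""".strip()
```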

The table below shows the results of re-running only the Att questions using the Self-Ask prompt.

Although the Self-Ask prompt improved the LLMs' F1 scores, they still fell far short of human performance, and this experiment made it clear that the LLMs lack the ability to perceive the psychological states of the characters.

Summary

In this article, we described a paper that built OpenToM, a new benchmark for evaluating the ability of LLMs to reason about characters' mental states in both the physical and psychological worlds, and that verified whether LLMs have a "theory of mind" through large-scale evaluation.

The experiments in this paper revealed that LLMs, especially GPT-4, have the ability to reason based on location information and social common sense, but that they lack the ability to perceive the characters' psychological states and therefore cannot be said to possess a "theory of mind."

On the other hand, there is still room for improvement, in that the experiments only evaluated the LLMs in a zero-shot setting and used only a limited number of open-source LLMs.

We very much look forward to future progress, as further research may address these limitations and prove that LLMs have a "theory of mind."

For those interested, the details of the OpenToM pipeline and the experimental results presented here can be found in the paper.
