A Benchmark Is Now Available To Evaluate How Well AI Agents Capture The Implicit Intentions Of Users!
3 main points
✔️ Propose IN3 (Intention-in-Interaction), a new benchmark to evaluate how well agents understand users' implicit intentions
✔️ Design an interaction-specific model, Mistral-Interact, and integrate it into the existing XAgent framework
✔️ Comprehensive experiments confirm that the model understands and summarizes user intentions with more than 96% coverage
Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents
written by Cheng Qian, Bingxiang He, Zhong Zhuang, Jia Deng, Yujia Qin, Xin Cong, Zhong Zhang, Jie Zhou, Yankai Lin, Zhiyuan Liu, Maosong Sun
(Submitted on 14 Feb 2024 (v1), last revised 15 Feb 2024 (this version, v2))
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
In recent years, Large Language Models (LLMs) such as OpenAI GPT, LLaMA, and Mistral have made significant advances in generating high-quality text and code.
A distinctive use of these models is as AI agents: the language model interacts with the outside world and receives feedback in order to assist the user with tasks, and various open-source frameworks such as BabyAGI, AutoGen, and CAMEL have been developed for this purpose.
On the other hand, the following problems have been pointed out with such frameworks:
- The user's initial instructions to the agent system are often vague and terse, so the agent fails to capture the user's true intentions
- Although different users have diverse implicit intentions that require explicit querying and elicitation, the LLM does not engage in this kind of user interaction
These problems often lead to "fake success" in agent task execution, where the task appears to have been accomplished but deviates significantly from the user's true intentions.
However, existing agent benchmarks usually assume that the given task is clear and do not evaluate the ability to understand the user's intentions, which is an important aspect of agent behavior.
Against this background, this article describes a paper proposing IN3 (Intention-in-Interaction), a new benchmark that aims to evaluate how well an agent understands the user's implicit intentions through explicit task ambiguity judgments and user queries.
Intention-in-Interaction Benchmark
Previous agent benchmarks assumed that the given task was clear and were intended to evaluate the agent's ability to perform the task.
For example, given the task "Locate the best yoga class in my city," questions arise as to where "my city" is and what the criteria for "best" are.
To solve these problems, agents need to proactively query for missing details and understand the implicit intentions of the user.
In this paper, we propose IN3 (Intention-in-Interaction) as a benchmark for assessing the ability of LLMs to clearly understand these user intentions.
An overview of IN3 is shown in the figure below.
As shown in the figure, in Step 1 of IN3, the model iteratively generates new tasks to augment the dataset, starting from human-written seed tasks.
At the same time, in Step 2, demonstrations are sampled from the dataset as new examples for running the next generation round.
Then, in Step 3, human annotation is performed with the help of GPT-4 for the ambiguity of each task, its missing details, the importance of each detail, and potential options.
By following these steps, IN3 provides a wide variety of agent tasks across hundreds of categories, such as cooking, art, and programming, and annotates whether each task is vague, which details are missing if so, and the importance of each missing detail on a three-level scale.
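As a rough illustration of this construction loop, the sketch below shows how the three steps could fit together. This is a minimal sketch, not the authors' actual pipeline; the helpers `generate_new_tasks` and `annotate` are hypothetical placeholders.

```python
import random

def build_in3_style_dataset(seed_tasks, rounds, model, annotator):
    """Sketch of an iterative IN3-style data construction loop (illustrative only)."""
    dataset = list(seed_tasks)  # Step 1: start from human-written seed tasks
    for _ in range(rounds):
        # Step 2: sample existing records as demonstrations for the next generation round
        demonstrations = random.sample(dataset, k=min(4, len(dataset)))
        # Step 1 (iterated): the model generates new tasks conditioned on the demonstrations
        new_tasks = model.generate_new_tasks(demonstrations)
        # Step 3: annotate vagueness, missing details, importance, and options
        # (GPT-4-assisted and human-checked in the paper)
        dataset.extend(annotator.annotate(task) for task in new_tasks)
    return dataset
```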
Returning to the earlier example, IN3 annotates the missing details, such as the city where the user lives and the criteria for "best," together with possible options for each detail and the option that reflects the user's actual intention.
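Concretely, an annotated record for this task might look roughly like the following. The field names and values here are illustrative assumptions, not the benchmark's exact schema.

```python
# Hypothetical annotated record for the yoga-class task (illustrative schema only).
yoga_task_record = {
    "task": "Locate the best yoga class in my city",
    "is_vague": True,
    "missing_details": [
        {
            "detail": "Which city the user lives in",
            "importance": 3,  # e.g., 1 (low) to 3 (high)
            "options": ["New York", "London", "Tokyo"],
        },
        {
            "detail": "Criteria for 'best'",
            "importance": 2,
            "options": ["instructor quality", "price", "distance from home"],
        },
    ],
}
```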
Method
Along with IN3, this paper proposes incorporating a model upstream of the agent design to enhance user-agent interaction.
Specifically, an interaction-specific model called Mistral-Interact is incorporated into the XAgent framework, an autonomous agent system for solving complex tasks, to create a powerful model that understands the user's specific intentions.
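Conceptually, the interaction model sits upstream of the executor agent, roughly as in the sketch below. This is an illustrative simplification under assumed interfaces; `interaction_model`, `executor`, and their methods are placeholders, not the actual Mistral-Interact or XAgent APIs.

```python
def run_agent_pipeline(user_task, interaction_model, executor, user):
    """Upstream intention clarification followed by downstream task execution (sketch)."""
    # 1. The interaction-specific model judges whether the task is vague.
    if interaction_model.is_vague(user_task):
        # 2. It proactively asks the user about missing details.
        answers = []
        for question in interaction_model.clarifying_questions(user_task):
            answers.append(user.ask(question))
        # 3. It summarizes the explicit task together with the recovered implicit intentions.
        user_task = interaction_model.summarize(user_task, answers)
    # 4. The refined, unambiguous task is handed to the executor agent (e.g., XAgent).
    return executor.execute(user_task)
```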
Metrics
This paper also proposes new evaluation metrics that translate the user's subjective intentions in user-agent interactions into objective numbers, as shown below (a short computational sketch follows the list):
- Vagueness Judgement Accuracy: Calculates the percentage of times the model's judgment of a task's vagueness matches the correct answer.
- Missing Details Recover Rate: Calculates what percentage of the missing details of different importance levels were queried by the model during the interaction.
- Summary Intention Coverage Rate: Calculates what percentage of the intentions provided by the user are ultimately explicitly summarized by the model.
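To make these definitions concrete, the three metrics could be computed from annotated interaction records roughly as follows. This is a minimal sketch under assumed record fields, not the paper's evaluation code.

```python
def vagueness_judgement_accuracy(records):
    """Fraction of tasks where the model's vague/clear judgement matches the label."""
    correct = sum(r["predicted_vague"] == r["label_vague"] for r in records)
    return correct / len(records)

def missing_details_recover_rate(records, importance_level):
    """Fraction of missing details of a given importance that the model asked about."""
    details = [d for r in records for d in r["missing_details"]
               if d["importance"] == importance_level]
    asked = sum(d["queried_by_model"] for d in details)
    return asked / len(details)

def summary_intention_coverage_rate(records):
    """Fraction of user-provided intentions covered by the model's final summary."""
    provided = sum(len(r["user_intentions"]) for r in records)
    covered = sum(len(r["covered_intentions"]) for r in records)
    return covered / provided
```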
In this paper, experiments were conducted on IN3 using the method and evaluation metrics described above.
Experiments
The experiments in this paper compared the aforementioned model, which integrates Mistral-Interact with the XAgent framework, against the existing models LLaMA-2-7B, Mistral-7B, and GPT-4.
The results of the experiment are shown in the table below.
The table confirms that of all the open source models, Mistral-Interact performs the best.
In addition, the Summary Intention Coverage Rate in the table shows that more than 96% of the user's intentions were adequately summarized, demonstrating a particularly strong ability to produce comprehensive summaries of user intentions.
Summary
How was it? In this issue, we discussed a paper proposing IN3 (Intention-in-Interaction), a new benchmark that aims to evaluate how well an agent understands the user's implicit intentions through explicit task ambiguity judgments and user queries.
The experiments in this paper demonstrated the effectiveness of IN3 and the proposed model, and in particular showed that the proposed model captures user intentions very well.
On the other hand, further technical improvements are possible, such as allowing the model to simulate the user's individual tone (angry, calm, etc.) and response style (concise, verbose, etc.), and giving the model access to the user's past conversation history, which could allow personal preferences to be represented in more detail.
With these improvements, LLMs may in the future go beyond supporting the user and come to act on the user's behalf, so we very much look forward to future developments.
Those interested in the details of the benchmark and experimental results are encouraged to refer to the paper.