AgentBench, A Comprehensive Benchmark For Evaluating AI Agent Performance, Is Now Available!
3 main points
✔️ Proposed AgentBench, a comprehensive benchmark for evaluating agents generated by large-scale language models
✔️ Conducted a large-scale comparative experiment on 25 different large-scale language models using tasks spanning 8 different environments and datasets
✔️ Experimental results reveal large performance differences between API-based and open-source large-scale language models
AgentBench: Evaluating LLMs as Agents
written by Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang
(Submitted on 7 Aug 2023)
Comments: Published on arXiv.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
code: https://github.com/THUDM/AgentBench
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
With the advent of large-scale language models (LLMs) such as GPT-4, it has become apparent that LLMs can not only handle traditional natural language tasks such as question answering, natural language inference, and text summarization, but may also be able to understand human intentions and carry out instructions.
Against this background, the development of various applications that employ LLMs to pursue goals autonomously, such as AutoGPT, BabyAGI, and AgentGPT, has generated great public interest and much discussion.
Despite these advances, a key challenge has been the lack of a systematic and standardized benchmark for evaluating these LLM agents.
To address this issue and evaluate the performance of LLM-based agents, this paper proposes AgentBench, a comprehensive benchmark consisting of eight tasks and environments based on real-world scenarios, and describes a large-scale comparative experiment using it on 25 different LLMs.
Composition of AgentBench
An overview of the AgentBench proposed in this paper is shown in the figure below.
AgentBench is a benchmark in which LLM-based agents interact with eight environments based on real-world scenarios: Operating System (OS), Database (DB), Knowledge Graph (KG), Digital Card Game (DCG), Lateral Thinking Puzzles (LTP), House-holding (HH), Web Shopping (WS), and Web Browsing (WB).
Let's look at them one at a time.
Operating System (OS)
Being able to access and operate an operating system through an LLM in a terminal is a natural first step for users, and although there have been attempts to translate natural language into shell commands, few prior studies have evaluated them in a real, executable environment.
The purpose of this task is to evaluate whether an agent can carry out a series of operations (e.g., recursively setting the files in a directory to read-only) in an interactive bash environment (e.g., an Ubuntu Docker container) on a real operating system.
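To make the setup more concrete, here is a minimal sketch of the kind of interaction this task involves: a shell command plays the role of the agent's action, and a simple check plays the role of the environment's success test. This is not AgentBench's actual harness; the directory, files, and read-only instruction are hypothetical examples.

```python
# Minimal illustrative sketch of an OS-style task: an "action" is a bash
# command, and the environment checks whether the goal state was reached.
# This is NOT AgentBench's actual harness; paths and checks are made up.
import os
import subprocess
import tempfile

def run_bash(command: str) -> str:
    """Run a bash command and return its combined stdout/stderr."""
    result = subprocess.run(["bash", "-c", command],
                            capture_output=True, text=True)
    return (result.stdout + result.stderr).strip()

# Set up a throwaway directory with a few files for the task.
workdir = tempfile.mkdtemp()
for name in ("a.txt", "b.txt"):
    open(os.path.join(workdir, name), "w").close()

# Agent action: recursively make everything under the directory read-only.
run_bash(f"chmod -R a-w {workdir}")

# Environment-side check: no file in the directory should still be writable.
still_writable = [f for f in os.listdir(workdir)
                  if os.access(os.path.join(workdir, f), os.W_OK)]
print("task solved:", len(still_writable) == 0)
```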
Database (DB)
It is also very important to examine an LLM's ability to manipulate a real database via SQL, since operating on databases is a typical task that real users would want an LLM to perform.
Against this background, this task evaluates the behavior of LLM agents on an authentic SQL interface and real-world databases.
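As a rough illustration (using SQLite rather than the databases AgentBench actually ships with), evaluating such a task boils down to executing the model-generated SQL and comparing its result against a reference answer. The table, question, and queries below are hypothetical.

```python
# Illustrative sketch only: checking a model-generated SQL query against a
# reference answer on a toy SQLite database. AgentBench's real DB task uses
# its own databases and evaluation protocol; this only shows the general idea.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('alice', 120.0), ('bob', 80.0), ('alice', 40.0);
""")

# Hypothetical query produced by the agent for the instruction
# "What is the total order amount for alice?"
generated_sql = "SELECT SUM(amount) FROM orders WHERE customer = 'alice';"
reference_sql = "SELECT SUM(amount) FROM orders WHERE customer = 'alice';"

generated = conn.execute(generated_sql).fetchall()
reference = conn.execute(reference_sql).fetchall()
print("correct" if generated == reference else "incorrect")
```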
Knowledge Graph (KG)
Working with Knowledge Graphs (KG) requires an agent to break down complex tasks into simpler, more manageable components and to plan, strategize, and adapt as needed.
Therefore, knowledge graphs are useful for assessing the decision-making capabilities of agents in complex real-world situations, and this task uses knowledge graphs to assess the flexibility and adaptability of agents in decision making.
Digital Card Game (DCG)
Games that require strategy and planning can serve as simulation environments for agent development, and while some recent studies have employed real-world games (e.g., MineDojo), most of these demand multimodal capabilities that go beyond what existing LLMs can offer, which is a problem.
Against this background, this paper instead uses digital card games (such as Hearthstone), which involve rich textual descriptions of cards, turn-based competition, and strategies for winning, and thus require the agent to make strategic decisions.
Therefore, in this task, agent performance is evaluated using the game Aquawar, in which the agent, as a player, manages a team of fish with various abilities and battles another team in a turn-based format.
Lateral Thinking Puzzles (LTP)
Lateral Thinking Puzzles (LTP) is a group game popular around the world in which players ask questions about a riddle-like scenario and the moderator may answer only with "yes," "no," or "irrelevant."
A typical puzzle is, for example: "A man walked into a restaurant, ordered a bowl of turtle soup, and after finishing it, he committed suicide. Why did he do that?" The task is divided into four levels of difficulty.
The agent repeatedly poses questions to the moderator, and the game ends once the agent's reasoning reaches the correct answer. The evaluation focuses on how quickly the agent arrives at the correct answer.
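The interaction can be pictured as a simple loop in which the agent's questions and the moderator's restricted answers accumulate into a transcript until the hidden story is reconstructed. The sketch below is only a schematic of that loop; the keyword-based moderator and the questions are toy stand-ins (in AgentBench the moderator role is itself played by a language model).

```python
# Schematic of the LTP question-answer loop. The keyword-based moderator and
# the questions are toy stand-ins; AgentBench uses an LLM as the moderator.
def toy_moderator(question: str) -> str:
    """Answer only with 'yes', 'no', or 'irrelevant' based on a hidden story."""
    q = question.lower()
    if "soup" in q or "taste" in q:
        return "yes"
    if "restaurant" in q or "waiter" in q:
        return "irrelevant"
    return "no"

transcript = []
questions = [  # hypothetical agent questions
    "Does the restaurant itself matter?",
    "Had he eaten this soup before?",
    "Did the taste of the soup tell him something?",
]
for round_id, question in enumerate(questions, start=1):
    answer = toy_moderator(question)
    transcript.append((round_id, question, answer))
    print(f"round {round_id}: {question} -> {answer}")

# Evaluation would then consider, for example, how few rounds were needed
# before the agent's reconstruction of the story matches the hidden solution.
```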
House-holding (HH)
This task uses ALFWorld, a virtual environment from existing research that is designed to mimic a typical home. The agent is given a description of the ALFWorld environment and an instruction that serves as the goal (e.g., place a lamp on the table).
Feedback is then returned from the simulated environment each time the agent acts, and the agent is evaluated on whether it can ultimately accomplish the task.
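The evaluation loop can be sketched as follows: after each action the environment returns a textual observation, and the episode ends when the goal is reached. This is a generic stand-in, not the actual ALFWorld API; the tiny environment class and the scripted "agent" plan are hypothetical.

```python
# Generic observation-action loop for a household task; NOT the real ALFWorld
# API. The tiny text environment and the scripted "agent" are hypothetical.
class ToyHouseEnv:
    def __init__(self):
        self.state = {"lamp": "shelf"}  # goal: lamp ends up on the desk

    def step(self, action: str) -> tuple[str, bool]:
        """Apply a text action and return (observation, done)."""
        if action == "take lamp from shelf" and self.state["lamp"] == "shelf":
            self.state["lamp"] = "agent"
            return "You pick up the lamp.", False
        if action == "put lamp on desk" and self.state["lamp"] == "agent":
            self.state["lamp"] = "desk"
            return "You put the lamp on the desk. Goal reached.", True
        return "Nothing happens.", False

env = ToyHouseEnv()
plan = ["look around", "take lamp from shelf", "put lamp on desk"]  # scripted agent
for step_id, action in enumerate(plan, start=1):
    observation, done = env.step(action)
    print(f"step {step_id}: {action} -> {observation}")
    if done:
        break
```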
Web Shopping (WS)
Online shopping has become an important part of modern life, and WebShop, an existing virtual online shopping environment, is useful for evaluating agents' reasoning and decision-making abilities, such as searching for, browsing, and selecting the products a user wants on a website.
In this task, the agent is first given environment information and a prompt specifying the format in which it should respond, and the user then tells the agent what product they wish to purchase.
The agent then follows the prompt to search for products using the search function and by clicking buttons, and is evaluated on its ability to complete the series of steps up to purchasing a product that matches the user's request.
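In WebShop-style environments, the agent's replies typically have to follow a constrained action format (roughly along the lines of search[...] and click[...]). The parser below is only an illustration of how such replies might be validated, not the actual WebShop or AgentBench code, and the example replies are made up.

```python
# Illustrative parser for a constrained web-shopping action format, roughly in
# the spirit of WebShop's search[...] / click[...] actions. This is not the
# actual WebShop or AgentBench implementation.
import re

ACTION_PATTERN = re.compile(r"^(search|click)\[(.+)\]$")

def parse_action(reply: str):
    """Return (action_type, argument) or None if the reply is malformed."""
    match = ACTION_PATTERN.match(reply.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

# Hypothetical agent replies during one shopping episode.
for reply in ["search[wireless mouse under $20]",
              "click[item_b07xyz]",
              "click[buy now]",
              "add to cart"]:  # last one is deliberately malformed
    print(reply, "->", parse_action(reply))
```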
Web Browsing (WB)
The recently released Mind2Web serves as a general benchmark for developing and evaluating agents that can perform complex tasks across a variety of website domains based on high-level user instructions.
This task uses Mind2Web to evaluate the agent's ability to accomplish tasks when given high-level instructions by the user (e.g., find an intermediate programming course with a rating of 4 or higher and a duration of 3-6 hours, add it to the cart, and check out).
Evaluation of AgentBench
In order to systematically investigate the performance of agents built from existing LLMs, this paper presents an extensive evaluation using AgentBench on 25 different LLMs, including both API-based and open-source models. (Due to limited computational resources, only open-source LLMs below 30B parameters were included.)
An overview of all the models is shown below.
In addition, to facilitate agent evaluation, the authors have designed an evaluation toolkit that allows AgentBench to be easily used with any LLM: once a model server is set up with the toolkit's corresponding standard API format, that LLM can be evaluated.
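In practice, this means a model only needs to be reachable over a simple HTTP interface. The sketch below shows one way such a client call could look, but the endpoint URL and JSON fields are hypothetical and do not reflect AgentBench's actual API specification.

```python
# Hypothetical client-side call to a model server, illustrating the idea of a
# standardized API between a benchmark and an LLM. The URL and JSON fields are
# made up and do NOT reflect AgentBench's actual API specification.
import json
import urllib.request

def query_model(messages, url="http://localhost:8000/chat"):  # hypothetical endpoint
    """Send a chat-style message history to a model server and return its reply text."""
    payload = json.dumps({"messages": messages}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["reply"]

if __name__ == "__main__":
    history = [{"role": "user", "content": "List the files in /tmp."}]
    print(query_model(history))  # requires a server running at the URL above
```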
A summary of the evaluation results using AgentBench is shown in the figure below.
As can be read from the figure, while GPT-4 and other API-based LLM agents show strong performance, there is a clear performance difference between the open source and API-based models.
Additionally, the table below shows the overall AgentBench scores by model. (VER represents the model version and OA represents the overall AgentBench score derived from a weighted average of all tasks.)
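As a concrete illustration of how an overall (OA) score can be formed from per-task results, the snippet below computes a weighted average. The task weights and scores are invented for the example; they are not the paper's actual numbers or its exact weighting scheme.

```python
# Toy calculation of an overall score as a weighted average of per-task scores.
# The weights and scores are invented for illustration only; they are not the
# paper's actual numbers or its exact weighting scheme.
task_weights = {"OS": 1.0, "DB": 1.0, "KG": 1.0, "DCG": 1.0,
                "LTP": 1.0, "HH": 1.0, "WS": 1.0, "WB": 1.0}
task_scores = {"OS": 40.0, "DB": 30.0, "KG": 55.0, "DCG": 70.0,
               "LTP": 15.0, "HH": 75.0, "WS": 60.0, "WB": 25.0}

overall = (sum(task_weights[t] * task_scores[t] for t in task_weights)
           / sum(task_weights.values()))
print(f"overall score: {overall:.2f}")
```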
Consistent with the previous figure, GPT-4 shows the best performance in seven of the eight AgentBench tasks, and the other API-based LLMs also achieve high, if somewhat lower, performance.
On the other hand, most open-source LLMs perform considerably worse than the API-based LLMs, and even the best-performing open-source model, openchat-13b, shows a clear performance gap relative to gpt-3.5-turbo.
This is in contrast to recent findings that some open source LLMs are comparable to gpt-3.5-turbo and gpt-4, underscoring the need for further efforts to improve the performance of open source LLMs.
Summary
How was it? In this article, we described a paper that proposed AgentBench, a comprehensive benchmark composed of eight tasks and environments based on real-world scenarios for evaluating the performance of LLM-based agents, and that conducted a large-scale comparative experiment using 25 different LLMs, including both API-based and open-source models.
In the large-scale comparative experiments conducted in this paper, while GPT-4 and other API-based LLM agents demonstrated superior performance, significant performance differences between them and open-source LLM agents were evident.
It is hoped that this paper will encourage the further development of open-source models, which will be essential if LLM agents are to be deployed more widely in society and to address real-world challenges.
For those who are interested, the details of each AgentBench task and of the comparative experiments can be found in the original paper.