ChatEval, An Evaluation Framework That Allows AI Agents To Discuss With Each Other, Is Now Available!
3 main points
✔️ Proposes ChatEval, a multi-agent framework that allows multiple agents to discuss and evaluate autonomously
✔️ Group discussion among Debater Agents enables evaluation similar to human annotators
✔️ Demonstrates the need for annotators with diverse roles in the evaluation process
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
written by Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, Zhiyuan Liu
(Submitted on 14 Aug 2023)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Evaluating the quality of text, whether generated by language models or written by humans, is an important problem. The traditional approach has been to have humans annotate the text, but this is widely considered impractical in terms of time and cost.
In response, automatic evaluation metrics based on n-gram overlap, such as ROUGE, BLEU, and METEOR, have been proposed, but these methods have been shown to correlate only weakly with human judgment, especially for open-ended generation tasks and tasks requiring domain-specific expertise.
Meanwhile, recent advances in natural language processing have produced large language models (LLMs) with billions of parameters, such as GPT-3, and an approach called LLM-as-a-judge has been proposed, in which LLMs are employed as annotators to evaluate the quality of natural language generation tasks such as responses to open-ended questions and summarization.
These methods, however, were devised on a single-agent basis, in which one generating agent performs the evaluation alone. In the human evaluation process, reliance on a single viewpoint often leads to bias and instability, and the same concern applies to a single evaluating agent.
Against this background, this article describes a paper that proposes ChatEval, a multi-agent framework in which multiple agents endowed with specific expertise discuss and evaluate autonomously, and that demonstrates the need for annotators with diverse roles in the evaluation process.
ChatEval
As shown in the figure below, this paper proposes ChatEval, a multi-agent framework that enables evaluations closer to those of human annotators than single-agent approaches.
Debater Agents
Debater Agents are one of the most important components of the framework: they are agents that are given specialized knowledge through their prompts and instructed to generate responses accordingly.
After setting up Debater Agents, a group discussion is initiated and each agent autonomously receives replies from other agents and in turn sends its own replies.
In this way, multiple agents can participate in the evaluation process as referees, and the referees can discuss among themselves, ultimately resulting in an evaluation that is closer to that of the human annotator.
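To make this concrete, the snippet below is a minimal Python sketch of how such a role-prompted Debater Agent could look. It is not the authors' implementation: `call_llm` is a placeholder for whatever chat-completion backend is used, and the prompt wording is invented for illustration.

```python
from dataclasses import dataclass, field


def call_llm(messages: list[dict]) -> str:
    """Placeholder: send `messages` to a chat model and return its reply."""
    raise NotImplementedError("plug in your preferred LLM client here")


@dataclass
class DebaterAgent:
    name: str
    role_prompt: str                      # e.g. "You are a critic focused on factual accuracy."
    chat_history: list[dict] = field(default_factory=list)

    def receive(self, speaker: str, message: str) -> None:
        # Utterances from other agents are appended to this agent's chat history.
        self.chat_history.append({"role": "user", "content": f"{speaker}: {message}"})

    def respond(self, question: str, answer_1: str, answer_2: str) -> str:
        # Judge the two candidate answers from this agent's assigned perspective,
        # taking the discussion so far (chat_history) into account.
        messages = [
            {"role": "system", "content": self.role_prompt},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Assistant 1: {answer_1}\n"
                f"Assistant 2: {answer_2}\n"
                "Discuss with the other referees and state which answer is better."
            )},
            *self.chat_history,
        ]
        reply = call_llm(messages)
        self.chat_history.append({"role": "assistant", "content": reply})
        return reply
```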
Communication Strategy
How to maintain the chat history shared between agents is another important design question in ChatEval, and the paper addresses it through what it calls Communication Strategies.
As shown in the figure below, the framework employs three different Communication Strategies: One-by-One, Simultaneous-Talk, and Simultaneous-Talk-with-Summarizer. (The direction of an arrow represents the flow of information: an agent's statements are added to the chat history of the agent the arrow points to.)
In One-by-One, the Debater Agents generate responses based on their expertise in a predetermined order in each round of discussion; when it is an agent's turn to respond, what the other agents have said so far is appended directly to that agent's chat history.
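Reusing the hypothetical DebaterAgent sketched above, one round of One-by-One might look like this: each utterance is delivered to the other agents as soon as it is produced, so later speakers see everything said before their turn.

```python
def one_by_one_round(agents: list, question: str, answer_1: str, answer_2: str) -> None:
    # One-by-One (sketch): agents speak in a predetermined order, and each
    # utterance is appended to every other agent's chat history before the
    # next agent takes its turn.
    for speaker in agents:
        utterance = speaker.respond(question, answer_1, answer_2)
        for listener in agents:
            if listener is not speaker:
                listener.receive(speaker.name, utterance)
```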
In Simultaneous-Talk, by contrast, there is no fixed speaking order: the Debater Agents are prompted to generate their responses asynchronously within each round, nullifying the influence of utterance order.
Simultaneous-Talk-with-Summarizer differs from Simultaneous-Talk in that it adds a Summarizer agent: at the end of the discussion, this agent summarizes the preceding messages, and the summary is appended to the chat history of all Debater Agents.
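Both simultaneous variants can be sketched in the same style, again building on the hypothetical DebaterAgent above; the `summarize` callable is likewise a placeholder for a Summarizer agent backed by an LLM.

```python
def simultaneous_round(agents: list, question: str, answer_1: str, answer_2: str,
                       summarize=None) -> None:
    # Simultaneous-Talk (sketch): utterances are buffered and only delivered
    # after every agent has spoken, so the speaking order has no effect.
    buffered = [(agent.name, agent.respond(question, answer_1, answer_2)) for agent in agents]
    if summarize is None:
        for listener in agents:
            for name, text in buffered:
                if name != listener.name:
                    listener.receive(name, text)
    else:
        # Simultaneous-Talk-with-Summarizer (sketch): a Summarizer agent condenses
        # the round, and its summary is appended to every Debater Agent's history.
        digest = summarize("\n".join(f"{name}: {text}" for name, text in buffered))
        for listener in agents:
            listener.receive("Summarizer", digest)
```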
Experiments
To demonstrate the effectiveness of the proposed method, ChatEval, the authors conducted comparative experiments involving human annotators and the existing method FairEval.
ChatEval was tested under two conditions: Single-Agent, with only a single agent, and Multi-Agent, with multiple Debater Agents.
The experiments follow the same evaluation protocol as existing methods, comparing the annotations produced by the LLMs with those of human annotators. The metrics used are Accuracy (Acc.), the percentage of correctly classified instances, and the Kappa correlation coefficient (Kap.), which measures the agreement between the model's results and those of the human annotators.
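As a toy illustration of the two metrics (the labels below are invented, not data from the paper), accuracy and the kappa coefficient, interpreted here as Cohen's kappa, can be computed from preference labels as follows; a library function such as scikit-learn's `cohen_kappa_score` would serve the same purpose.

```python
from collections import Counter


def accuracy(model_labels: list[str], human_labels: list[str]) -> float:
    # Fraction of instances where the model's label matches the human label.
    return sum(m == h for m, h in zip(model_labels, human_labels)) / len(human_labels)


def cohens_kappa(model_labels: list[str], human_labels: list[str]) -> float:
    # Agreement corrected for the agreement expected by chance.
    n = len(human_labels)
    p_o = accuracy(model_labels, human_labels)            # observed agreement
    m_counts, h_counts = Counter(model_labels), Counter(human_labels)
    labels = set(model_labels) | set(human_labels)
    p_e = sum(m_counts[c] * h_counts[c] for c in labels) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)


# Invented toy labels: "1" = Assistant 1 wins, "2" = Assistant 2 wins, "tie" = draw.
human = ["1", "2", "tie", "2", "1", "2"]
model = ["1", "2", "2", "2", "1", "tie"]
print(f"Acc. = {accuracy(model, human):.2f}, Kap. = {cohens_kappa(model, human):.2f}")
```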
The results of the comparison experiment were as follows.
As indicated in bold in the table, the proposed method, ChatEval, performed best in both evaluations, demonstrating its effectiveness.
In addition, the paper includes a qualitative analysis, beginning with an open-ended question posed to two assistants: "What are the most effective ways to deal with stress?"
Assistant 1's response is shown below.
Assistant 2's response is shown below.
The evaluation process carried out by the three Debater Agents, Alice, Bob, and Carol, for these responses is shown in the figure below.
After receiving the two assistants' responses, Alice first pointed out that Assistant 2's response contained more detail and argued that it gave the better answer.
Bob agreed with Alice's assessment while arguing that Assistant 1's answer also posed a succinct and thought-provoking question, and Carol offered the feedback that both answers are equally valuable.
In the ensuing discussion, Bob noted that Assistant 1's response was frank while Assistant 2's was detailed, and at the end of the discussion he output the same evaluation result as the human annotation.
The above demonstrates that ChatEval is not just a rating tool: by simulating the exchanges of a human discussion, it also captures nuances that are often missed from a single perspective.
Summary
How was it? In this issue, we described a paper that proposes ChatEval, a multi-agent framework in which multiple agents endowed with specific expertise discuss and evaluate autonomously, and that demonstrates the need for annotators with diverse roles in the evaluation process.
The paper is highly thought-provoking, showing that generating agents with various role settings in the evaluation process and having them engage in discussion supports comprehensive evaluations that are closer to human judgment.
Readers interested in the details of the ChatEval evaluation process and the comparative experiments are encouraged to consult the original paper.