A Framework Is Now Available That Brings Out Performance Beyond That Of GPT-4 By Allowing Diverse Agents To Debate Each Other!
3 main points
✔️ Proposed RECONCILE, a multi-agent framework designed for a debate process among diverse agents
✔️ Achieved performance equal to or better than GPT-4 by having agents with low performance debate each other
✔️ Successfully improved the performance of GPT-4 by obtaining external feedback from a variety of agents
ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
written by Justin Chih-Yao Chen, Swarnadeep Saha, Mohit Bansal
(Submitted on 22 Sep 2023)
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence(cs.AI); Machine Learning (cs.LG)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
In recent years, researchers have tried to improve the reasoning capabilities of large language models (LLMs) by mimicking human cognitive processes, such as reflecting on one's own predictions and learning from feedback.
In addition, there is active research on incorporating the society of mind (i.e., the idea that a mind arises when many agents gather and interact with each other) into multi-agent systems in order to promote more diverse thinking.
Because communication among multiple agents plays an important role in complex decision making, methods such as multi-agent debating frameworks (Liang et al. 2023), in which multiple agents debate to arrive at a final answer, have attracted much attention.
On the other hand, although such frameworks increase the diversity of reasoning through the debate process, the agents are usually restricted to different instances of the same underlying model, ChatGPT. This leads to model-specific biases and means that agents receive insufficient feedback from other models.
To solve these problems, this paper proposes RECONCILE, a multi-agent framework that designs a discussion process among diverse agents. By learning from the varied insights and external feedback of agents built on different language models, it makes complex reasoning problems solvable. This article describes the paper.
RECONCILE: A Group Discuss And Convince Framework
When faced with complex reasoning tasks, humans are known to draw on collective intelligence, also known as the society of mind, for example through group brainstorming.
Inspired by this, this paper proposes RECONCILE, a multi-agent framework that improves reasoning ability by allowing agents generated from multiple LLMs to debate each other.
A diagram illustrating the differences between RECONCILE and existing methods is shown below.
Most existing methods (Self-Refine, Multi-Agent Debate, MAD+Judge) rely on a single model such as ChatGPT, whereas RECONCILE incorporates various models such as ChatGPT, Bard, and Claude2.
In addition, the approach incorporates several innovations to make the discussion effective, such as convincing other agents to improve their answers and incorporating confidence estimation (each agent's own estimate of how likely its answer is to be correct) from all agents.
The figure below shows an overview of RECONCILE with ChatGPT, Bard, and Claude2.
As shown in the figure, RECONCILE operates through the following three phases:
Phase 1: Initial Response Generation
In Phase 1, Initial Response Generation, each agent is instructed to reason step-by-step about a given problem according to the Initial Prompt shown below.
In addition, the agent is asked to estimate a confidence level between 0 and 1 for its generated response.
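As a minimal sketch of what a Phase 1 output might look like, assume each agent returns an answer, an explanation, and a confidence value. The names `AgentResponse` and `make_response` are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class AgentResponse:
    """One agent's output for a question (hypothetical structure)."""
    agent: str
    answer: str
    explanation: str
    confidence: float  # self-reported, expected in [0, 1]


def make_response(agent: str, answer: str, explanation: str,
                  confidence: float) -> AgentResponse:
    """Build a response, normalizing the answer and clamping confidence.

    Clamping into [0, 1] is an assumption of this sketch; the article only
    says that the confidence level lies between 0 and 1.
    """
    clamped = min(1.0, max(0.0, confidence))
    return AgentResponse(agent, answer.strip().lower(), explanation, clamped)
```

Normalizing the answer string makes it easier to compare answers across agents in the later phases.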
Phase 2: Multi-Round Discussion
In Phase 2, Multi-Round Discussion, the Discussion Prompt shown below is presented and multiple rounds of discussion begin between the agents.
In each round of the discussion, every agent revises its own response based on the responses of the other agents in the previous round.
The discussion terminates when a pre-defined stopping criterion is met (e.g., all agents reach a consensus or the maximum round limit is reached).
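The round structure and stopping criteria described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Agent` signature, the `run_discussion` name, and the default of three rounds are all hypothetical, and real agents would call an LLM with the Discussion Prompt instead of being plain functions.

```python
from typing import Callable, Dict, Tuple

# An agent maps (question, other agents' previous answers) to a revised answer.
# This signature is an assumption made for the sketch.
Agent = Callable[[str, Dict[str, str]], str]


def run_discussion(question: str,
                   agents: Dict[str, Agent],
                   initial_answers: Dict[str, str],
                   max_rounds: int = 3) -> Tuple[Dict[str, str], int]:
    """Run discussion rounds until consensus or the round limit is reached."""
    answers = dict(initial_answers)
    rounds_used = 0
    # Stop when all answers agree (consensus) or max_rounds is exhausted.
    while rounds_used < max_rounds and len(set(answers.values())) > 1:
        previous = dict(answers)  # every agent sees the same previous round
        for name, agent in agents.items():
            others = {n: a for n, a in previous.items() if n != name}
            answers[name] = agent(question, others)
        rounds_used += 1
    return answers, rounds_used
```

Taking a snapshot of the previous round before updating keeps the round synchronous, so no agent sees another agent's revision from the current round.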
Phase 3: Final Answer Generation
In Phase 3, Final Answer Generation, once the discussion process is complete, the final answer is determined by a vote among the agents.
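The article only says that the final answer is chosen by a vote. The sketch below assumes that each vote is weighted by the confidence the agent reported, which is one plausible way to combine voting with confidence estimation; the function name `weighted_vote` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def weighted_vote(votes: Iterable[Tuple[str, float]]) -> str:
    """Return the answer with the largest total confidence.

    `votes` holds (answer, confidence) pairs, one per agent.
    Ties are broken in favour of the answer seen first.
    Raises ValueError if `votes` is empty.
    """
    totals = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += confidence
    # max() returns the first maximal key; dicts preserve insertion order.
    return max(totals, key=totals.get)
```

With every confidence set to 1.0, this reduces to a plain majority vote.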
Multi-agent frameworks in existing research rely on a single model, such as ChatGPT, which limits both the complementary opinions obtained from different models and the benefits of ensembling. In contrast, RECONCILE combines multiple models to improve robustness and overall accuracy.
In addition, incorporating confidence estimation into the multi-agent system makes it easier for each agent to improve its arguments and output more convincing answers.
The figure below summarizes the main differences between RECONCILE and existing studies.
As the figure shows, thanks to the innovations described above, RECONCILE covers all of the listed elements, including those that existing studies have not yet implemented.
Experiments
To demonstrate the effectiveness of RECONCILE, the paper conducted experiments using three LLMs: ChatGPT, Bard, and Claude2. (gpt-3.5-turbo-0613 was used for all implementations involving ChatGPT.)
StrategyQA and ECQA were used as datasets to assess reasoning ability, and GSM8K and AQuA to assess mathematical ability. Accuracy and standard deviation were recorded for all tasks.
In addition, the baselines are grouped into the following three categories:
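For readers who want to reproduce this kind of reporting, here is a minimal sketch of computing accuracy and its standard deviation. It assumes exact-match scoring and that the standard deviation is taken over repeated runs; the article does not specify either, and the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def summarize(run_accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs.

    statistics.stdev needs at least two runs.
    """
    return mean(run_accuracies), stdev(run_accuracies)
```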
- Vanilla Single-agent: standard prompts that ask ChatGPT, Bard, and Claude2 to answer questions step-by-step (GPT-4 is also included for comparison)
- Advanced Single-agent: implemented on ChatGPT, covering Self-Refine (SR), which iteratively generates feedback with the model itself and uses that feedback to improve the output; Self-Consistency (SC), which samples multiple reasoning paths and selects the most consistent answer; and a combination of the two (SR+SC)
- Single-model Multi-agent: two recently proposed methods, a multi-agent debate between multiple instances of ChatGPT (Debate) and a variant that adds a judge to monitor the debate process (Judge)
The results of the experiment are shown in the table below.
The most notable aspect of this result is that, on all four datasets, RECONCILE implemented with ChatGPT, Bard, and Claude2 outperforms all single-agent and multi-agent baselines built on these agents.
In addition, the method outperforms GPT-4 (top row) on datasets that require reasoning ability, such as StrategyQA and ECQA.
This result demonstrates that RECONCILE can match or exceed GPT-4 by using three relatively low-performing agents (ChatGPT, Bard, and Claude2), which shows the effectiveness of the framework.
Next, the paper investigates the effect of having the most powerful LLM, GPT-4, participate in multiple rounds of discussion with relatively low-performing agents.
Specifically, the ChatGPT used in the previous experiment was replaced with GPT-4, and the accuracy of each agent was recorded at the end of each discussion round when RECONCILE was implemented with GPT-4, Bard, and Claude2. (The dataset used was StrategyQA.)
The experimental results are shown in the table below.
As shown in the table, the accuracy of each agent improves as the rounds progress, confirming that all models benefit mutually from the discussion.
In particular, GPT-4 recorded a 10% improvement in accuracy, indicating that a powerful agent may be able to enhance its own performance by obtaining useful external feedback from relatively weaker agents.
Summary
How was it? In this article, we described a paper that proposed RECONCILE, a multi-agent framework designed for a discussion process among diverse agents, which makes complex reasoning problems solvable by learning from the varied insights and external feedback of agents built on different language models.
The experimental results revealed that RECONCILE can bring out performance beyond that of GPT-4 by combining agents with relatively low performance, and that GPT-4 performance can be further improved by obtaining external feedback from a variety of agents.
These findings indicate the potential of utilizing a variety of agents in a multi-agent system that solves complex tasks through discussion and provide significant implications for future research.
On the other hand, some issues remain when interpreting the performance results. For example, the LLMs used in this framework are all API-based, which means that their training data and parameter scales are black boxes.
In response to this point, the authors state that these problems will be alleviated in the future with the advent of higher-performance open-source models, so we look forward to future developments.
The details of the framework and experimental results presented here can be found in this paper for those interested.