
Proposal For A New Evaluation Method For AI Assistants Based On Human Preferences

Large Language Models

3 main points
✔️ Proposes two new benchmarks to properly assess "human preferences" for LLM output
✔️ Validates the utility of LLM-as-a-Judge, which uses state-of-the-art LLMs as evaluators to complement human ratings
✔️ Confirms that GPT-4 ratings agree closely with human ratings and are comparably reliable

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
written by Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
(Submitted on 9 Jun 2023 (v1), last revised 24 Dec 2023 (this version, v4))
Comments: 
NeurIPS 2023 Datasets and Benchmarks Track
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

code: https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

In recent years, services built on "chat assistants" and "chatbots" have proliferated. These tools use Large Language Models (LLMs), which have attracted much attention in recent years, and apply supervised fine-tuning and Reinforcement Learning from Human Feedback (RLHF) to improve their ability to converse in natural language and to tune them to human preferences. However, conventional benchmarks are believed to be unable to adequately assess this conformity to human preferences.

Even representative benchmarks such as "MMLU (Massive Multitask Language Understanding)" and "HELM (Holistic Evaluation of Language Models)" have been found unable to adequately capture the performance difference between a model tuned to human preferences and its base model. In other words, there is a large gap between how users rate "chat assistants" and "chatbots" and how conventional benchmarks rate them, so these benchmarks do not work well in practice.

This gap arises because traditional benchmarks can only measure LLM performance on specific tasks (e.g., multiple-choice or retrieval-style questions). In particular, they cannot evaluate open-ended tasks without a single clear objective, such as understanding user intent across multiple turns of dialogue or accommodating people's preferences.

Therefore, this paper proposes two new benchmarks to fill this gap. One is "MT-Bench", which uses open-ended questions to assess a chatbot's ability to hold a conversation and follow user instructions. The other is "Chatbot Arena", a crowdsourced platform on which users converse with two chatbots simultaneously and rate their conversations based on personal preference. The goal is to properly assess how well models align with human preferences, something that conventional evaluation methods tend to overlook.

In addition, because human evaluation of preferences is costly, the paper also tests the usefulness of "LLM-as-a-judge," in which a state-of-the-art LLM serves as the evaluator, as an alternative method.

New Benchmark Proposal

As mentioned above, LLM-based services are used for a variety of purposes, such as text generation, chat, and coding, so LLM evaluation must also take a variety of perspectives into account. However, accurately evaluating the capabilities of these LLMs has been a challenging task. Existing benchmarks primarily evaluate simple tasks with short answers, and cannot adequately evaluate tasks involving complex interactions or open-ended questions.

Existing benchmarks can be categorized into three types. The first is the core-knowledge benchmark, typified by MMLU, HellaSwag, ARC, Winogrande, HumanEval, GSM-8K, and AGIEval. These assess the basic knowledge an LLM has acquired during pre-training and require short, specific answers.

The second is the instruction-following benchmark, typified by Flan, Self-Instruct, NaturalInstructions, and Super-NaturalInstructions. These assess responses to more complex instructions and tasks.

The third is the conversational benchmark, typified by CoQA, MMDialog, and OpenAssistant. These can evaluate conversational performance, but are considered insufficient for measuring the capabilities of modern chatbots.

Although various benchmarks have been published, all of them have difficulty evaluating user preferences and the practical utility of chatbots in human-LLM interaction. To solve this problem, the authors propose two new benchmarks that evaluate human preferences, MT-Bench and Chatbot Arena, aiming to contribute to the development of chatbots and other AI assistants that are easier for people to use.

MT-Bench

The "MT-Bench" is designed to assess the LLM's ability to interact with multiple interactions and follow user instructions, and consists of 80 questions.

It is organized around eight categories: writing, role play, information extraction, reasoning, mathematics, coding, knowledge I (science and engineering), and knowledge II (humanities and social sciences). Each category includes 10 expert-designed, multi-turn questions, allowing a multifaceted assessment of a model's abilities. The table below shows a sample of MT-Bench.
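To make this 8-category × 10-question structure concrete, the sketch below loads the released MT-Bench questions and checks the layout. The file path, field names, and category labels follow the publicly released FastChat llm_judge data and should be treated as assumptions here, not as definitions from the paper itself.

```python
# Sketch: inspect MT-Bench's layout (8 categories, 10 two-turn questions each, 80 total).
# File path, field names, and category labels are assumptions based on the public
# FastChat llm_judge repository.
import json
from collections import Counter

CATEGORIES = [
    "writing", "roleplay", "extraction", "reasoning",
    "math", "coding", "stem", "humanities",
]

def load_mt_bench_questions(path: str = "data/mt_bench/question.jsonl"):
    """Each line is a JSON object with a category and a list of two conversation turns."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

if __name__ == "__main__":
    questions = load_mt_bench_questions()
    print(len(questions))                             # expected: 80
    print(Counter(q["category"] for q in questions))  # expected: 10 questions per category
    print(questions[0]["turns"])                      # the two turns of the first question
```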

Chatbot Arena

Chatbot Arena" is a crowdsourced benchmarking systemthat allowsusersto interact withtwo chat models (model names are hidden) simultaneously and ask each the same question. Users can then compare their answers and vote on which one is better. The model names are published after the voting.The figure below showsthe Chatbot Arena dashboard.

Chatbot Arena is not limited to pre-defined questions: users can pose questions freely, allowing evaluation in line with real-world use cases. The platform had been in operation for one month at the time of the study and had collected approximately 30,000 votes reflecting user preferences.
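The paper aggregates these crowdsourced pairwise votes into a leaderboard using Elo ratings. The snippet below is a minimal sketch of a standard online Elo update for a single vote; the K-factor, initial rating, and model names are illustrative defaults rather than the paper's exact settings.

```python
# Minimal sketch of the online Elo update used to turn pairwise votes into a ranking.
# K-factor, initial rating, and model names are illustrative, not the paper's settings.
from collections import defaultdict

INITIAL_RATING = 1000.0
K = 32.0  # step size of each rating update

ratings = defaultdict(lambda: INITIAL_RATING)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(model_a: str, model_b: str, outcome: float) -> None:
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - e_a))

# Example: three votes between two anonymized models
update_elo("model_x", "model_y", 1.0)   # user preferred model_x
update_elo("model_x", "model_y", 0.5)   # tie
update_elo("model_x", "model_y", 0.0)   # user preferred model_y
print(dict(ratings))
```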

Evaluating The Usefulness Of The Benchmarks And The LLM Judge

As mentioned in the introduction, this paper also examines LLM-as-a-Judge, which substitutes an LLM for human evaluators when rating outputs against human preferences. To this end, MT-Bench and Chatbot Arena are used to investigate the extent to which the ratings of various LLM judges agree with human ratings.
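As a rough illustration of what an LLM judge does, the sketch below asks a strong model to compare two candidate answers to the same question and return a verdict. It is a minimal example assuming the OpenAI Python client; the prompt wording, model name, and function are illustrative, not the exact templates or code used in the paper.

```python
# Minimal sketch of pairwise LLM-as-a-judge, assuming the OpenAI Python client.
# Prompt wording and model name are illustrative, not the paper's exact templates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Compare the two AI assistant answers
to the user question below and decide which one is better.
Reply with exactly one of: "A", "B", or "tie".

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str, model: str = "gpt-4") -> str:
    """Return 'A', 'B', or 'tie' according to the judge model's verdict."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judgments
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return response.choices[0].message.content.strip()
```

In practice, the paper also runs each comparison with the answer order swapped to mitigate position bias, a step omitted here for brevity.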

The MT-Bench study is a smaller-scale experiment conducted under controlled conditions to investigate not only the agreement between LLM and human ratings, but also the agreement among human ratings themselves. Six LLMs (GPT-4, GPT-3.5, Claude-V1, Vicuna-13B, Alpaca-13B, LLaMA-13B) were used to generate answers to the 80 questions.

These responses were then compared both by LLM judges and by human raters. The human evaluations were conducted primarily by graduate students with expertise in the relevant fields, and roughly 3,000 votes were collected in total. The LLM judges evaluated all response pairs, while the humans evaluated responses to more than 20 randomly selected multi-turn questions.

The Chatbot Arena study is a larger-scale experiment than MT-Bench. A large number of participants were recruited via crowdsourcing on the Internet to investigate the degree of agreement with LLM judges. Chatbot Arena holds about 30,000 data points, from which 3,000 votes were randomly selected. Eight LLMs (GPT-4, GPT-3.5, Claude, Vicuna-7B/13B, Koala-13B, Alpaca-13B, LLaMA-13B, and Dolly-12B) were included in the evaluation. The judgments came from LLM judges and from the votes of participants (2,114 unique IP addresses) collected over the Internet.

Two evaluation measures are used: "Agreement" and "Average Win Rate". Agreement is the probability that two randomly selected judges (for example, GPT-4 and a human) give the same verdict on a randomly selected question; Average Win Rate indicates how often one LLM's answers are preferred over another's.
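As a simplified sketch of how these two measures can be computed from a table of votes, the code below assumes each vote records the question, the judge, the pair of models, and the winner ('model_a', 'model_b', or 'tie'). The paper's exact definition additionally distinguishes setups that include or exclude tie votes, which the include_ties flag here only approximates.

```python
# Sketch of the two evaluation measures, assuming each vote records which judge voted,
# the pair of models compared, and the winner ('model_a', 'model_b', or 'tie').
votes = [
    # (question_id, judge, model_a, model_b, winner) -- toy data for illustration
    (1, "gpt-4",  "vicuna-13b", "llama-13b", "model_a"),
    (1, "human1", "vicuna-13b", "llama-13b", "model_a"),
    (1, "human2", "vicuna-13b", "llama-13b", "tie"),
    (2, "gpt-4",  "vicuna-13b", "llama-13b", "model_b"),
    (2, "human1", "vicuna-13b", "llama-13b", "model_b"),
]

def agreement(votes, judge_x, judge_y, include_ties=True):
    """Fraction of questions on which two judges cast the same vote."""
    by_question = {}
    for qid, judge, a, b, winner in votes:
        by_question.setdefault(qid, {})[judge] = winner
    matches, total = 0, 0
    for verdicts in by_question.values():
        if judge_x not in verdicts or judge_y not in verdicts:
            continue
        vx, vy = verdicts[judge_x], verdicts[judge_y]
        if not include_ties and "tie" in (vx, vy):
            continue  # roughly corresponds to counting only non-tie votes
        total += 1
        matches += (vx == vy)
    return matches / total if total else float("nan")

def average_win_rate(votes, model):
    """Fraction of non-tie votes involving `model` that `model` wins."""
    wins, total = 0, 0
    for _, _, a, b, winner in votes:
        if model not in (a, b) or winner == "tie":
            continue
        total += 1
        wins += (winner == "model_a" and model == a) or (winner == "model_b" and model == b)
    return wins / total if total else float("nan")

print(agreement(votes, "gpt-4", "human1"))    # 1.0 on this toy data
print(average_win_rate(votes, "vicuna-13b"))  # 0.5 on this toy data
```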

Assessment Results

The table below shows the results of the agreement analysis on MT-Bench. "G4-Pair" and "G4-Single" denote the evaluation methods: G4-Pair uses pairwise comparison, in which two responses are judged against each other, while G4-Single grades a single response in isolation. "S1" and "S2" denote the setting: S1 counts all three vote types (non-tie, tie, and inconsistent), while S2 counts only non-tie votes. For each setting, the agreement expected from random voting is shown as "R=". In each cell, the top number is the agreement and the gray number below it is the number of votes.

The results show that GPT-4 agrees very strongly with human ratings. For both pairwise comparison and single-answer grading, GPT-4's agreement reaches 85% under S2, exceeding the 81% agreement between humans. The results also suggest that judgments made by GPT-4 may help improve those made by people.

The table below shows the results of the agreement analysis on Chatbot Arena. "G4" indicates evaluation by pairwise comparison using GPT-4, "G4-S" evaluation by single-answer grading using GPT-4, "G3.5" evaluation by pairwise comparison using GPT-3.5, and "C" evaluation by pairwise comparison using Claude. Finally, "H" indicates the results from human evaluators. These results show the same trend as MT-Bench.


In both tables above, GPT-4's single-answer results agree strongly with both its pairwise-comparison results and human preferences, indicating that GPT-4 applies stable evaluation criteria and can reasonably serve as a substitute for human evaluators.

For a more detailed analysis, agreement is also computed for various LLM pairs and question categories. The graph below shows the comparisons between LLMs and the agreement between GPT-4 and human ratings for each of them. Each point on the graph represents a pair of two different LLMs; only non-tie votes, in which one of the two LLMs was judged better, are counted. In other words, only votes where it is clear which model won are considered.

The X-axis (win rate difference) shows the difference between the win rates of the two LLMs; a larger difference means one LLM is clearly better than the other. The Y-axis (agreement) shows how well GPT-4 and human ratings agree: the higher the agreement, the more consistent GPT-4's ratings are with human judgments.

As the difference in win rates between a pair of LLMs increases, the agreement between GPT-4 and humans improves from about 70% to nearly 100%. This indicates that GPT-4 agrees with humans more often when there is a clear performance difference between the LLMs.
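A hedged sketch of this per-pair analysis, under the same assumed vote format as above, is shown below: for each model pair it keeps only non-tie votes, computes the pair's win-rate difference, and measures how often the GPT-4 verdict matches the human verdict on the same question. Judge labels are illustrative, and one human verdict per question is assumed for simplicity.

```python
# Sketch of the per-pair analysis behind the graph: keep only non-tie votes, then
# compute each pair's win-rate difference and the GPT-4 vs. human agreement on it.
# Vote format and judge labels are illustrative assumptions.
from collections import defaultdict

def pair_statistics(votes):
    """votes: iterable of (question_id, judge, model_a, model_b, winner),
    where winner is 'model_a', 'model_b', or 'tie'.
    Returns {(model_x, model_y): (win_rate_difference, gpt4_human_agreement)}."""
    stats = {}
    pairs = {tuple(sorted((a, b))) for _, _, a, b, _ in votes}
    for pair in pairs:
        pair_votes = [v for v in votes
                      if tuple(sorted((v[2], v[3]))) == pair and v[4] != "tie"]
        if not pair_votes:
            continue
        # Win-rate difference between the two models of the pair
        wins = defaultdict(int)
        for _, _, a, b, winner in pair_votes:
            wins[a if winner == "model_a" else b] += 1
        win_rate_diff = abs(wins[pair[0]] - wins[pair[1]]) / len(pair_votes)
        # Agreement between the GPT-4 judge and human judges, question by question
        verdicts = defaultdict(dict)
        for qid, judge, a, b, winner in pair_votes:
            role = "gpt4" if judge == "gpt-4" else "human"
            verdicts[qid][role] = a if winner == "model_a" else b
        shared = [v for v in verdicts.values() if "gpt4" in v and "human" in v]
        agree = (sum(v["gpt4"] == v["human"] for v in shared) / len(shared)
                 if shared else float("nan"))
        stats[pair] = (win_rate_diff, agree)
    return stats
```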

Summary

When evaluating AI assistants such as "chat assistants" and "chatbots" based on large language models, existing benchmarks have been unable to adequately assess how well they conform to human preferences (i.e., whether they can produce output that is easy for people to use).

To solve this problem, this paper proposes two new benchmarks, MT-Bench and Chatbot Arena, that can evaluate human preferences. Furthermore, it utilizes LLM-as-a-judge so that human preferences can be evaluated automatically.

Experimental results confirm that high-performance LLMs, such as GPT-4, show very high agreement with human ratings and are as reliable as human ratings.

In addition, the questions and votes used in the benchmarks, as well as conversation data reflecting human preferences (roughly 30,000 votes), are publicly available (https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge).

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
