GAIA: New Benchmark Reveals Limitations Of Large-Scale Language Models

3 main points
✔️ Propose a new benchmark, GAIA, that can evaluate the performance of AI assistants using 466 questions involving everyday tasks and scientific problems.
✔️ Current large-scale language models have rich knowledge and fluent sentence generation, but there are challenges in how they are evaluated for real-world tasks and complex problems.
✔️ Evaluation with GAIA shows that advanced models such as GPT-4 score low, revealing their limitations for complex real-world tasks.

GAIA: a benchmark for General AI Assistants
written by Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
(Submitted on 21 Nov 2023)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In recent years, large-scale language models have begun to show their potential as versatile models that can be used for a variety of purposes. As evidenced by ChatGPT and Bard, which are already in use by many, modern models have a wealth of knowledge, can generate sentences fluently, and can be adapted to human preferences. Moreover, these models can be combined with plug-ins for web browsing, code interpretation, and more to achieve greater sophistication.

At the same time, however, how to evaluate these evolving models is a major open question. Large-scale language models keep demonstrating new capabilities and achieving SOTA performance on various benchmarks, and the current trend is to test them on tasks that are difficult even for humans, such as advanced assessments in science or law, or writing a coherent book. However, tasks that are difficult for humans are not necessarily difficult for large-scale language models.

This situation calls for rethinking the benchmarks used to evaluate new AI models. One possible approach is to have an AI solve tasks that are conceptually simple but require a series of complex actions to be executed precisely. Such a task counts as solved only when it has been executed successfully, and its output is easy to verify. The tasks given to AI assistants in particular are rooted in real-world use cases and meet this criterion.

Therefore, this paper proposes a new benchmark, GAIA, focusing on AI assistants.

GAIA consists of 466 questions/answers and associated design methodologies. These questions are relatively simple to create, challenging for AI models, and have unique, factual answers, allowing for easy and robust automated evaluation.

Existing benchmarks for large-scale language models are specific and limited to closed, synthetic environments, whereas real tasks inherently require browsing the open and changing web, handling multimodality, and reasoning across multiple steps. GAIA aims to provide a more real-world environment for evaluation.

In fact, even large-scale language models that perform well on tasks difficult for humans perform poorly on GAIA. Even with plug-ins, GPT-4 fails to exceed a 30% success rate on the easiest tasks and shows a 0% success rate on the most difficult ones. In contrast, the average success rate for humans is 92%. A model that achieves high performance on GAIA would therefore mark an important milestone toward the next generation of AI models.

What is GAIA?

GAIA is a benchmark for AI assistants. It consists of 466 human-designed questions; these questions are text-based, but some are accompanied by images, spreadsheets, or other files. They cover a variety of assistant use cases, including everyday personal tasks, science, and general knowledge. The questions are short and designed to have a single correct answer, making them easy to validate.

Below are sample GAIA questions. Completing these tasks requires basic competence in reasoning, handling multimodality, and techniques for using plug-ins. Some questions reflect real-world use cases and include additional material such as images.

Assessment in GAIA is automated and designed to be quick and factual. Each question calls for an answer that is a string (one or a few words), a number, or a comma-separated list of strings or floating-point numbers, and there is only one correct answer. The evaluation is therefore a quasi-exact match between the model's answer and the ground truth. As shown in the figure below, a prompt informs the model of the required format. A scoring function and leaderboard are also provided.
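To make the idea of a quasi-exact match concrete, here is a minimal sketch in Python. It is not the official GAIA scoring function: the function names and the normalization rules (ignoring case, punctuation, whitespace, and the decorations `$`, `%`, and thousands separators in numbers) are assumptions made for illustration only.

```python
import re
import string


def _is_number(text):
    """Return True if text parses as a float as written."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def _to_number(text):
    """Parse a number after dropping '$', '%' and ','; return None on failure."""
    cleaned = text.strip().replace("$", "").replace("%", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def _normalize_text(text):
    """Lowercase the text and drop punctuation and all whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", "", text)


def _match_item(predicted, expected):
    """Compare one item: numerically if the expected value is a number,
    otherwise as normalized strings."""
    if _is_number(expected):
        value = _to_number(predicted)
        return value is not None and value == float(expected)
    return _normalize_text(predicted) == _normalize_text(expected)


def quasi_exact_match(predicted, expected):
    """Hypothetical scorer: True if the model answer matches the ground truth."""
    if "," in expected and not _is_number(expected):
        # Comma-separated list: same length and every element must match.
        pred_items = predicted.split(",")
        exp_items = expected.split(",")
        if len(pred_items) != len(exp_items):
            return False
        return all(_match_item(p, e) for p, e in zip(pred_items, exp_items))
    return _match_item(predicted, expected)
```

Under these assumed rules, `quasi_exact_match("Paris.", "paris")` and `quasi_exact_match("$1,234", "1234")` both count as correct, while a list with a missing element does not.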

Performance Evaluation by GAIA

GAIA uses a prefix prompt that tells the model the format in which to answer each question. With this approach, the authors evaluate GPT-4 (with and without plug-ins) and AutoGPT, which uses GPT-4 as a backend. Currently, GPT-4 requires manual plug-in selection, while AutoGPT can do this automatically.
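The role of such a prefix prompt can be sketched as follows. The prompt wording, the `FINAL ANSWER:` marker, and the helper names `build_prompt` and `extract_final_answer` are illustrative assumptions, not the exact prompt or code used by the authors.

```python
import re

# Hypothetical prefix prompt; the wording is an assumption for illustration.
PREFIX_PROMPT = (
    "You are a general AI assistant. Answer the question below and finish "
    "with a line of the form 'FINAL ANSWER: <answer>'. The answer must be a "
    "number, as few words as possible, or a comma-separated list of numbers "
    "and/or strings."
)

# Captures the rest of the line that follows the marker.
_FINAL_ANSWER = re.compile(r"FINAL ANSWER:\s*(.+)", re.IGNORECASE)


def build_prompt(question):
    """Prepend the format instructions to a question."""
    return f"{PREFIX_PROMPT}\n\nQuestion: {question}"


def extract_final_answer(response):
    """Return the text after the last 'FINAL ANSWER:' marker, or None."""
    matches = _FINAL_ANSWER.findall(response)
    return matches[-1].strip() if matches else None
```

Fixing the output format in the prompt is what makes automatic scoring possible: the extracted string can be compared directly with the reference answer, and a response without the marker can simply be counted as wrong.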

For GPT-4 with plug-ins, depending on the task, the user manually selects either the "advanced data analysis mode", which provides code execution and file reading, or a set of three third-party plug-ins: a link reading tool, a web browsing tool, and a calculation tool. However, a stable set of plug-ins cannot currently be maintained over time, because GPT-4 plug-ins change frequently and disappear from the store. The GPT-4 scores with plug-ins are therefore treated as "oracle" estimates, that is, estimates obtained under ideal circumstances. Human annotators and web search are also used as baselines for comparison. For web search, a question is entered into a search engine to see whether the answer can be derived from the first page of results.

The results of the GAIA evaluation of GPT-4 (with and without plug-ins) and AutoGPT are shown in the figure below; the difficulty levels proposed in the GAIA correlate with the performance of the current models, supporting their validity. Humans score well on all levels, while large language models, which are considered the best performers today, score very low.

Human web search may find the right answer to a Level 1 question but does not work for slightly more complex queries, which indicates the potential for AI assistants to compete with search engines. Comparing GPT-4 without plug-ins to the other results shows that extending a large language model with plug-ins and web access improves answer accuracy. AutoGPT-4, which lets GPT-4 use tools automatically, is disappointing by comparison: at Level 2, and even at Level 1, it scores lower than GPT-4 without plug-ins. The scores obtained for each task are also shown in the figure below.
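Per-level results of this kind are simply success rates computed separately for each difficulty level. A minimal sketch, assuming each graded question is recorded as a `(level, is_correct)` pair (a hypothetical data layout, not the authors' format):

```python
from collections import defaultdict


def success_rate_by_level(results):
    """Map each difficulty level to its success rate in percent.

    results: iterable of (level, is_correct) pairs, one per graded question.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in results:
        totals[level] += 1
        correct[level] += int(is_correct)
    return {
        level: 100.0 * correct[level] / totals[level]
        for level in sorted(totals)
    }


# Toy data for illustration only; these are not results from the paper.
toy_results = [(1, True), (1, False), (2, False), (3, False)]
rates = success_rate_by_level(toy_results)
```

Reporting the rates per level rather than as one overall number is what allows the comparison above, where a system can do reasonably at Level 1 and still fail completely at the hardest level.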


This paper reviews benchmarks for large-scale language models, with a focus on AI assistants, and proposes a new benchmark called GAIA.

Unlike traditional benchmarks, GAIA does not target one specific capability; it consists of diverse and challenging questions rooted in the real world that are conceptually simple but may be tedious for humans. Interpretability is also taken into account: the limited number of carefully selected questions makes the benchmark easy to use, and the conceptual simplicity of the tasks (92% human success rate) makes it easy to trace the model's reasoning. In addition, GAIA is designed to be harder to game than traditional benchmarks. Completing a task requires planning and accurately executing several steps, and because the tasks are varied and admit many possible action sequences, they cannot be solved by brute force, which makes cheating less likely to work.

The answers to GAIA questions are factual, concise, and unambiguous, which allows for easy, quick, and factual assessment.

However, there are some challenges: the performance of a model like GPT, which can only be accessed through the API, can change over time, so an evaluation at a particular point in time may not be reproducible later. In addition, the plug-ins for ChatGPT change periodically and are not accessible through the API, making it even more difficult to reproduce evaluations.

Furthermore, GAIA contains many hand-picked questions whose validity may diminish over time as they become outdated or the web content they rely on disappears. GAIA questions must also be clear and unambiguous, which requires multiple annotators and makes the process costly. Finally, all GAIA questions are in English, so the benchmark does not cover non-English-speaking populations or the non-English web. This means that GAIA can only evaluate the usefulness of AI assistants in an English-language setting.

While GAIA is a useful benchmark, it has several limitations: repeatability of the evaluation process, deterioration of questions over time, cost of question design, and lack of language and cultural diversity. These limitations should be understood and taken into account for future improvements.
