New "ToolQA" Dataset: Assesses The Ability Of Large Language Models To Solve Problems With External Tools

3 main points
✔️ Developed a new dataset, ToolQA, to assess how effectively large-scale language models use external tools.
✔️ Large language models show limited performance on the difficult problems in ToolQA and exhibit characteristic error patterns.
✔️ The ability to use external tools is expected to improve further by having large language models learn how to use them.

ToolQA: A Dataset for LLM Question Answering with External Tools
written by Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, Chao Zhang
(Submitted on 23 Jun 2023)
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Large language models have shown strong performance on a wide range of natural language processing tasks, and their versatility has led to practical applications in many fields, with ChatGPT as a prominent example. However, several problems remain. One is "hallucination": the model produces information that does not exist. Another is weak numerical reasoning: these models are said to be poor at solving complex problems that involve numerical values.

External tools are currently being used to augment large language models and address these problems. For example, the Wolfram plug-in can strengthen numerical reasoning, and tools connected to fact-checked databases let the model refer to correct information and reduce hallucinations.

However, when external tools are used, it has been difficult to determine whether the large language model solved a problem with internal knowledge acquired during training or with external data obtained through the tools. As a result, it has not been possible to evaluate, for example, whether external tools are being used appropriately.

Therefore, this paper introduces a new dataset, ToolQA, to assess how well large language models use external tools. In ToolQA, all questions are designed so that they can be answered only by using the appropriate tools to obtain external data. This minimizes the likelihood that a large language model will answer from its learned internal knowledge and makes it possible to assess its ability to use external tools.

What is ToolQA?

ToolQA is a dataset designed to assess the ability of large language models to answer questions using external tools. It covers eight topics, and each instance is represented as a tuple of question, answer, reference data, and tools. The reference data is queryable external data, represented as a text corpus, tabular database, or graph. To obtain reference data, 13 tools are provided for text search, database manipulation, code interpretation, mathematical calculation, and so on. The questions included in ToolQA cannot be answered using only the learned internal knowledge of a large language model and require the use of tools to obtain reference data. The table below shows the statistics of ToolQA.

ToolQA question-answer pairs are created in the three steps shown in the figure below. First, in (a) Reference Data Collection, reference data that large language models have not been trained on is collected. Next, in (b) Human-Guided Question Generation, questions that require the reference data are generated by a large language model under human guidance. Finally, in (c) Programmatic Answer Generation, the correct answers to the questions created in (b) are generated automatically.

(a) Reference Data Collection

This section covers (a) in the figure above. The purpose of the dataset is to evaluate the ability of large language models to use external tools, so we need to guarantee that the model actually uses them. In other words, we need to ensure that the large language model cannot answer the questions using only the internal data it was trained on. To that end, three criteria are set for collecting external data.

  • External data does not duplicate the internal data of the large-scale language model.
  • External data should include context-dependent facts to generate questions that cannot be answered by the internal data of the large-scale language model alone.
  • Large-scale language models should be able to retrieve all necessary information from external data and answer questions correctly.

We then collect external data from six perspectives.

  • Time: To create questions involving time series, we use datasets that contain up-to-date information, such as "Flights" and "Coffee". Examples are the latest flight schedules and coffee prices, which large language models do not hold as internal data.
  • Spatial: To create questions involving geographic or location information, we use datasets such as "Yelp" (restaurant ratings) and "Airbnb" (lodging information). This allows us to create questions with spatial context, such as questions about information tied to a specific location.
  • Math: To create mathematical questions, we use the "GSM8K" dataset. The questions used are ones that ChatGPT cannot answer with its own mathematical abilities.
  • Science: To create questions about scientific concepts and data, we use the "SciREX" dataset. It covers scientific areas in which ChatGPT tends to fabricate information.
  • Personal: To create questions about personal information and schedules, we use the "Agenda" dataset, which contains virtual names and events generated by ChatGPT. This allows us to create questions about personal data while protecting the privacy of individuals.
  • Social: To create questions about social relationships, we select the most recent data from the "DBLP" dataset and build a graph showing the relationships between authors and papers. This allows us to create questions based on social relationships that large language models do not know.

In addition, there are 13 tools available for large-scale language models to obtain information from these external data, as listed in the table below.

Text Tools, the category for handling text information, offers two text retrieval tools, "Agenda Retriever" and "SciREX Retriever". "Agenda Retriever" retrieves personal data from the "Agenda" dataset, while "SciREX Retriever" retrieves scientific data from "SciREX".

Database Tools, the category for handling information in databases, includes "Database Loader", "Data Filter", and "Get Value". "Database Loader" loads data from a local tabular database. "Data Filter" filters the database by column name, relation, and value (e.g., "Date=2022-10-15"). "Get Value" returns all values for a specific column in the database.
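To make the division of labour among the three database tools concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names `load_db`, `filter_db`, and `get_value`, the toy rows, and the restriction to equality conditions of the form "Column=Value" are all assumptions made for illustration.

```python
from typing import Any

Row = dict[str, Any]

# Hypothetical toy table standing in for a local tabular database.
FLIGHTS: list[Row] = [
    {"Date": "2022-10-15", "Origin": "LAX", "Dest": "MDW"},
    {"Date": "2022-10-15", "Origin": "ATL", "Dest": "JFK"},
    {"Date": "2022-10-16", "Origin": "LAX", "Dest": "MDW"},
]


def load_db(name: str) -> list[Row]:
    """Database Loader (sketch): return a copy of the named table's rows."""
    tables = {"flights": FLIGHTS}
    return list(tables[name])


def filter_db(rows: list[Row], condition: str) -> list[Row]:
    """Data Filter (sketch): keep rows matching a 'Column=Value' condition.

    Only equality is handled here; other relations are out of scope.
    """
    column, value = condition.split("=", 1)
    return [row for row in rows if str(row[column]) == value]


def get_value(rows: list[Row], column: str) -> list[Any]:
    """Get Value (sketch): return every value stored in one column."""
    return [row[column] for row in rows]


# Chain the three tools: load, filter by date, then read one column.
rows = filter_db(load_db("flights"), "Date=2022-10-15")
origins = get_value(rows, "Origin")
print(origins)
```

Each call consumes the previous call's output, which is the pattern a model has to reproduce when it chains these tools on easy questions.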

Math Tools, the category for mathematical computation, offers the "WolframAlpha Calculator". It treats an input string as a mathematical formula and evaluates it, covering calculations such as the four basic arithmetic operations, averages, and maximum values.

Graph Tools, the category for handling graph information, includes "Graph Loader", "Neighbour Checker", "Node Checker", and "Edge Checker". "Graph Loader" loads a graph from a local file. "Neighbour Checker" lists all the neighboring nodes of a query node in the graph. "Node Checker" and "Edge Checker" return detailed attribute information for query nodes and edges, respectively.
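The graph tools can be pictured with a small in-memory graph. The sketch below is illustrative only: the node names, attributes, and the helper functions `neighbours`, `check_node`, `check_edge`, and `coauthors` are hypothetical and are not taken from the paper or from DBLP.

```python
# Hypothetical author-paper graph: nodes and undirected edges with attributes.
NODES = {
    "author:A": {"type": "author"},
    "author:B": {"type": "author"},
    "author:C": {"type": "author"},
    "paper:P1": {"type": "paper", "year": 2023},
    "paper:P2": {"type": "paper", "year": 2022},
}
EDGES = {
    ("author:A", "paper:P1"): {"relation": "wrote"},
    ("author:B", "paper:P1"): {"relation": "wrote"},
    ("author:A", "paper:P2"): {"relation": "wrote"},
    ("author:C", "paper:P2"): {"relation": "wrote"},
}


def neighbours(node: str) -> list[str]:
    """Neighbour Checker (sketch): list all nodes sharing an edge with `node`."""
    found = set()
    for u, v in EDGES:
        if u == node:
            found.add(v)
        elif v == node:
            found.add(u)
    return sorted(found)


def check_node(node: str) -> dict:
    """Node Checker (sketch): return the attributes of a node."""
    return NODES[node]


def check_edge(u: str, v: str) -> dict:
    """Edge Checker (sketch): return the attributes of the edge between u and v."""
    if (u, v) in EDGES:
        return EDGES[(u, v)]
    return EDGES[(v, u)]  # edges are undirected; raises KeyError if absent


def coauthors(author: str) -> list[str]:
    """Two-hop query: authors who share at least one paper with `author`."""
    found = set()
    for paper in neighbours(author):
        found.update(neighbours(paper))
    found.discard(author)
    return sorted(found)


print(neighbours("paper:P1"))
print(coauthors("author:A"))
```

The `coauthors` helper shows why a social question usually needs more than one tool call: the answer is two hops away from the query node.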

Code Tools, the category for handling code, includes the "Python Interpreter" and the "SQL Interpreter", which interpret and execute Python code and SQL queries, respectively. These tools also receive data from other tools (Text Tools, Database Tools, etc.) and convert it into the appropriate format (SQL query, Python code, etc.). The converted data is then used by the next step (or another tool) to extract the necessary information or perform more complex operations. In this way, the "Python Interpreter" and "SQL Interpreter" act as a bridge for exchanging information between the various tools and the large language model, which lets the model combine tools to manipulate information and generate the final answer.
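As a rough illustration of what an SQL interpreter tool does, the sketch below runs a model-written query against an in-memory SQLite database using Python's standard `sqlite3` module. The `coffee` table, its columns, the prices, and the wrapper name `sql_interpreter` are assumptions for the example, not details from the paper.

```python
import sqlite3

# In-memory database with a hypothetical coffee price table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coffee (date TEXT, close REAL)")
conn.executemany(
    "INSERT INTO coffee VALUES (?, ?)",
    [("2022-01-03", 101.5), ("2022-01-04", 103.0), ("2022-01-05", 99.5)],
)


def sql_interpreter(query: str) -> list[tuple]:
    """Execute a query string and return all result rows."""
    return conn.execute(query).fetchall()


# A single query does the filtering and aggregation that would otherwise
# take several database-tool calls.
result = sql_interpreter(
    "SELECT MAX(close) FROM coffee WHERE date >= '2022-01-04'"
)
highest_close = result[0][0]
print(highest_close)
```

Because the query is free-form text, a wrong column name or condition fails only at execution time, which is one way argument errors in code tools can arise.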

Finally, "Finish" is provided under System Tools. It takes the output generated by the other tools and parses it. For example, it analyzes the information obtained by the text search tools, the results from databases, and the results of executed Python code and SQL queries, and uses them to form the final answer. Its role is to interpret the results obtained and present them in the most appropriate form, so that the user receives a clear and understandable final answer to the query.

(b) Human-Guided Question Generation

Let's look at (b) below. There are two possible ways to generate questions: one is to have experts create questions about the external data, and the other is to have a large-scale language model generate questions about the external data. Relying solely on experts can produce high-quality questions, but it is labor-intensive, time-consuming, and difficult to scale. Relying solely on large-scale language models makes scaling up easier, but may produce low-quality questions, such as questions that cannot be answered. Furthermore, the questions generated by a large-scale language model may include questions that can be answered using only the internal data of the large-scale language model. Therefore, this paper proposes a method to prepare question templates and generate questions in the large-scale language model with a human guide.

First, we prompt ChatGPT with "Based on the given information, please generate some question templates and suggest corresponding answers." and let it generate candidate question templates from the external data. Next, a human verifies and selects the question templates that cannot be answered from the internal data of the large language model but can be answered using the external data. The figure below shows question templates for the external data "Flight".

Also shown below is a question template for the external data "Yelp".

After question templates are manually selected, values are sampled from the external data and automatically embedded in the template to generate concrete questions. For example, given the template "Was the {Date} flight from {Origin} to {Dest} cancelled?", the values "LAX", "MDW", and "01/09/22" are sampled from the external data "Flight" and embedded in the template to generate the question "Was the 01/09/22 flight from LAX to MDW cancelled?". In addition, questions are categorized as easy or difficult according to their difficulty level. The figure below is a sample of easy questions for the external data "Airbnb".

Below is a sample of difficult questions for the reference data "Airbnb".
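The template-filling step described in this section can be sketched in a few lines of Python. This is a minimal illustration, assuming the reference data is available as a list of rows whose keys match the template placeholders; the rows and the function name `generate_question` are hypothetical.

```python
import random

TEMPLATE = "Was the {Date} flight from {Origin} to {Dest} cancelled?"

# Hypothetical rows standing in for the "Flight" reference data.
FLIGHT_ROWS = [
    {"Date": "01/09/22", "Origin": "LAX", "Dest": "MDW", "Cancelled": False},
    {"Date": "01/10/22", "Origin": "ATL", "Dest": "JFK", "Cancelled": True},
]


def generate_question(template: str, rows: list[dict], rng: random.Random):
    """Sample one row and embed its values in the template.

    Returns the question and the sampled row, so that the answer can later
    be computed from the same values.
    """
    row = rng.choice(rows)
    # str.format ignores keys the template does not use (here "Cancelled").
    return template.format(**row), row


question, sampled_row = generate_question(TEMPLATE, FLIGHT_ROWS, random.Random(0))
print(question)
```

Keeping the sampled row alongside the question is a design choice of this sketch: it makes the later answer-generation step a pure function of the same values.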

(c) Programmatic Answer Generation

This is about (c) in the figure below. Here, answers to the questions generated in (b) are generated.

Two elements are used to generate answers: the "operator" and the "tool chain". An "operator" is a piece of the program that performs a specific operation, such as retrieving information from a database. A "tool chain" is a piece of the program that combines multiple "operators" to produce the final answer. For example, the necessary information (date, place name, etc.) is first taken from the question, and a series of "operators" is then executed through the "tool chain" based on that information to obtain the final answer. The figure below is a sample. The sample code answers the question "What percentage of flights from {place of departure} on {flight date} were delayed?".
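The operator and tool-chain idea can be sketched as follows. This is not the code shown in the paper's figure; it is a minimal Python illustration with hypothetical rows, column names, and function names, in which two small operators are composed by one tool-chain function to answer the delayed-flights question.

```python
# Hypothetical flight records; the column names are assumptions.
FLIGHTS = [
    {"FlightDate": "2022-01-09", "Origin": "LAX", "Delayed": True},
    {"FlightDate": "2022-01-09", "Origin": "LAX", "Delayed": False},
    {"FlightDate": "2022-01-09", "Origin": "LAX", "Delayed": False},
    {"FlightDate": "2022-01-09", "Origin": "LAX", "Delayed": True},
    {"FlightDate": "2022-01-10", "Origin": "LAX", "Delayed": True},
    {"FlightDate": "2022-01-09", "Origin": "ATL", "Delayed": True},
]


def select_flights(rows: list[dict], flight_date: str, origin: str) -> list[dict]:
    """Operator: keep flights departing from `origin` on `flight_date`."""
    return [
        r for r in rows
        if r["FlightDate"] == flight_date and r["Origin"] == origin
    ]


def count_delayed(rows: list[dict]) -> int:
    """Operator: count the flights marked as delayed."""
    return sum(1 for r in rows if r["Delayed"])


def percent_delayed(rows: list[dict], flight_date: str, origin: str) -> str:
    """Tool chain: compose the operators and format the final answer."""
    selected = select_flights(rows, flight_date, origin)
    if not selected:
        raise ValueError("no flights match the given date and origin")
    pct = 100 * count_delayed(selected) / len(selected)
    return f"{pct:.1f}%"


answer = percent_delayed(FLIGHTS, "2022-01-09", "LAX")
print(answer)
```

Because the answer is computed directly from the reference data, it stays correct for whatever values are sampled into the question template.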


ToolQA is used to evaluate four approaches. The first is ChatGPT alone: questions are entered into ChatGPT and its responses are taken as the final answers. The second is ChatGPT with Chain-of-Thought (CoT) prompting. The third is Chameleon, a multi-tool problem-solving method that uses a large language model as a controller; when Chameleon is used with ToolQA, its tools are the ones provided in this paper. The fourth is ReAct, implemented with gpt-3.5-turbo and with text-davinci-003 (reported as ReAct (GPT-3.5) and ReAct (GPT-3), respectively). The table below shows the results of each method on easy questions.

The table below shows the results of each method for difficult questions.

ChatGPT alone and ChatGPT with CoT have low success rates (< 10%) on both easy and difficult questions. ReAct has the best success rate.

A comparison of the results for easy and difficult questions also shows that all methods perform significantly worse on difficult questions. The method with the best performance on easy questions has an average success rate (Average) of 43.1%, but that number drops to 8.2% on difficult questions. Furthermore, a comparison of the two versions of ReAct shows that ReAct (GPT-3) performs better than ReAct (GPT-3.5) on easy questions, while the opposite is true for hard questions.

Analysis of experimental results

The following analysis focuses on ReAct (GPT-3.5), which showed the highest performance on the difficult questions in ToolQA. The most common errors are argument errors: cases where a tool or program is called with an incorrect value for an argument it needs to function correctly. As shown in the figure below, argument errors account for 44.56% (168/377) of the error cases on easy questions and 48.23% (210/436) of those on hard questions.

The pattern of argument errors also differs between easy and difficult questions. Errors in database-related tools (LoadDB, FilterDB, GetValue) are more frequent on easy questions: 120, compared with 95 on difficult questions. For code-related tools (SQL, Python), on the other hand, the number of errors is about 10 times higher on difficult questions than on easy ones. This is likely because the solution logic for difficult questions is more complex and cannot be fully inferred from the context alone, so large language models tend to rely on their understanding of code and programming concepts to solve them. For easy questions, in contrast, the large language model tends to follow the patterns provided in the context and combine database operations to arrive at a solution.

The analysis also reveals that large language models have difficulty finding the appropriate external data. For example, as shown in the figure below, when data containing time information is needed, such as "Flight", "Coffee", "Airbnb", and "Yelp", a different dataset, "Agenda", is frequently referenced by mistake (the color is lighter in the agenda column in the figure below).

Also, the large-scale language model tends to be confused about data related to scientific domains such as "SciREX" and "DBLP" because it cannot properly determine which of these data to refer to (the color in the "scirex" column in the figure above is lighter). This indicates that large language models have difficulty properly determining which data should be used to answer a question, which also contributes to the error.

In addition, prompts for large language models augmented with external tools commonly include a description and usage examples for each tool, but as the number of tools grows the problem becomes more complex, and it becomes difficult to include examples for every combination of tools. As a result, large language models must in some cases find relationships between tools that are not covered by the human-provided examples. The paper calls this "Innovation". However, finding rules that were not provided by a person also carries a high risk of hallucination. The figure below illustrates this phenomenon in a case where a large language model refers to "Coffee" data to answer a difficult question.

In the difficult question in Figure (a) above, ReAct (GPT-3) strictly follows the operations displayed in the context and, as a result, fails. Conversely, ReAct (GPT-3.5) identifies the SQL interpreter as an alternative to the database operations, based on the relevance of the tool, when the database operations repeatedly fail. However, when there is such "Innovation," the risk of "Hallucination" is also high. As shown in Figure (b) above, when answering another difficult question from the "Coffee" data, ReAct (GPT-3.5) seems to hallucinate something that is not present in the tool's execution results (yellow areas).

We reviewed all errors that occurred in the ReAct (GPT-3.5) model and the breakdown of errors in both easy and difficult questions is summarized in the figure below.


This paper proposes ToolQA, a dataset that evaluates the ability of large language models to use external tools. It also evaluates standard large language models and large language models extended with external tools on ToolQA. The results show that even the best-performing baseline achieves only limited performance on the difficult questions in ToolQA. They also show that current large language models extended with external tools tend to make errors such as incorrect tool calls and use of the wrong data source. The research team expects that fine-tuning large language models on a dataset of external tool usage will improve their ability to use external tools, and plans further research in this direction.
