
Meta Develops Toolformer, A Language Model For Learning How To Use Tools

Large Language Models

3 main points
✔️ Large-scale language models have a remarkable ability to solve problems from only a few examples and instructions
✔️ However, simpler tools outperform them at computation and fact-checking

✔️ To get the best of both worlds, the authors propose Toolformer, a language model that teaches itself how to use external tools by expressing tool invocations as text

Toolformer: Language Models Can Teach Themselves to Use Tools
written by Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom
(Submitted on 9 Feb 2023)
Comments: Published on arxiv.

Subjects:  Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Language models fluently produce sentences that read naturally to humans in response to all kinds of questions. However, they are often criticized for stating falsehoods or making calculation errors.

But is this a failing unique to language models?

People, too, say things that are untrue and make calculation errors.

Nevertheless, humans differ from language models in that, when facts matter or calculation errors are unacceptable, they do not rely on hazy knowledge or mental arithmetic; they use search engines and calculators to find what they need to know.

Likewise, if a language model could use tools such as search engines and calculators, rather than relying solely on its own knowledge and abilities, its behavior would come closer to a human's.

This paper proposes Toolformer, a method by which a language model teaches itself how to use external tools through simple APIs (Application Programming Interfaces, the "windows" through which other programs are called), so that it can use tools much as humans do.

The proposed method self-learns which APIs to call, when to call them, what arguments to pass, and how best to incorporate the results into token prediction (the language model's generation of the next word). Only a handful of demonstrations per API are needed to learn its use; the tools invoked through the APIs include a calculator, a Q&A system, two different search engines, a translation system, and a calendar.

The proposed method significantly improves performance on a variety of tasks compared to the model before it learned to use tools. Even relatively small language models, once able to use tools, achieve performance comparable to much larger ones.

Why language models need to learn tool use through APIs

How can a language model learn to use tools?

A language model can only take text as input and produce text as output.

Therefore, tool use itself must also be expressed as text.

An API call is thus written as the following text:

[API name(input to API)] -> API output

Specific examples of API call text

Concrete examples of such text are shown in Figure 1.

Figure 1: Examples of the training data needed to learn tool use

The first example uses the QA API, a question-answering system. Following the token prediction "The New England Journal of Medicine is a registered trademark of", the QA system is asked which organization holds the trademark for The New England Journal of Medicine. The system answers "Massachusetts Medical Society", and the model continues with the token prediction that it belongs to the MMS.

The second example uses the Calculator API. Following the token prediction "Out of 1400 participants, 400 (or", the calculator is called with 400/1400, yielding the result 0.29. Building on that calculation, the model continues with "29%) passed the test".

The third example uses the MT API, a translation system. Following the token prediction "Its name comes from 'la tortuga', the Spanish word for", the translation system is called with "tortuga", yielding the translation "tortoise", which the model then uses to continue the sentence.

The fourth example uses the WikiSearch API, a Wikipedia search tool. WikiSearch is called following the words "The Brown Act is a California law". Searching for the Brown Act returns "The Ralph M. Brown Act is an act of the California State Legislature that guarantees the public's right to attend and participate in meetings of local legislative bodies", and the model continues with the token prediction that it requires legislative bodies, like city councils, to hold meetings open to the public.

Thus, by learning to insert [API name(input to API)] -> API output into text, the model can call the various tools at the right time and reflect their results in its next prediction.
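As a minimal illustration, a marker in this format can be parsed and executed with a short script. The regex, helper names, and toy tool registry below are our own assumptions, not the paper's implementation:

```python
import re

# Minimal sketch (our own, not the paper's code) of finding a marker of the
# form "[ToolName(input)]" in text and appending "-> output" after it.
API_PATTERN = re.compile(r"\[(\w+)\((.*?)\)\]")

def run_tools(text, tools):
    """Replace each "[Tool(input)]" marker with "[Tool(input)] -> result"."""
    def call(match):
        name, arg = match.group(1), match.group(2)
        result = tools[name](arg)                # look up and invoke the tool
        return f"{match.group(0)} -> {result}"
    return API_PATTERN.sub(call, text)

# Toy registry standing in for the real calculator; eval is for illustration only.
tools = {"Calculator": lambda expr: round(eval(expr), 2)}

print(run_tools("Out of 1400 participants, 400 [Calculator(400/1400)] passed.", tools))
```

Here the calculator result 0.29 is spliced into the running text, mirroring the second example in Figure 1.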

Creating a dataset with text for API calls

As explained above, for a language model to learn API calls, the calls must be represented as text. In other words, a dataset must be created in which API-call text is inserted at the appropriate places. The proposed method builds this dataset using the language model's own in-context learning and probability estimates.

Flow of dataset creation, illustrated with a question-answering API call

The flow of dataset creation is shown in Figure 2. We start from some set of sentences and split each sentence into two parts; the goal is to create API-call text to insert between the two halves. First, candidate API calls are sampled from the language model and executed. The useful calls are then selected, and the selected API-call text is inserted into the original sentence. This completes the dataset.
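The steps above (sample, execute, select, insert) can be sketched as a single annotation loop. Everything here, the function names and the stand-in scoring, is a hypothetical outline of the flow in Figure 2, not the paper's code:

```python
# Hypothetical sketch of the dataset-construction loop.
# sample_calls, execute, and score_gain stand in for the real LM sampling,
# tool execution, and probability-based filter described in this article.

def annotate(prefix, suffix, sample_calls, execute, score_gain):
    """Insert the most helpful API call between the two halves of a sentence."""
    best_call, best_gain = None, 0.0
    for call in sample_calls(prefix):                     # LM proposes candidate calls
        result = execute(call)                            # actually run the tool
        gain = score_gain(prefix, suffix, call, result)   # does it help predict the suffix?
        if gain > best_gain:
            best_call, best_gain = call, gain
    if best_call is None:
        return prefix + suffix                            # keep the sentence unannotated
    return f"{prefix}[{best_call} -> {execute(best_call)}] {suffix}"

# Toy stand-ins to illustrate the flow:
sample = lambda prefix: ["QA(What is Pittsburgh also known as?)"]
run = lambda call: "City of Steel"
gain = lambda p, s, c, r: 0.4   # pretend the call raised the suffix probability

print(annotate("Pittsburgh is known as ", "the City of Steel.", sample, run, gain))
```

Sentences where no candidate call helps are kept unannotated, so the dataset never forces a tool call where none is useful.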

Figure 2: Flow of the proposed method (example of API for question-and-answer system)

Specific examples of original text

Specifically, suppose there is the sentence "Pittsburgh is known as the City of Steel". We split this sentence into "Pittsburgh is known as" and "the City of Steel".

Sampling concrete examples of API calls

As inputs for the question-answering API call to insert at this point, candidates such as "What is Pittsburgh also known as?" and "In which country is Pittsburgh located?" are sampled. Specifically, the language model is given a prompt like the following:

"Your task is to add question-answering API calls to the text. The questions should serve to complement the text. The API can be called by writing "[QA(question)]", where "question" is the question you want to ask. Examples of API calls are as follows: Input: ..., Output: ..."

Thus, role prompting, which assigns the model a role, and few-shot prompting, which gives a few examples, allow the language model to generate inputs for API calls.
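Such a few-shot annotation prompt might be assembled as follows. The wording and the in-prompt example are our paraphrase for illustration, not the paper's exact prompt:

```python
# Illustrative reconstruction of the few-shot annotation prompt; the wording
# and the worked example below are assumptions, not the paper's exact text.
PROMPT_TEMPLATE = """Your task is to add question-answering API calls to a piece of text.
You can call the API by writing "[QA(question)]", where "question" is the question you want to ask.
Here is an example of an API call:

Input: Joe Biden was born in Scranton.
Output: Joe Biden was born in [QA(Where was Joe Biden born?)] Scranton.

Input: {text}
Output:"""

def build_prompt(text):
    """Fill the few-shot template with the sentence to be annotated."""
    return PROMPT_TEMPLATE.format(text=text)

print(build_prompt("Pittsburgh is known as the City of Steel."))
```

The language model's completion of the final "Output:" line then yields the sentence with a candidate API call inserted.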

Concrete examples of executing API calls

When these questions are actually fed to the question-answering API, we get "City of Steel" and "United States" as the respective outputs.

Screening API calls and selecting the best one

To screen API calls, we compare the continuation probability "with the API call and its result prepended" against "without the API call".

For example, the API input and its output, "What is Pittsburgh also known as? City of Steel", is prepended to the prefix "Pittsburgh is known as", this is fed to the language model, and the probability of predicting the continuation "the City of Steel" is computed. Since the original sentence can be regarded as correct, the higher the probability of the continuation "the City of Steel" after prepending the API input and output, the more useful the call is judged to be.

If the probability improves compared to the no-prefix case, the API call was worthwhile. Comparing the two candidates, prepending "What is Pittsburgh also known as? City of Steel" raises the probability of predicting "the City of Steel" more than prepending "In which country is Pittsburgh located? United States", so the former is selected as the best "API call and its result".

In this way, the best API call can be selected by checking which API was called, with what input, and where the input/output was inserted in the sentence, and seeing which choice raises this probability the most. Adding the sentence with the best API call inserted to the dataset completes a dataset annotated with API-call text.
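The filtering idea, keep a call only if prepending it makes the original continuation sufficiently more likely, can be sketched with toy numbers. The log-probabilities and threshold below are made up for illustration:

```python
# Toy illustration (numbers made up) of the probability-based filter described
# above: an API call is kept only if prepending the call and its result lowers
# the model's loss on the original continuation by at least a threshold.

def avg_loss(token_logprobs):
    """Average negative log-likelihood of the continuation tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def keep_call(loss_plain, loss_with_call, threshold=1.0):
    # Keep the call only if it reduces the loss by at least the threshold.
    return loss_plain - loss_with_call >= threshold

# Pretend log-probs of "the City of Steel" with and without the QA call prepended.
loss_plain = avg_loss([-2.0, -3.0])       # without the API call
loss_with_call = avg_loss([-0.2, -0.5])   # with "[QA(...) -> City of Steel]" prepended
print(keep_call(loss_plain, loss_with_call))  # prints True: the call clearly helps
```

The threshold keeps only calls that help noticeably, so the fine-tuning data is not flooded with marginal API calls.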

Training on the created dataset (fine-tuning the language model)

The created dataset is used to fine-tune the language model. The paper describes this only briefly as "fine-tuning with a standard language-modeling objective"; any usual fine-tuning procedure appears to suffice, and no special elaboration seems necessary.

The paper additionally emphasizes that, on top of the usual benefits of fine-tuning on such data, the model learns when and how to use the tools. In short, there appears to be no downside to doing it.

Inference with fine-tuned language models

When performing inference with the fine-tuned language model, tokens are predicted as usual until the model generates "->". At that point generation is interrupted, the API call is parsed from the text and the corresponding tool is actually executed, and the execution result followed by "]" is inserted before token prediction resumes.
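A hedged sketch of this decoding loop, assuming a hypothetical generate_until helper that extends the text and reports whether it stopped at the "->" marker; the tool registry is likewise an assumption:

```python
import re

# Hypothetical sketch of Toolformer-style decoding, not the paper's code.
# A pending call looks like "...[Calculator(400/1400) ->" at the end of the text.
CALL_RE = re.compile(r"\[(\w+)\((.*?)\)\s*->$")

def toolformer_decode(generate_until, tools, prompt, max_rounds=5):
    text = prompt
    for _ in range(max_rounds):
        text, stopped = generate_until(text, stop="->")
        if not stopped:                 # the model finished without calling a tool
            return text
        match = CALL_RE.search(text)
        if match is None:               # "->" without a well-formed call; keep going
            continue
        result = tools[match.group(1)](match.group(2))
        text += f" {result}]"           # splice in the result and close the call
    return text
```

After the result and "]" are appended, control returns to the language model, which conditions its next tokens on the tool's output.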

Toolformer evaluation results

The results of evaluating the proposed Toolformer are shown in Figure 3.

Figure 3: Toolformer evaluation results

The evaluation covers three task sets: LAMA, math benchmarks, and QA benchmarks. LAMA is a task of completing short sentences with missing facts (such as dates and places); the math benchmarks consist of arithmetic problems; and the QA benchmarks measure the accuracy of answers to questions.

The comparison targets are Toolformer (the proposed method); Toolformer (disabled), which fine-tunes the same base model on the same dataset but without the API-call text; GPT-3, which is available in multiple versions with different parameter counts; and GPT-J. Both Toolformer and GPT-J have far fewer parameters than GPT-3.

The horizontal axis in Figure 3 is the number of model parameters, showing how Toolformer and Toolformer (disabled) compare across model sizes: specifically 124M, 355M, 775M, 1.6B, and 6.7B (M is millions, B is billions). For reference, GPT-3 has 175B parameters. The vertical axis is the language model's performance; larger is better.

Comparison of GPT-3 and Toolformer

In LAMA, Toolformer outperforms GPT-3 at 6.7B parameters, and on the math benchmarks Toolformer outperforms GPT-3 from 1.6B parameters upward. This suggests that detailed fact-checking and computation are not strengths of large language models, and that simply supplementing these functions with tools can improve performance.

On the other hand, Toolformer falls short of GPT-3 on the QA benchmarks. The paper attributes this to the low quality of the search engine used as the tool in this study, which often failed to return appropriate results for the given questions.

The paper also argues that mastering a search engine requires interacting with it, which was not done here. When people use search engines, they refine their queries based on the results and, if the first page of results lacks the needed information, check the second and third pages in turn. If such interaction were possible, Toolformer's performance might have been comparable to GPT-3's.

Toolformer vs. Toolformer (disabled)

Across LAMA, the math benchmarks, and the QA benchmarks, Toolformer consistently performs as well as or better than Toolformer (disabled). This indicates that adding API-call text to the dataset has no harmful side effects, and that using external tools via API calls yields a real benefit. There was, however, no performance difference at 124M parameters, suggesting that with too few parameters the model either cannot use the tools at all or cannot understand their outputs.


This work showed that a dataset annotated with API-call text can be created by leveraging the language model's own in-context learning and probability estimates, and that fine-tuning on it lets the model use tools through API calls. As a result of being able to use tools, the model achieved performance equal to or better than much larger language models.

Some results showed that even with tools the model did not match the largest language models, but the tool use explored here was simple, and the paper argues there is still room for more effective tool use.

The idea demonstrated here, that anything that can be expressed as text can be learned by a language model, is very powerful, and the potential of language models is still being explored.

We look forward to further developments in research on tool use with language models.
