FacTool: A New Framework For Verifying The Reliability Of Information Generated By Large-scale Language Models

Large Language Models 07/11/2023

3 main points
✔️ A new framework, FACTOOL, is proposed to fact-check information generated by large-scale language models.
✔️ Detects errors in information generated by large-scale language models using various tools (Google Search, Google Scholar, Python, etc.).
✔️ It is applicable to a wide variety of tasks and scenarios and is highly scalable.

FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
written by I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei Liu
(Submitted on 25 Jul 2023 (v1), last revised 26 Jul 2023 (this version, v2))
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Large-scale language models perform well on many tasks, and their use is growing rapidly. However, the text they generate can contain incorrect information. This is a particular obstacle to deployment in fields where errors directly affect people's lives and livelihoods (e.g., medicine, finance, and law). Technology is therefore needed to detect erroneous information in text generated by large-scale language models and to confirm that the text is useful and reliable.

Previous research on detecting incorrect information in text generated by large-scale language models has been specific to particular tasks, which is insufficient for models that are highly general-purpose.

Therefore, this paper proposes a framework called "FACTOOL" that checks whether the text generated by a large-scale language model is correct, independent of task and domain. This framework uses various "tools," such as Google search and Python, to check whether AI-generated text is correct.

What is FACTOOL?

FACTOOL is a scalable framework that can be combined with various tools to check the correctness of the content generated by a large-scale language model. It consists of five steps as shown in the figure below.

Claim extraction: Extract the key points (claims) from text generated by a large-scale language model.
Query generation: Generate queries for gathering evidence about each claim with a suitable tool.
Tool use: Submit the generated queries to the tool.
Evidence collection: Collect the results returned by the tool as evidence.
Match verification: Verify that each claim is consistent with the collected evidence.
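
The five steps above can be sketched as a small pipeline in which each step is a pluggable function. This is only an illustrative sketch, not the authors' implementation: the names run_pipeline, Verdict, and the toy_* helpers are hypothetical, and the "tool" here is a tiny in-memory lookup standing in for something like Google Search. In the actual framework, the extraction, query generation, and verification steps would be handled by a large-scale language model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    claim: str
    evidence: List[str]
    factual: bool


def run_pipeline(
    text: str,
    extract_claims: Callable[[str], List[str]],
    generate_queries: Callable[[str], List[str]],
    tool: Callable[[str], List[str]],
    verify: Callable[[str, List[str]], bool],
) -> List[Verdict]:
    """Run the five steps for every claim found in `text`."""
    verdicts = []
    for claim in extract_claims(text):           # 1. claim extraction
        queries = generate_queries(claim)        # 2. query generation
        evidence: List[str] = []
        for query in queries:
            evidence.extend(tool(query))         # 3. tool use + 4. evidence collection
        factual = verify(claim, evidence)        # 5. match verification
        verdicts.append(Verdict(claim, evidence, factual))
    return verdicts


# Toy stand-ins (assumptions for illustration only).
KNOWLEDGE = {
    "paris is the capital of france": ["Paris is the capital of France."],
}


def toy_extract(text: str) -> List[str]:
    # Treat each sentence as one claim.
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_queries(claim: str) -> List[str]:
    # Use the lowercased claim itself as the only query.
    return [claim.lower()]


def toy_tool(query: str) -> List[str]:
    # Look the query up in the in-memory knowledge base.
    return KNOWLEDGE.get(query, [])


def toy_verify(claim: str, evidence: List[str]) -> bool:
    # A claim counts as supported if any evidence was found.
    return len(evidence) > 0


results = run_pipeline(
    "Paris is the capital of France. Berlin is the capital of Spain.",
    toy_extract, toy_queries, toy_tool, toy_verify,
)
for v in results:
    print(v.claim, "->", "factual" if v.factual else "not supported")
```

Keeping the tool as a plain function argument mirrors the idea that the same pipeline can be paired with different tools (search, a Python interpreter, a scholarly index) depending on the task.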

The paper examines the usefulness of FACTOOL in four tasks: "knowledge-based QA" to answer questions based on existing knowledge, "code generation" to generate new program code, "mathematical problem solving" to solve mathematical problems, and "scientific literature review preparation" to summarize scientific papers.

Experiment 1: Evaluation of claim extraction

This section presents some of the experiments conducted in the paper. The first evaluates the first step of FACTOOL, claim extraction: how accurately can claims be extracted from generated sentences? The evaluation uses the "RoSE" dataset, which contains sets of sentences paired with "claims" extracted from them by experts. The test measures how closely the claims extracted by FACTOOL match those extracted by the experts.

Claims are extracted by FACTOOL using three different models, GPT-4, ChatGPT, and Flan-T5, and the similarity between the claims extracted by each model and those extracted by the experts is measured with four metrics (ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore). The results are shown in the table below and indicate that the claims extracted by FACTOOL are largely consistent with those extracted by the experts.
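
To give a feel for what a metric such as ROUGE-1 measures, the following is a simplified sketch of a unigram-overlap F1 score between an extracted claim and an expert-written claim. It assumes plain whitespace tokenization and lowercasing, with no stemming, so its values may differ from those of standard ROUGE implementations; the function name rouge1_f1 and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


extracted = "the model was released in 2023"
expert = "the model was released in july 2023"
print(round(rouge1_f1(extracted, expert), 3))
```

A score near 1 means the extracted claim uses almost the same words as the expert claim; BERTScore, by contrast, compares contextual embeddings rather than exact tokens.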

Experiment 2: Evaluation of the "FACTOOL" framework

Here we evaluate the performance of FACTOOL using ChatGPT and FACTOOL using GPT-4 on four tasks: knowledge-based QA, code generation, mathematical problem solving, and scientific literature review preparation.

In addition, two baselines are used for comparison with FACTOOL to assess how well a large-scale language model can judge the accuracy of information on its own: "Self-Check with 3-shot CoT" and "Self-Check with zero-shot CoT". In both, the model checks its own output: it is prompted to identify errors in the response, explain them, and correct them. The two differ in how many demonstrations the model is given. In "Self-Check with 3-shot CoT," the model is shown three examples before being asked to perform the check. In "Self-Check with zero-shot CoT," the model is asked to perform the check without any examples. They are labeled Self-Check (3) and Self-Check (0), respectively. These baselines measure how well a model can judge the accuracy of its own output and how much it depends on examples to do so. Accuracy, recall, precision, and F1 scores are reported at both the claim and response levels. Performance on each task is shown in the table below.
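
The difference between claim-level and response-level scores can be illustrated with a short sketch. Two assumptions are made here that the text above does not state: a response is treated as factual only if all of its claims are factual, and the non-factual label is treated as the positive class when computing precision and recall. The labels below are made-up toy data, not results from the paper.

```python
from typing import List, Tuple


def precision_recall_f1(
    gold: List[bool], pred: List[bool], positive: bool
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the class given by `positive`."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def response_label(claim_labels: List[bool]) -> bool:
    # Assumption: a response is factual only if every claim in it is factual.
    return all(claim_labels)


# Toy data: three responses, each a list of per-claim labels (True = factual).
gold_claims = [[True, False], [True, True], [False]]
pred_claims = [[True, False], [True, False], [True]]

# Claim level: flatten all claims across responses.
gold_flat = [c for resp in gold_claims for c in resp]
pred_flat = [c for resp in pred_claims for c in resp]
claim_scores = precision_recall_f1(gold_flat, pred_flat, positive=False)

# Response level: aggregate each response to a single label first.
gold_resp = [response_label(r) for r in gold_claims]
pred_resp = [response_label(r) for r in pred_claims]
response_scores = precision_recall_f1(gold_resp, pred_resp, positive=False)

print("claim-level (P, R, F1):", claim_scores)
print("response-level (P, R, F1):", response_scores)
```

Under this aggregation, a single misjudged claim flips the label of the whole response, which is one reason response-level scores can be lower than claim-level scores.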

First, FACTOOL with GPT-4 shows the best performance in all four scenarios. For knowledge-based QA, it reaches a claim-level F1 of 89.09 and a response-level F1 of 71.79; for code generation, the claim-level and response-level F1 are both 92.11; for math problems, the claim-level F1 is 98.97 and the response-level F1 is 80.36; and for scientific literature review, the claim-level F1 is 95.24 and the response-level F1 is 94.74. These are the highest scores for each task.

In addition, FACTOOL with GPT-4 outperforms Self-Check in all scenarios. This indicates that using tools allows factuality to be assessed more accurately than relying on the model's own ability to identify and correct its errors. The gap is especially large for scientific literature review, which suggests that Google Scholar is far more reliable than a large language model alone for the specific task of finding citations.

Furthermore, FACTOOL with GPT-4 outperforms FACTOOL with ChatGPT in all scenarios. In particular, the query generation and match verification steps of the knowledge-based QA task are difficult for ChatGPT and relatively easy for GPT-4, resulting in a claim-level F1 of 89.09 vs. 81.25 and a response-level F1 of 71.79 vs. 52.63.

Overall, FACTOOL is a useful tool for checking whether text is factual, and it performs better with GPT-4 than with ChatGPT.

Experiment 3: Using FACTOOL to evaluate the factuality of chatbots

Here FACTOOL with GPT-4 is used to examine whether the answers given by various chatbots are factual. Five chatbots are included in the study: GPT-4, ChatGPT, Claude-v1, Bard, and Vicuna-13B.

Each chatbot was given several prompts (questions or instructions), and its responses were evaluated with FACTOOL. The prompts were drawn from the four tasks: "knowledge-based QA," "code generation," "mathematical problem solving," and "scientific literature review writing." Because knowledge-based QA is the most common scenario, it was given three times as many prompts as the other tasks.

FACTOOL is used to evaluate whether the individual claims in a chatbot's answers are factual (claim accuracy) and whether each answer is appropriate overall (response accuracy). This evaluation is weighted: knowledge-based QA responses count for more than those of the other tasks, with each task's weight determined by its share of the total number of prompts.
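
The weighting described above amounts to a prompt-count-weighted average of per-task accuracies. The sketch below shows the arithmetic; the prompt counts and accuracy values are hypothetical placeholders chosen only so that knowledge-based QA has three times as many prompts as each other task, and they are not figures from the paper.

```python
from typing import Dict


def weighted_accuracy(
    per_task_accuracy: Dict[str, float], prompt_counts: Dict[str, int]
) -> float:
    """Average per-task accuracy, weighting each task by its share of prompts."""
    total = sum(prompt_counts.values())
    return sum(
        per_task_accuracy[task] * count / total
        for task, count in prompt_counts.items()
    )


# Hypothetical numbers for illustration only.
prompt_counts = {"kb_qa": 30, "code": 10, "math": 10, "sci_review": 10}
per_task_accuracy = {"kb_qa": 0.8, "code": 0.9, "math": 0.7, "sci_review": 0.6}

score = weighted_accuracy(per_task_accuracy, prompt_counts)
print(round(score, 4))
```

With these placeholder values, knowledge-based QA contributes half of the total weight (30 of 60 prompts), so a chatbot's score is pulled most strongly toward its knowledge-based QA accuracy.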

The results are shown in the table below. GPT-4 scored the highest in both claim accuracy and response accuracy, which indicates that it generates the most factual claims and the most appropriate responses overall.

Summary

The paper presented here proposes a new framework, FACTOOL, for checking whether the information generated by large-scale language models is factual. Because large-scale language models are versatile, are used across many fields and tasks, and can generate long texts, fact-checking their output is not easy. FACTOOL addresses these issues by combining five steps (claim extraction, query generation, tool use, evidence collection, and match verification) with various tools to check facts, and the paper demonstrates its usefulness.

FACTOOL is a versatile framework that can be applied to a variety of tasks, including knowledge-based question answering (QA), code generation, mathematical problem solving, and scientific literature review, and can be extended to many more scenarios. Fact-checking will become increasingly important as the use of large-scale language models grows, and fact-checking frameworks such as FACTOOL may quickly come into demand.

Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
