LLM Learning From Failures, Proposing A New Benchmark "COTERRORSET"
3 main points
✔️ Introducing a new benchmark, COTERRORSET
✔️ Introducing a new learning method for large language models to learn from their own mistakes
✔️ Detailed analysis and categorization of errors to analyze their contribution to model learning and inference accuracy
Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to Boost for Reasoning
written by Yongqi Tong, Dawei Li, Sizhe Wang, Yujia Wang, Fei Teng, Jingbo Shang
(Submitted on 29 Mar 2024)
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Large-scale language models have received considerable attention in recent research for their inferential capabilities. These models have achieved remarkable results in a wide variety of tasks and applications, and in particular, the effectiveness of approaches using the Chain-of-Thought (CoT) prompting method has been confirmed. This method follows a step-by-step approach to problem solving that mimics human logical thinking.
Just as humans learn from past mistakes, it is important for large-scale language models to look back and learn from their mistakes. So far, however, there has been little research on how large-scale language models learn from their mistakes. Therefore, this paper focuses on this point and studies new ways of learning.
The study builds a large dataset, COTERRORSET, containing 609,432 questions from 1,060 different tasks. Each question is built from manually curated correct references and incorrect rationales collected from PaLM2 responses. In addition, the reasons for such errors are obtained by prompting the large-scale language model with the correct references and incorrect answers, and this is used to analyze how the model makes mistakes. The paper also introduces two new approaches, "mistake tuning" and "self-rethinking," to take advantage of these mistakes and improve the ability of large-scale language models; to facilitate the learning process, [CORRECT RATIONALE] and [INCORRECT RATIONALE] prefixes are added to the inputs. This goes beyond traditional supervised learning and proposes a way to make better use of mistakes.
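As a rough illustration of what mistake tuning looks like in practice, the sketch below builds two supervised samples per question, one prefixed with [CORRECT RATIONALE] and one with [INCORRECT RATIONALE]. The record fields and formatting details are assumptions for this sketch, not the paper's exact schema.

```python
# Sketch of building mistake-tuning samples. The [CORRECT RATIONALE] and
# [INCORRECT RATIONALE] prefixes follow the paper; the record fields and
# formatting are illustrative assumptions.

def build_mistake_tuning_samples(record: dict) -> list[dict]:
    """Turn one question with a correct and an incorrect rationale into two samples."""
    question = record["question"]
    return [
        {"input": f"[CORRECT RATIONALE] {question}",
         "target": record["correct_rationale"]},
        {"input": f"[INCORRECT RATIONALE] {question}",
         "target": record["incorrect_rationale"]},
    ]

# Toy usage example.
record = {
    "question": "If a pen costs 3 dollars, how much do 4 pens cost?",
    "correct_rationale": "4 pens cost 4 * 3 = 12 dollars. The answer is 12.",
    "incorrect_rationale": "4 pens cost 4 + 3 = 7 dollars. The answer is 7.",
}
for sample in build_mistake_tuning_samples(record):
    print(sample["input"], "->", sample["target"])
```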
The method proposed in this paper helps a large-scale language model distinguish between correct and incorrect rationales, and contrasting samples are used in the experiments to further deepen its learning process. This allows the model to reconsider and revise its own responses after the initial response. The approach also sets a threshold on the number of "self-rethinking" rounds in order to manage computational resources and prevent potential loops.
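The self-rethinking loop described above might look roughly like the following sketch, where `call_llm` stands in for any text-generation client and the prompt wording, answer extraction, and stopping rule are assumptions rather than the paper's implementation.

```python
# Sketch of a self-rethinking loop with a bounded number of rounds.
# `call_llm(prompt) -> str` is a placeholder for any completion client;
# prompt wording and answer extraction are simplified assumptions.

def extract_answer(rationale: str) -> str:
    """Naive final-answer extraction: take the last line of the rationale."""
    return rationale.strip().splitlines()[-1]

def self_rethinking(question: str, call_llm, max_rounds: int = 3) -> str:
    rationale = call_llm(f"Q: {question}\nLet's think step by step.")
    for _ in range(max_rounds):
        # Ask the model to check its own reasoning for mistakes.
        verdict = call_llm(
            f"Q: {question}\nProposed reasoning:\n{rationale}\n"
            "Does this reasoning contain a mistake? Answer 'yes' or 'no', then explain."
        )
        if verdict.strip().lower().startswith("no"):
            break  # the model stands by its answer, so stop rethinking
        # Otherwise regenerate the rationale, conditioning on the detected mistake.
        rationale = call_llm(
            f"Q: {question}\nYour previous reasoning was flawed:\n{rationale}\n"
            f"Critique:\n{verdict}\nGive a corrected step-by-step solution."
        )
    return extract_answer(rationale)
```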
The figure below is an overview of the pipeline.
In addition, to gain a deeper understanding of how these models learn from mistakes and their ability to do so, we have experimented with a variety of reasoning tasks and large language models of various sizes, which consistently show improved performance. The method allows for the effective exploitation of errors in both the tuning and inference stages of large-scale language models, suggesting its broad applicability and effectiveness. This further extends the usefulness of large-scale language models, and further research is warranted.
COTERRORSET Overview
In this paper, we construct a new benchmark called "COTERRORSET" to investigate the impact of false rationales on the inferential performance of large-scale language models. The dataset covers a wide variety of problem domains (multiple-choice QA, extractive QA, closed-book QA, formal logic, natural language reasoning, and arithmetic reasoning) and is built upon COTCOLLECTION (Kim et al., 2023).
The questions and references in this dataset are drawn from several existing datasets, including:
- QASC (Khot et al., 2020)
- AQuA (Ling et al., 2017)
- GSM8K (Cobbe et al., 2021)
- QED (Lamm et al., 2021)
- StrategyQA (Geva et al., 2021)
- SenseMaking (Wang et al., 2019)
- CREAK (Onoe et al., 2021)
- eSNLI (Camburu et al., 2018)
- ECQA (Aggarwal et al., 2021)
These data are systematically organized, and each task incorporates correct and incorrect responses, as well as a demonstration of why the error occurred. These errors and demonstrations are generated using PaLM2.
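As a rough illustration, a single entry in such a dataset could be represented as follows; the field names are hypothetical and only mirror the components described above.

```python
from dataclasses import dataclass

# Hypothetical representation of a single COTERRORSET-style entry; the
# released dataset may use a different schema.
@dataclass
class CoTErrorEntry:
    task: str                  # source task, e.g. "GSM8K"
    question: str              # the original question
    correct_reference: str     # manually curated correct rationale / answer
    incorrect_rationale: str   # erroneous rationale collected from PaLM2
    error_demonstration: str   # PaLM2's explanation of why the error occurred

entry = CoTErrorEntry(
    task="GSM8K",
    question="A store sells 3 apples for 2 dollars. How much do 9 apples cost?",
    correct_reference="9 apples are 3 groups of 3 apples, so 3 * 2 = 6 dollars.",
    incorrect_rationale="9 apples cost 9 * 2 = 18 dollars.",
    error_demonstration="The price applies to a group of 3 apples, but it was applied per apple.",
)
```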
Unlike traditional CoT datasets, COTERRORSET utilizes mistakes and the rationale behind them with PaLM2. For each question in the dataset, PaLM2 is used to specifically collect the rationale behind the error and provide it along with the correct answer, reflecting in detail how the model makes mistakes. The figure below outlines this process.
Systematic collection of erroneous evidence has the potential to pave the way for future improvements from a new perspective. Specific examples are shown in the table below.
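A minimal sketch of this collection step, assuming a generic `call_llm` helper: keep a model response only when it disagrees with the curated reference, then ask the model to explain why the response went wrong. The prompts and the answer comparison are simplified assumptions, not the paper's exact procedure.

```python
# Sketch of the error-collection step: keep a model response only when its
# final answer disagrees with the curated reference, then ask the model to
# explain why it went wrong. `call_llm` and the prompts are placeholders,
# and the answer comparison is deliberately simplified.

def final_answer(text: str) -> str:
    return text.strip().splitlines()[-1]

def collect_error_rationale(question: str, correct_reference: str, call_llm):
    attempt = call_llm(f"Q: {question}\nAnswer step by step.")
    if final_answer(attempt) == final_answer(correct_reference):
        return None  # no mistake to collect for this question
    demonstration = call_llm(
        f"Question: {question}\n"
        f"Correct reference: {correct_reference}\n"
        f"Incorrect answer: {attempt}\n"
        "Explain why the incorrect answer makes this mistake."
    )
    return {"incorrect_rationale": attempt, "error_demonstration": demonstration}
```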
In addition, a detailed analysis of the COTERRORSET dataset shows that the error types included are very complex and diverse. This complexity poses a challenge for model improvement. To address this, we have introduced an unsupervised clustering approach that leverages a large-scale language model, as shown in the figure below.
This technique makes it possible to classify the various error types into more general categories. First, keywords that cause errors are identified and extracted. Next, these keywords are fed into a large-scale language model, prompting it to form general categories that encompass the entire error. After this automated clustering process, we manually scrutinize each cluster and make adjustments as needed to refine the matching results. Ultimately, error types are merged into several abstract categories, such as "Computational Error," "Numerical Error," and "Logical Error" in arithmetic reasoning, and "Logical Error," "Common Sense Error," "Verbal Error," and "Context Error" in common sense reasoning. An overview is given in the table below.
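A rough sketch of this two-step, LLM-assisted clustering follows, assuming a generic `call_llm` helper; the prompts are illustrative, and the final manual refinement step is not shown.

```python
# Sketch of the two-step, LLM-assisted clustering of error types: extract error
# keywords per example, then ask the model to merge them into broad categories.
# `call_llm` is a placeholder and the prompts are illustrative only; the paper
# additionally refines the resulting clusters by hand.

def extract_error_keywords(error_demonstration: str, call_llm) -> list[str]:
    reply = call_llm(
        "Extract short keywords naming the cause of the error in this explanation, "
        f"as a comma-separated list:\n{error_demonstration}"
    )
    return [kw.strip() for kw in reply.split(",") if kw.strip()]

def cluster_keywords(all_keywords: list[str], call_llm) -> str:
    return call_llm(
        "Group the following error keywords into a few general error categories "
        "(for example: calculation, logic, commonsense) and list which keywords "
        "fall under each category:\n" + ", ".join(sorted(set(all_keywords)))
    )
```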
Experimental results
Self-rethinking" was found to be significantly more effective in improving the performance of the GPT-4 and PaLM2 models compared to the standard Chain of Thought (CoT) method.Theresults are shown in the table below, whichshows PaLM2 performance when using this method.
The table below also shows GPT-4's performance. In particular, the improvement when "self-rethinking" is used stands out, indicating that this is an effective approach to improving GPT-4 performance.
Unlike "self-consistency," this approach achieves high accuracy while minimizing the number of inferences. Specifically, it performs two to three inferences on a question, and if errors are found, the errors are used to derive the final answer. This allows for more efficient problem solving with fewer computational resources than self-consistency.
In particular, on several datasets - GSM8K, AQuA, and LogiQA - it shows superior results compared to "self-consistency" at the same computational cost. However, on the MathQA dataset, which is dedicated to operation-based arithmetic problems, it did not exceed the results of self-consistency, although it did outperform CoT. This suggests that "self-consistency" remains effective for certain problem types, especially complex mathematical problems.
As a demonstration of the effectiveness of "self-rethinking," the table below shows the results of an 8-shot experiment on four tasks using the PaLM2 model: GSM8K, AQuA, MathQA, and LogiQA.
The process collected the erroneous rationales generated by PaLM2 and used them as demonstrations for learning and rethinking. The results confirm that "self-rethinking" has distinct advantages over standard 8-shot CoT. This technique is particularly effective in improving accuracy in few-shot learning scenarios that require complex problem solving.
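A hedged sketch of how such error-aware few-shot prompts might be assembled from COTERRORSET-style entries; the demonstration format and field names are assumptions, not the paper's exact template.

```python
# Sketch of assembling an error-aware few-shot prompt: each demonstration pairs
# a question with a collected mistake, an explanation of the mistake, and the
# correct rationale. Field names and formatting are hypothetical.

def build_demo(entry: dict) -> str:
    return (
        f"Q: {entry['question']}\n"
        f"Incorrect rationale: {entry['incorrect_rationale']}\n"
        f"Why it is wrong: {entry['error_demonstration']}\n"
        f"Correct rationale: {entry['correct_reference']}\n"
    )

def build_few_shot_prompt(demos: list[dict], test_question: str, k: int = 8) -> str:
    shots = "\n".join(build_demo(d) for d in demos[:k])
    return f"{shots}\nQ: {test_question}\nLet's think step by step."
```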
This study also compares against "self-refine," which differs from "self-rethinking" in that it does not utilize samples of previous mistakes. Nevertheless, "self-rethinking" significantly outperformed "self-refine" on most datasets. With "self-refine," the math reasoning dataset (MathQA) showed improvement, while the commonsense reasoning dataset (LogiQA) showed a decrease in performance. In contrast, "self-rethinking" consistently outperformed the 8-shot CoT in a variety of domains, suggesting that incorporating previous errors has a stabilizing effect on the rethinking and elaboration process.
In general, "self-rethinking" allows models to identify fixed logical patterns and learn from errors, especially in situations where logical rigor is required. This is especially useful in tasks that require strong logic and are prone to minor errors. In addition, this approach helps to identify and correct low-level errors and misunderstandings that are within the model's potential but are often overlooked. This capability has shown to serve as a valuable tool for improving the accuracy and reliability of answers in large-scale language models, especially in the context of complex problem solving.
Further results on "mistake tuning" are presented in the table below. This table highlights the impact of tuning Flan-T5 models on a combination of incorrect and correct rationales and shows performance at different model scales.
In particular, Flan-T5-large (780M) in the MathQA domain outperformed PaLM2's 41.37% with an accuracy of 48.95%, demonstrating the effectiveness of this method. This result provides an important indication that large-scale language models can improve problem solving and reasoning ability by leveraging incorrect inferences. Furthermore, this approach not only enhances the understanding of correct CoTs, but also extends the ability to identify and learn from faulty evidence.
The results suggest a new direction for further development of the inference process by not only enhancing the understanding and learning of correct CoTs, but also facilitating the ability to identify and learn from incorrect evidence. Such an approach could be an important tool for improving the accuracy and reliability of large-scale language models, especially when solving complex problems.
Summary
This paper examines whether large-scale language models can learn from their own mistakes. To understand how large-scale language models identify and learn from their mistakes, the authors develop a new benchmark, COTERRORSET, which includes both correct and incorrect rationales. This benchmark was designed through demonstrations that show the process of error creation and collects data from different domains.
The paper also proposes two approaches to evaluate the impact of errors from different perspectives: self-rethinking and mistake tuning. These approaches consistently show significant improvements and reveal the potential benefits of learning from reasoning errors. In particular, the authors provide a detailed analysis of common errors made by large-scale language models in the domains of arithmetic and common sense reasoning, providing clear guidance regarding directions for future research.