
Redefining The Role Of Large-scale Language Models, A New Approach To Planning And Inference Tasks With The LLM-Modulo Framework


Large Language Models

3 main points
✔️ Point out the limitations of large language models in autonomous reasoning and perfect plan generation
✔️ Propose ways to use large language models for advanced cognitive tasks and as an auxiliary tool for problem solving
✔️ Propose LLM-Modulo, a new framework for integrating the capabilities of large language models into planning and reasoning problem solving

LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
written by Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Kaya Stechly, Mudit Verma, Siddhant Bhambri, Lucas Saldyt, Anil Murthy
(Submitted on 2 Feb 2024 (v1), last revised 6 Feb 2024 (this version, v2))
Comments: Published on arxiv.
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.


The large-scale language models that are currently garnering attention are trained on vast amounts of data gathered from across the Internet and demonstrate remarkable linguistic capabilities. These models are expected to handle not just text generation but also advanced cognitive tasks such as complex planning and inference. However, a number of recent studies have gradually revealed their limitations. In practice, they can generate the next word almost instantaneously, but they do not perform the underlying reasoning autonomously. A large-scale language model essentially functions as a giant "System 1" (fast, intuitive thinking, as opposed to the slow, deliberate reasoning of "System 2"): it specializes in predicting the next word rather than making principled inferences.

Nevertheless, research continues on the inferential capabilities of large-scale language models. While some argue that large-scale language models should only be used as "high-level translators," they have the potential to be much more than that. As an approximate, if not perfect, source of information that reflects human knowledge, they can be a valuable resource, especially in solving "System 2" tasks that call for slow, deliberate reasoning.

This paper aims to explore what role large-scale language models can play in planning and reasoning tasks and how useful they can be. Focusing specifically on planning tasks studied in the automated planning community, the paper argues that while large-scale language models cannot plan on their own, they can provide effective assistance when their plan generation is combined with external model-based verification. In other words, the paper emphasizes the benefit of using large-scale language models as an aid to humans and other systems, rather than as the entity that plans and validates. To achieve this, the paper proposes a new framework for planning and reasoning, LLM-Modulo.

By clearing up misconceptions about large-scale language models and understanding their true capabilities and limitations, large-scale language models can be used more effectively to help solve more complex problems. This paper is expected to provide a realistic view of the evaluation of large-scale language models, which oscillates between over-expectation and under-estimation.

Limitations of Large-Scale Language Models

This paper also addresses the limitations of large-scale language models in their ability to plan and self-verify. While there were mixed expectations and much optimism immediately after the release of large-scale language models, recent research has cast doubt on the ability of large-scale language models to plan autonomously and feasibly.

In fact, only 12% of the plans generated by even state-of-the-art large-scale language models such as GPT-4 have been found to be executable without error. This result holds regardless of the version of the large-scale language model.

Furthermore, the performance of the large-scale language model degrades further when the names of actions and objects in the domain are changed. Such changes have no effect on the performance of a standard AI planner. This further indicates that the large-scale language model is more likely to be performing approximate retrieval of plans than actual planning.
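Why renaming leaves a symbolic planner unaffected can be seen in a minimal Python sketch. This is a toy breadth-first planner over integer states, not a PDDL planner and not code from the paper; the action names, the `renaming` table, and the `bfs_plan` helper are all made up for illustration. The planner only ever uses the effects of the actions, so swapping the labels for nonsense words yields the same plan under the new names.

```python
from collections import deque
from typing import Callable, Dict, List, Optional

Actions = Dict[str, Callable[[int], int]]


def bfs_plan(actions: Actions, start: int, goal: int, limit: int = 50) -> Optional[List[str]]:
    """Breadth-first search over integer states; returns a shortest action sequence."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        for name, effect in actions.items():
            nxt = effect(state)
            if 0 <= nxt <= limit and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [name]))
    return None  # goal not reachable within the state limit


# Hypothetical toy domain: two actions with meaningful names.
actions: Actions = {"increment": lambda x: x + 1, "double": lambda x: x * 2}

# The same domain with the names replaced by nonsense words.
renaming = {"increment": "zorp", "double": "blick"}
obfuscated: Actions = {renaming[name]: effect for name, effect in actions.items()}

plan = bfs_plan(actions, start=1, goal=10)
obfuscated_plan = bfs_plan(obfuscated, start=1, goal=10)

print(plan)
print(obfuscated_plan)
```

A system that relies on the surface form of names, as approximate retrieval does, has no such invariance.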

Large-scale language models have also been shown to be limited in their ability to validate their plans and improve them through self-criticism. It is often thought that, even when they cannot produce the correct solution in one shot, large-scale language models can improve accuracy through iterative prompting. This idea rests on the assumption that verifying a solution is easier than generating one, and that assumption has its critics. In particular, if the model is performing approximate retrieval rather than reasoning, the complexity of the inference task has no bearing on its performance, so verification need not be any easier for it than generation.

Recent research has shown that large-scale language models have limited ability to critique and improve upon their own solutions. For example, in solving the graph coloring problem, it has been shown that large-scale language models are not good at solving the problem in direct mode, nor are they good at verifying their answers. Furthermore, it has been reported that self-criticism of one's own answers in iterative mode can lead to incorrect answer choices due to the inability to recognize the correct coloring, thus worsening performance.
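For contrast, an external verifier for graph coloring is short and sound. The sketch below is a minimal Python illustration with a made-up toy graph; the function name `coloring_violations` and the instance are assumptions for this example, not taken from the paper. It checks every edge and reports each problem it finds, which is the kind of feedback an external critic could pass back to the model.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def coloring_violations(edges: List[Edge], coloring: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the coloring is valid."""
    problems = []
    for u, v in edges:
        for node in (u, v):
            if node not in coloring:
                problems.append(f"vertex {node} has no color")
        if u in coloring and v in coloring and coloring[u] == coloring[v]:
            problems.append(f"edge ({u}, {v}) joins two {coloring[u]} vertices")
    return problems


# Toy instance (made up for illustration): a triangle plus one extra vertex.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
good = {"a": "red", "b": "green", "c": "blue", "d": "red"}
bad = {"a": "red", "b": "red", "c": "blue", "d": "blue"}

print(coloring_violations(edges, good))
print(coloring_violations(edges, bad))
```

The check runs in time linear in the number of edges and never accepts an invalid coloring, which is the guarantee that self-critique by the model cannot provide.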

These results indicate that large-scale language models struggle to generate plans, to critique and improve those plans on their own, and then to use the results to fine-tune themselves. These findings highlight the limitations of large-scale language models in planning and provide important points to consider in future research and development.

Recent studies have also shown that large-scale language models cannot guarantee correct planning or its validation, and a closer look at this point helps explain why the literature nevertheless contains so many claims to the contrary. Creating a workable plan requires both the right planning knowledge and the ability to assemble it into an executable plan. However, the general planning knowledge provided by large-scale language models is often confused with executable plans. For example, abstract plans such as "planning a wedding" are easily mistaken for complete plans because they are never actually executed. In fact, even studies that suggest that large-scale language models have planning capabilities either ignore interactions among subgoals in certain domains or tasks, or rely on human intervention to correct the plans.

However, there are ways to effectively utilize large-scale language models. With humans in the loop to validate and refine models, large-scale language models can be a rich source of information about world dynamics and user preferences.

Even with respect to self-verification, the capabilities of large-scale language models are limited. For some tasks a verifier is nearly impossible to build, and others depend on external verification. For example, approaches such as Tree of Thoughts (ToT) rely on iterative back-prompting of the large-scale language model, continuing until a solution acceptable to the external verifier is found; in essence, this is just problem-specific prompt priming. Ultimately, the soundness of the external verifier is what provides any guarantee, and building a sound verifier requires considerable effort.

In response to these challenges, principled frameworks such as "LLM-Modulo" have been proposed. This marks a new trend in the use of large-scale language models as a source of knowledge and shows similarities with past knowledge-based AI systems. Large-scale language models provide a new way to acquire problem-specific knowledge without burdening any particular human expert. However, the question of "how to plan robustly" remains. Through a holistic approach and framework, it is important to understand the limitations of large-scale language models and seek ways to go beyond them.

Robust planning with LLM-Modulo framework

To answer some of the big questions in the area of planning and reasoning, we highlight here the "LLM-Modulo" framework. It questions the ability of large-scale language models to plan and reason on their own, while highlighting the constructive role that large-scale language models play in solving planning and reasoning tasks. The ability of large-scale language models to generate surprising ideas and potential solutions, combined with model-based verifiers and experts, opens up new possibilities. The figure below represents a conceptual diagram of the LLM-Modulo framework.

The framework provides an effective approach to a wide variety of planning and reasoning tasks and focuses on problems that have long been addressed by the automated planning community. The basic structure is a simple but powerful "generate-test-critique" loop in which a large-scale language model generates candidate plans from a problem specification, which are then evaluated by critics. Notably, the plans that the framework outputs are guaranteed sound because they have been vetted by external critics, which yields higher-quality synthetic data that can help further improve the large-scale language model.
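The generate-test-critique loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the large-scale language model is replaced by a stub that returns canned guesses, and the function names (`llm_modulo_loop`, `make_stub_generator`, `order_critic`) and the tea-making toy problem are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]
Critic = Callable[[Plan], List[str]]  # returns complaints; an empty list means "accept"


def llm_modulo_loop(generate: Callable[[str], Plan], critics: List[Critic],
                    problem: str, max_rounds: int = 5) -> Tuple[Optional[Plan], int]:
    """Ask for candidates until every critic is satisfied or the budget runs out."""
    prompt = problem
    for round_no in range(1, max_rounds + 1):
        candidate = generate(prompt)
        feedback = [msg for critic in critics for msg in critic(candidate)]
        if not feedback:
            return candidate, round_no
        # Back-prompt: restate the problem together with the critics' complaints.
        prompt = (problem + "\nPrevious attempt: " + ", ".join(candidate)
                  + "\nProblems: " + "; ".join(feedback))
    return None, max_rounds


def make_stub_generator(canned: List[Plan]) -> Callable[[str], Plan]:
    """Stand-in for an LLM call (assumption): returns canned guesses in order."""
    attempts = iter(canned)

    def generate(prompt: str) -> Plan:
        return next(attempts, [])

    return generate


def order_critic(plan: Plan) -> List[str]:
    """Toy model-based critic: water must be boiled before tea is brewed."""
    if "boil_water" not in plan or "brew_tea" not in plan:
        return ["plan must contain boil_water and brew_tea"]
    if plan.index("boil_water") > plan.index("brew_tea"):
        return ["boil_water must come before brew_tea"]
    return []


guesses = [["brew_tea"], ["brew_tea", "boil_water"], ["boil_water", "brew_tea"]]
plan, rounds = llm_modulo_loop(make_stub_generator(guesses), [order_critic], "Make tea.")
print(plan, rounds)
```

The loop never returns a plan that a critic rejected, so the guarantee comes from the critics rather than from the generator.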

Design considerations emphasize the "generate-test" model, where the large-scale language model interacts directly with external critics. In this way, the large-scale language model is responsible for inferring and generating solutions that satisfy the critics. We also recognize that large-scale language models can contribute not only to candidate plans, but also to domain models, problem reduction strategies, and problem specification refinement. By leveraging these capabilities, large-scale language models can play a variety of roles in the planning process.

Finally, the architecture carefully limits the human role. Domain experts interact with the large-scale language model to elicit models, and end users work with the large-scale language model to refine the problem specification. Direct human involvement in the inner loop of planning is avoided, thereby providing an efficient and workable solution to complex planning problems.

At the heart of this LLM-Modulo framework are "critics" who evaluate the solutions generated by the large-scale language model to the planning and reasoning problems. These critics review the suitability of proposed plans using strict and flexible constraints. Strict constraints include factors that validate the accuracy of the plan, such as causality, timeline accuracy, and proper use of resources. In particular, VAL, a well-known model-based verification method, can be used in PDDL planning problems. Flexible constraints, on the other hand, take into account more abstract factors such as style, explainability, and user preferences.

In this framework, the large-scale language model cannot directly play the role of a rigorous critic, but there is room for it to contribute by mimicking some of the functions of a flexible critic. For example, a style critic can be based on the large-scale language model, while the overall soundness of the framework still derives from the rigorous, model-based critics.

Critics evaluate the suitability of the candidate plan using both rigorous (model-based) and flexible (possibly large-scale language model-based) criteria. If all rigorous critics accept the current plan, it is offered to the end user or executor as a valid solution. If not, the critics' feedback can range from a simple "try again" to detailed feedback pointing out specific problems.
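The division of labor between the two kinds of critics can be sketched as follows. This is a minimal Python illustration, assuming that acceptance depends only on the rigorous critics while flexible critics contribute advisory feedback; the `Critic` class, the `evaluate` function, and the three toy checks are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Plan = List[str]


@dataclass
class Critic:
    name: str
    hard: bool                          # True: rigorous critic that decides acceptance
    check: Callable[[Plan], List[str]]  # returns complaints; empty list means no objection


def evaluate(plan: Plan, critics: List[Critic]) -> Tuple[bool, List[str], List[str]]:
    """Return (accepted, rigorous feedback, flexible feedback) for a candidate plan."""
    hard_msgs: List[str] = []
    soft_msgs: List[str] = []
    for critic in critics:
        msgs = [f"{critic.name}: {m}" for m in critic.check(plan)]
        (hard_msgs if critic.hard else soft_msgs).extend(msgs)
    return not hard_msgs, hard_msgs, soft_msgs


# Toy checks (assumptions for illustration only).
def causal_check(plan: Plan) -> List[str]:
    return [] if plan and plan[0] == "unlock_door" else ["door must be unlocked first"]


def budget_check(plan: Plan) -> List[str]:
    return [] if len(plan) <= 4 else [f"plan uses {len(plan)} steps, budget is 4"]


def style_check(plan: Plan) -> List[str]:
    return ["consider merging repeated steps"] if len(set(plan)) < len(plan) else []


critics = [
    Critic("causality", True, causal_check),
    Critic("resources", True, budget_check),
    Critic("style", False, style_check),  # could be LLM-based in the framework
]

plan = ["unlock_door", "walk", "walk", "open_door"]
accepted, hard_msgs, soft_msgs = evaluate(plan, critics)
print(accepted, hard_msgs, soft_msgs)
```

Here the plan is accepted even though the style critic has a suggestion, since only the rigorous critics can block a plan.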

Large-scale language models also serve as "reformulators" within the LLM-Modulo framework. Since many symbolic model-based verifiers operate on specialized formats, the proposed plan must be converted to these specialized representations. The reformulator module assists in this conversion process. Large-scale language models are adept at reformatting between different syntactic representations, and this ability can be leveraged to help prepare input for the verifier.
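What this conversion step produces can be shown with a small Python sketch. In the framework a large-scale language model would do the reformatting; here a handful of regular expressions stand in for it, which is an assumption made to keep the example runnable. The action names follow a common Blocksworld naming style, and the `PATTERNS` table and `reformulate` function are hypothetical.

```python
import re
from typing import List

# (pattern, template) pairs mapping English steps to parenthesized action syntax.
PATTERNS = [
    (re.compile(r"\bunstack (?:block )?(\w+) from (?:block )?(\w+)", re.I), "(unstack {0} {1})"),
    (re.compile(r"\bstack (?:block )?(\w+) on (?:block )?(\w+)", re.I), "(stack {0} {1})"),
    (re.compile(r"\bpick up (?:block )?(\w+)", re.I), "(pick-up {0})"),
    (re.compile(r"\bput down (?:block )?(\w+)", re.I), "(put-down {0})"),
]


def reformulate(text_plan: str) -> List[str]:
    """Convert a free-text plan, one step per line, into verifier-style actions."""
    actions = []
    for line in text_plan.splitlines():
        line = line.strip()
        if not line:
            continue
        for pattern, template in PATTERNS:
            match = pattern.search(line)
            if match:
                actions.append(template.format(*(g.lower() for g in match.groups())))
                break
        else:
            # Refuse to guess: an unparseable step is reported, not silently dropped.
            raise ValueError(f"cannot reformulate step: {line!r}")
    return actions


text_plan = """1. Unstack block C from block A
2. Put down C
3. Pick up A
4. Stack A on B"""

formal_plan = reformulate(text_plan)
print(formal_plan)
```

Because the output of the reformulator is what the verifier actually sees, a step that cannot be converted is raised as an error rather than skipped.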

The role of the back-prompt (meta) controller in planning and reasoning task resolution is critical. This system centralizes feedback from diverse critics and processes it into improved prompts that allow large-scale language models to generate new ideas and solutions. Especially in situations where there is a mix of flexible and rigorous critics, this meta-controller aggregates the critiques into consistent feedback, resulting in more accurate results.

The processing steps of the back-prompt controller range from simple round-robin selection among the critiques to the creation of summary prompts with the assistance of the large-scale language model, as well as the application of prompt diversification strategies. This allows the large-scale language model to search for the next candidate solution from different regions of the implicit search space. This approach is similar to strategies such as the Tree of Thoughts (ToT) prompting system, which facilitates the exploration of a wider range of possibilities.
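Two of these strategies, round-robin selection and a merged summary, can be sketched in Python. This is a minimal illustration under assumptions: the summary here is a plain deduplicated list rather than one written with the help of a language model, and the function names and the feedback dictionary are hypothetical.

```python
from typing import Dict, List

Feedback = Dict[str, List[str]]  # critic name -> complaints about the current plan


def round_robin_pick(feedback: Feedback, turn: int) -> List[str]:
    """Forward one critic's complaints per round, rotating over critics that objected."""
    vocal = [name for name, msgs in feedback.items() if msgs]
    if not vocal:
        return []
    name = vocal[turn % len(vocal)]
    return [f"[{name}] {m}" for m in feedback[name]]


def merged_summary(feedback: Feedback) -> List[str]:
    """Combine all complaints into one list, dropping verbatim duplicates."""
    seen = set()
    lines = []
    for name, msgs in feedback.items():
        for m in msgs:
            if m not in seen:
                seen.add(m)
                lines.append(f"[{name}] {m}")
    return lines


def back_prompt(problem: str, lines: List[str]) -> str:
    """Build the next prompt from the problem statement and the selected feedback."""
    if not lines:
        return problem
    return problem + "\nFix the following issues:\n" + "\n".join(f"- {l}" for l in lines)


feedback = {
    "causal": ["step 2 precondition unmet"],
    "resource": [],
    "style": ["plan is verbose", "step 2 precondition unmet"],
}

print(back_prompt("Stack A on B.", round_robin_pick(feedback, turn=0)))
print(back_prompt("Stack A on B.", merged_summary(feedback)))
```

Round-robin keeps each prompt short and focused on one critic, while the merged summary gives the model all known problems at once; which works better is an empirical question.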

Once the framework has solved a planning problem, the resulting solution is added to a synthetic data corpus, which is then used to fine-tune the large-scale language model (steps 6 and 7 in the framework diagram). This cycle aims to improve the accuracy of future problem solving.

Behind this approach is the widely accepted principle that fine-tuning on task-specific data can improve a model's reasoning and planning capabilities. For example, fine-tuning a model on solutions to Blocksworld problems can lead to more accurate solutions to similar problems.

However, this attractive technique also presents a significant challenge: where the data used for fine-tuning comes from. One idea is to have the large-scale language model itself generate synthetic data and then fine-tune itself on that data. This would allow the model to form a self-improvement loop and incrementally improve its inference performance.

However, the challenge is that a large-scale language model cannot fully validate its own solutions. In the past, it has been common to use external plan generators to produce reliable synthetic data. The LLM-Modulo framework offers an alternative: because every plan it outputs has passed the external critics, the framework itself becomes a source of synthetic data with correctness guarantees.
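The filtering step behind this idea can be sketched in Python. This is a minimal illustration under assumptions: the toy domain (reach a target number from 0 with "inc" and "double"), the `simulate` and `collect_verified` helpers, and the prompt/completion record layout are all hypothetical and not prescribed by the paper. Only candidates that the verifier accepts are written to the corpus.

```python
import json
from typing import List, Optional, Tuple

Plan = List[str]


def simulate(plan: Plan, start: int = 0) -> Optional[int]:
    """Execute a plan in the toy domain; returns None if an action is unknown."""
    value = start
    for action in plan:
        if action == "inc":
            value += 1
        elif action == "double":
            value *= 2
        else:
            return None
    return value


def collect_verified(candidates: List[Tuple[int, Plan]]) -> List[str]:
    """Keep only (goal, plan) pairs whose plan provably reaches the goal, as JSON lines."""
    records = []
    for goal, plan in candidates:
        if simulate(plan) == goal:
            records.append(json.dumps({
                "prompt": f"Reach {goal} from 0 using inc and double.",
                "completion": " ".join(plan),
            }))
    return records


# Candidate plans as a model might propose them (made up for illustration).
candidates = [
    (5, ["inc", "double", "double", "inc"]),  # 0 -> 1 -> 2 -> 4 -> 5, valid
    (6, ["inc", "double", "double"]),         # ends at 4, rejected
    (3, ["inc", "inc", "jump"]),              # unknown action, rejected
]

corpus = collect_verified(candidates)
for line in corpus:
    print(line)
```

Because the filter is the verifier and not the model, incorrect plans never enter the fine-tuning data.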

As mentioned earlier, the framework avoids human involvement in the iterative prompting of large-scale language models, because that would be too time consuming for humans. Instead, the plan critique process is managed by automated verifiers, either model-based or supported by large-scale language models. Human interaction is limited to "once per domain" and "once per problem" involvement.


This paper offers a new perspective on the current state of the art, avoiding both undue optimism and undue pessimism about the potential of large-scale language models in tasks such as planning and reasoning. It argues that although large-scale language models do not have the ability to plan on their own, they can be key players in planning task resolution when combined with reliable external models. Their primary role is to provide approximate knowledge and to propose candidate plans.

It critiques previous claims that planning and self-verification can be done with large-scale language models alone, and delves into why those claims can be misleading. It also points out how the acquisition of approximate planning knowledge gets confused with the process of creating an executable plan.

Going further, the paper proposes the LLM-Modulo framework as a method to combine the idea generation and knowledge provision capabilities of large-scale language models with external verifiers to create more robust and expressive plans. This framework is an approach that goes beyond the limitations of traditional symbolic planners while retaining their guarantees.

The paper suggests the potential for new "neuro-symbolic" architectures, similar to successful examples such as AlphaGeometry and FunSearch. These examples show that the LLM-Modulo framework could play an important role in the future of planning and reasoning.
