Catch up on the latest AI articles

Tree Of Thought: Incorporating System 2 Concepts Into LLM, Even More Powerful Than CoT-based GPT-4

Tree Of Thought: Incorporating System 2 Concepts Into LLM, Even More Powerful Than CoT-based GPT-4

Large Language Models

3 main points
✔️ Inspired by the System 2 concepts sought by Yoshua Bengio et al, exploration, strategic foresight, and self-assessment have been incorporated into LLM
✔️ It performs multiple-choice exploration via ToT (tree of thoughts) as opposed to CoT (chain of thoughts) and shows transcendent performance on tasks difficult for GPT-4 It has shown transcendent performance even on tasks that are difficult for GPT-4.
✔️ It can be combined with commercial LLMs, and has potential applications due to its low computational cost for a complex system.

Tree of Thoughts:Deliberate Problem Solving with Large Language Models
written by Shunyu Yao,Dian Yu,Jeffrey Zhao,Izhak Shafran,Thomas L. Griffiths,Yuan Cao,Karthik Narasimhan
(Submitted on 17 May 2023)
Comments: Code repo with all prompts: this https URL
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.


One of the reasons why LLMs such as ChatGPT (GPT-3.5, GPT-4) are able to respond to questions as if they were being answered by a human (with extensive knowledge), incorporating the operator's intentions, is that when LLM models exceed a certain level and become large, they are able to understand the chain of thoughts (CoT : Chain of Thoughts). One of the reasons for this is that when the LLM model exceeds a certain level and becomes large, it is able to understand the chain of thoughts (CoT: Chain of Thoughts). Furthermore, by including the chain of thoughts in the instructions to the model, a more desirable response can be obtained (prompt engineering). There have been numerous presentations and studies on prompt engineering and a survey of them has been published.

This paper seeks to add System 2 processing to the System 1 capabilities realized by Chain of Thoughts to allow NLP to perform more human-like reasoning. Specifically, we introduce a framework called Tree of Thoughts (ToT) that extends the capabilities of language models by enabling intentional problem solving through exploration, strategic anticipation, and self-evaluation. This allowed the language model to consider multiple inference pathways, make global choices, and backtrack when necessary, improving its problem-solving abilities in tasks such as 24 games, creative writing, and mini-crosswords.


Language models (LMs) such as GPT and PalM are based on an autoregressive mechanism of text generation in which token-level decisions are made one at a time, left to right. This paper raises the question of whether such simple mechanisms are sufficient to build LMs for general problem solvers and what problems challenge the current paradigm. The paper examines the literature on human cognition as a clue to answering these questions.

In the study of human cognition, dual-process theory suggests that there are two modes in which people engage in decision making: a fast, automatic, unconscious mode ("System 1") and a slower, deliberate mode of consciousness ("System 2"). These two modes have been linked to various mathematical models used in machine learning. (A well-known keynote talk at NIPS2019 by Yoshua Bengio) For example, research on reinforcement learning in humans and other animals has explored situations that address associative "model-free" learning and more deliberative "model-based" planning. token-level by simple association in LM. selection is reminiscent of "System 1," so augmenting it with a more deliberative "System 2" planning process that (1) maintains and explores a variety of options rather than just selecting the current one and (2) actively looks ahead and backwards to evaluate the current situation and make more global decisions We believed that there would be advantages in.

System 1 System 2
intuitive introspective
early (in the day, etc.) late (e.g. "late at night")
unconscious conscious
Response with bias normative response
contextual abstract
automatic regulatory
associative rule-based
empirical consequentialist

To design such a planning process, we returned to the origins of artificial intelligence (and cognitive science) and drew inspiration from the planning process explored by Newell, Shaw, and Simon since the 1950s. Newell and colleagues characterized problem solving as the exploration of a combinatorial problem space represented as a tree. The authors therefore propose the Tree of Thoughts (ToT) framework for general problem solving with language models: as Fig. 1 shows, ToT actively maintains a tree of thoughts, whereas existing methods sample continuous linguistic sequences for problem solving, each thought becomes a coherent linguistic sequence that serves as an intermediate step toward problem solving (Table 1). These high-level semantic units allow the LM to self-evaluate how the different intermediate thoughts are progressing toward problem solving through an intentional reasoning process that is also instantiated in the language (Figs. 2, 4, 6). This implementation of the search and discovery method (heuristic) through LM self-assessment and deliberation is novel compared to previous search and discovery methods, which were either programmed or learned. Finally, this language-based ability to generate and evaluate a variety of thoughts is combined with search algorithms such as breadth-first search (BFS) and depth-first search (DFS) to enable systematic exploration of trees of thoughts using look-ahead and backtracking.

For evaluation purposes, we propose three new tasks (24 games, creative writing, and crosswords) that are empirically challenging even with existing LM reasoning methods, including the GPT-4, a state-of-the-art language model (Table 1). These tasks require deductive, mathematical, common sense, and lexical reasoning skills, as well as methods that incorporate systematic planning and exploration.

The ToT was shown to significantly increase the problem-solving ability of language models in these tasks. The authors also analyzed how different alternatives affect model performance through systematic carving, and discussed future directions for better training and use of language models.


First, we list some existing methods for using large-scale language models to solve problems.

Input-output (IO) prompts are the most common way to convert input x to output y: y ∼ pθ( y|promptIO (x)), promptIO (x) wraps input x with a few shots of task instructions or input-output examples. For simplicity, let prompt θ (output | input) = (output | prompt(input)), so that IO prompts can be formulated as y ∼ pIO θ (y|x).

Chain of Thought (CoT) prompts were proposed to address cases where the mapping from input x to output y is non-trivial (e.g., where x is a mathematical question and y is the final numerical answer). The key idea is to introduce a chain of thoughts z1, - - , zn bridging x and y, where each zi is a coherent linguistic sequence that serves as a meaningful intermediate step toward solving the problem (e.g., zi can be an intermediate equation in a math QA). to solve a problem in CoT, each thought zi ∼ pCoT θ (zi | x, z1--i-1) is sampled sequentially, and then the output y ∼ pCoT θ (y|x, z1---n). In practice, [z1--n, y] ∼ pCoT θ (z1---n, y|x) is sampled as a continuous linguistic sequence, leaving the decomposition of thoughts (e.g., whether each zi is a phrase, sentence, or paragraph) ambiguous.

CoT-SC (self-consistent with CoT) is an ensemble approach that samples k independent identically distributed chains of thoughts: [z(i) 1---n, y(i)] ∼ pCoT θ (z1---n, y|x) (i = 1 - - k), and the most frequent output: arg maxy #{i | y (i) = y}, and returns the most frequent output: arg maxy #{i | y(i) = y}. CoT-SC improves CoT because there are generally different thought processes for the same problem (e.g., different ways to prove the same theorem), and by exploring a richer set of thoughts, output decisions can be made more faithful. However, within each chain, different thought steps cannot be explored locally, and the "mode-maximum" discovery method is only applicable when the output space is limited (e.g., multiple-choice QA).

Thinking Tree: Intentional Problem Solving by LM

To understand the highlighted text, it is important to understand the concepts of systematic planning and search. Systematic planning involves breaking down a complex problem into smaller, more manageable parts and developing a plan to solve each part in a logical and organized manner. Search algorithms are used to explore the problem domain and find the best solution. In the context of a language model, systematic planning and searching uses models to generate and evaluate different possible solutions to a problem.

1. decomposition of thoughts Whereas CoT samples thoughts in a coherent form without explicit decomposition, ToT utilizes problem characteristics to design and decompose intermediate thinking steps As shown in Table 1, depending on the problem, a thought may be a few words (crosswords), a line of equations (24 games), a sentence plan of an entire paragraph (Creative Writing). In general, thoughts should be "small" so that the LM can generate a promising and diverse sample, but "large" so that prospects for problem solving can be assessed (e.g., generating a single book is usually "large" and not coherent).

2. given a thought generator G(pθ, s, k) tree state s = [x, z1---i], consider two strategies that generate k candidates for the next thought step:

(a) Sample of independent identically distributed thoughts from CoT prompts (Creative Writing, Fig. 4): z(j) ∼ pCoT θ ( zi+1|s ) = pCoT θ ( zi+1|x, z1---i ) (j = 1 - - k). This leads to a diversity of independent identically distributed samples when the thought space is rich (e.g., when each thought is a paragraph).

(b) Propose thoughts sequentially using "propose prompt" (Game of 24, Figure 2; Crosswords, Figure 6): [z(1), - - , z(k)] ∼ ppropose θ (z(1---k) i+1 | s). This method is more effective when the thought space is more constrained (e.g., when each thought is only a word or line), since it avoids duplication by proposing different thoughts in the same context.

3. state evaluator V (pθ, S) Given a frontier of different states, the state evaluator acts as a discovery method for the search algorithm to evaluate their progress toward solving the problem and to determine which states to continue exploring and in which order. Discovery methods are a standard approach to solving search problems, typically either programmatic (e.g., DeepBlue) or learned (e.g., AlphaGo). The authors propose a third option by using LM to intentionally infer states. Such intentional discovery methods, when applicable, are more flexible than programmed rules and more sample efficient than learned models. As with the thought generator, we consider two strategies to evaluate states independently or together:

(a) Evaluate each state independently: v ( , S)(s) ∼ pvalue θ (v|s) ∀s∈S where the value prompt explains why for state s to produce a scalar value v (e.g. 1-10) or a classification (e.g. certain/possible/impossible) that can be changed to a value heuristically. The rationale for such evaluative reasoning may vary depending on the problem and thought step. In this study, a small number of anticipatory simulations (e.g., quickly checking that 5 + 5 + 14 yields 24, or that "hot l" means "inn" by filling in the " " in " ") and common sense (e.g., 1 2 3 is too small to be 24, or "tzxc" no word begins with "tzxc") to explore the evaluation. The former may promote "good" states, while the latter may help eliminate "bad" states. Such evaluations need not be perfect; they need only be approximate.

(b) Voting across states: v ( , S)(s) = 1[s = s∗] where the "good" state s* ∼ pvote θ (s∗|S) is voted for after intentionally comparing different states in S at the voting prompt. When it is difficult to directly evaluate the success of a problem (e.g., the coherence of a passage), it is natural to instead compare different partial solutions and vote for the most promising one. In other words, "which state to explore" is used as a multiple-choice QA, and the LM sample is used to vote.

Either strategy can prompt the LM multiple times to tally values or replace time/resources/costs with more faithful/robust discovery methods in the voting results.

4. search algorithms Finally, the ToT framework allows plug-and-play with various search algorithms depending on the tree structure.

(a) Breadth-first search (BFS) (Algorithm 1) maintains a set of b most promising states per step. It is used for 24 games and creative writing where the tree depth is limited (t ≤ 3) and initial thinking steps can be evaluated and pruned to a smaller set (b ≤ 5).

(b) Depth-first search (DFS) (Algorithm 2) searches for the most promising state first until the final output is reached (t > T ) or until the state evaluator decides that solving the problem from the current s is impossible (V (pθ, {s})(s) ≤ vth for a value threshold vth) The latter case is the sub-tree from s. In the latter case, subtrees from s are pruned to replace search and utilization. In both cases, DFS backtracks to the parent state of s and continues the search.

The ToT framework is designed to support systematic planning and retrieval by managing a tree of thoughts. Each thought is a coherent linguistic sequence that serves as an intermediate step toward problem solving; the LM can self-evaluate how the various intermediate thoughts are progressing toward problem solving through a deliberate reasoning process that is also instantiated in the language. This allows the LM to consider several different reasoning paths, self-evaluate its choices, and determine its next course of action. It also allows the LM to make global choices, looking ahead or backward as needed.

Conceptually, ToT offers the following advantages as a general problem-solving technique using LM

(1) Generality IO, CoT, CoT-SC, and self-refinement can be viewed as special cases of ToT (i.e., trees of limited depth and width; Fig. 1).

(2) Modularity Starting with the base LM, the thought decomposition, generation, evaluation, and search procedures can be varied independently.

(3) Adaptability Able to respond to differences in problem characteristics, LM capabilities, and resource constraints.

(4) Convenience No extra training is required; pre-trained LMs are sufficient.

That said, the ToT is flexible and can be adapted to different levels of thinking, different methods of generating and evaluating thoughts, and different problem-specific search algorithms. In other words, the framework can be customized to meet the needs of different tasks and problems.


We propose three tasks that are difficult to sample from the state-of-the-art language model GPT-4 using standard IO and CoT prompts.

24 games

The game of 24 is a mathematical reasoning task that uses four numbers and basic arithmetic operations (+-*/) to obtain 24. For example, given an input of "4 9 10 13", the game can output the answer "(10 - 4) * (13 - 9) = 24".

Task Setup We scrape data from and have 1,362 games classified by human solving time from easy to hard, with a subset of relatively hard games indexed 901-1,000 for testing. For each task, success is considered when the output is a valid equation equal to 24 and each input number is used exactly once; the success rate over 100 games is used as the measure.

Baseline We used a standard input-output (IO) prompt with five in-context examples. In the Chain of Thought (CoT) prompt, each input-output pair is augmented with three intermediate equations, each of which operates on the remaining two numbers. For example, given the input "4 9 10 13," the thoughts "13 - 9 = 4 (left: 4 4 10), 10 - 4 = 6 (left: 4 6), 4 * 6 = 24 (left: 24)" are possible. For each game, the IO and CoT prompts are sampled 100 times to see the average performance. We also consider a CoT self-consistency baseline that takes the majority output from 100 CoT samples and an iterative refinement approach over a maximum of 10 IO samples. In each iteration, the LM conditions all previous history so that if the output is incorrect, it "reflects on the error and generates a refined answer".

ToT Setup To incorporate the Game of 24 into the ToT, it is natural to break the thinking down into three steps, each of which is an intermediate equation, as shown in Fig. 2(a), where at each tree node, the number "left" is indicated exactly, prompting the LM to suggest a possible next step Fig. 2(b). The same "suggestion prompt" is used for all three thought steps, but in only one example with four inputs; the ToT performs breadth-first search (BFS), leaving the best b = 5 candidates at each step. as shown in Fig. 2(b), to perform deliberate BFS in the ToT, we ask the LM to suggest each candidate thought, prompting the LM to evaluate on a "sure/possible/impossible" basis whether or not it reaches 24. The goal is to promote correct partial solutions that can be verified with a few prior trials, eliminate impossible partial solutions based on "too big/too small" common sense, and make the rest "maybe". Values are sampled three times for each thought.

Results As shown in Table 2, the IO, CoT, and CoT-SC prompt methods performed poorly in this task, achieving only 7.3%, 4.0%, and 9.0% success rates. In contrast, ToT with a width of b = 1 already achieved a success rate of 45%, and b = 5 achieved 74%. We also consider the oracle setting of IO/CoT by calculating success rates using the best of k samples (1 ≤ k ≤ 100).To compare IO/CoT (best of k) and ToT, we calculated the tree nodes visited per task for ToT over b = 1 - - - 5, Fig. 3 (a ) to consider mapping the five success rates. Not surprisingly, CoT scales better than IO, with the best of 100 CoT samples achieving a 49% success rate, but still much worse than exploring more nodes in ToT (b > 1).

Error Analysis Fig. 3(b) breaks down at which step the CoT and ToT samples failed the task, i.e., whether the thought (in CoT) or all b thoughts (in ToT) are invalid or impossible to reach 24. It is noteworthy that about 60% of the CoT sample already failed the task at the first step, i.e., generating the first three words (e.g., "4 + 9"). This highlights the problem with direct left-to-right decoding.

Creative Writing

The next task is a creative writing task where the input is four random sentences and the output must be a coherent sentence consisting of four paragraphs, each ending with four input sentences. Such a task is open-ended and exploratory and requires a high degree of planning as well as creative thinking.

Task Setup Sampling random sentences from to form 100 inputs. There is no groundtruth passage for each input constraint, and since GPT-4 was found to be able to follow the input constraints most of the time, the authors used two methods: using GPT-4 zero shot prompts to provide a scalar score of 1-10, and using human judgment to They focused on assessing sentence coherence in two ways: by comparing pairs of outputs from different methods. In the former, we sampled five scores for the output of each task and averaged them. We find that the five scores are usually consistent, with a standard deviation of about 0.56 on average across outputs. For the latter, we blinded a subset of the authors and compared the coherence of CoT and ToT generated pairs, where the order of the passages was randomly reversed for 100 inputs.

Given the creative nature of the baseline task, both IO and CoT prompts are zero shot. The former prompts the user to generate coherent sentences directly in the presence of input constraints, whereas the latter prompts the user to first make a simple plan and then write a sentence, i.e., the plan is the intermediate thinking step. 10 IO and CoT samples are generated per task. In this case, the LM determines if the sentence is already "fully coherent" conditional on the input constraints and the last generated sentence, and if not, generates a refined one.

ToT setup LM first generates k = 5 plans and votes for the best one (Fig. 4), then generates k = 5 passages based on the best plan as well and votes for the best one. Here we set the width limit b = 1, since only one choice is retained per step. Using a simple zero-shot voting prompt ("Analyze the following alternatives and conclude which is the most promising for instruction"), we sampled 5 votes in both steps.

Results Fig. 5(a) shows the average GPT-4 scores for the 100 tasks, indicating that the ToT (7.56) was judged to produce more coherent sentences than the IO (6.19) or CoT (6.93). While such automatic measures may be noisy, Fig. 5(b) supports this result by showing that humans prefer ToT over CoT in 41 of the 100 passage pairs, while they prefer CoT over ToT in 21 (the other 38 pairs were judged "similarly consistent"). Finally, iterative refinement was more effective in this natural language task, improving IO consistency scores from 6.19 to 7.67 and ToT consistency scores from 7.56 to 7.91. A third approach to thought generation in the ToT framework is to use independent identical distributions or sequential generation but rather the refinement of old thoughts, we believe that new thoughts can be considered to be generated by refining old thoughts.

mini crossword

In the 24 games and creative writing, the ToT was relatively shallow, requiring at most three thinking steps before reaching the final output. Here we consider a more challenging natural language search problem, a 5 x 5 mini-crossword. Again, the goal here is not simply to solve the task, since a typical crossword can be easily solved with a specialized NLP pipeline that leverages large-scale search instead of LM. Rather, the goal is to explore the limits of the LM as a general problem solver who explores his own thinking and guides his own search with deliberate reasoning as a discovery method.

Task Setup We obtained data from GooBix for 156 5x5 mini-crossword games. Since we know that adjacent games contain similar clues, we decided to use 20 games with indices 1, 6, - - , 91, and 96 for testing and 136, 141, 146, 151, and 156 for prompting. For each task, the input describes 5 horizontal clues and 5 vertical clues, and the output is a board of 5 x 5 = 25 letters for solving the crossword. For evaluation, we consider three levels of success: correct letter parts (25 per game), words (10 per game), and games.

Baseline The IO prompt illustrates five input/output pairs, and the CoT prompt further illustrates h1.... .5 -> v1.... .5, with intermediate words added in that order. We ran 10 samples of each prompt and averaged the results.

ToT Setup Utilizing depth-first search (Algorithm 2), the most promising subsequent word cue is continuously explored until the state is no longer promising, then backtracks to the parent state to explore alternatives. To facilitate the search, subsequent thoughts are constrained not to change the filled word or letter, with a maximum of 10 intermediate steps in the ToT. In thought generation, in each state, an existing thought (e.g., "h2.motor; h1.task" in Fig. 6(a)) is converted to a letter constraint for the remaining cues (e.g., "v1.To heap: tm ;...") ), and then prompts five times for suggestions to come up with candidates to fill in the location and content of the next word. The key is to encourage the LM to give a confidence level to different ideas, and to aggregate these across proposals to obtain a sorted list of ideas to explore next (Fig. 6(a)). For state evaluation, we similarly transform each state into a character constraint for the remaining clues and evaluate for each clue whether it is possible to fill given the constraint. If the remaining clues are deemed "unfillable" (e.g., "v1. To heap: tm s"), the search for that state's subtree is pruned and DFS backtracks to its parent to search for the next promising thought. limiting DFS's search steps to 100, the most deeply explored state (if more than one is the first explored state) is only rendered in the final output.

Results As shown in Table 3, the IO and CoT prompt methods performed poorly, with word-level success rates of less than 16%, whereas ToT significantly improved on all measures, achieving a 60% word-level success rate and resolving 4 of 20 games. IO and CoT have different Such improvements are not surprising given the lack of mechanisms to try cues, change decisions, or backtrack.

A study of o racles and cutoffs When output is from the oracle's (oracle's) best DFS state (rather than the best state determined by heuristic) for each task, ToT performs even better, actually resolving 7/20 games (Table 3, "+best state"). This indicates that the simple output discovery method of this paper can be easily improved upon. Interestingly, when the crossword game is actually solved, the state evaluator may prune some words as "impossible". Perhaps this is because 5x5 crosswords by design have unusual or obsolete words that the GPT-4 cannot recognize. Because state evaluation as a method of pruning detection is imperfect, we also considered disabling pruning, but found that it generally reduced performance (Table 3, "-prune"). In practice, however, we were able to find the correct solution in 4/20 games (but only 1 output by the discovery method), 3 of which are games that cannot be solved within 100 steps with ToT+pruning. Thus, improving the discovery method for DFS pruning is essential to solving the problem in this case. Finally, we confirmed the importance of backtracking by performing a carve-out that keeps filling in the most promising clues for a maximum of 20 steps and allowing overwriting. This was similar to a "greedy" BFS search with a width limit of b = 1, which performed poorly with only a 20% success rate at the word level (Table 3, "-backtrack").


Limitations and Future Directions

For many tasks that the GPT-4 is already good at, deliberate exploration like the ToT may not be necessary. As an initial step, this study challenged GPT-4 to explore only three relatively simple tasks that require better exploration and planning skills built into the LM. However, as we begin to implement LM in more realistic decision-making applications (e.g., coding, data analysis, robotics, etc.), more complex tasks may emerge and provide new opportunities to study these research questions. Also, while exploratory methods such as ToT require more resources (e.g., GPT-4 API costs) than sampling methods to improve task performance, the modular flexibility of ToT allows users to customize the trade-off between performance and cost and ongoing open source efforts should easily reduce this cost in the near future. Finally, this study focuses on the use of commercial LMs, and fine-tuning LMs using ToT-style high-level counterfactual decision making (e.g., pondering potential options for the next paragraph rather than predicting the next token) may enhance LMs' problem solving capabilities may provide an opportunity to.

Wider Impact

ToT is a framework that enables LMs to make decisions and solve problems more autonomously and intelligently. While current tasks are limited to reasoning and search problems, future applications involving interactions with the external environment and humans could pose potential dangers, such as encouraging harmful use of LMs. On the other hand, ToT would also improve the interpretability of model decisions and opportunities for human adjustment, since the resulting representations are readable, high-level linguistic inferences rather than implicit, low-level token values.


The associative "System 1" of LM can be usefully augmented by a "System 2" based on a tree search of possible paths to problem solving. The "tree of thought" framework provides a way to translate classical insights about problem solving into practical methods for modern LM. At the same time, LM addresses the weaknesses of these classical methods and provides a way to solve complex problems that cannot be as easily formalized as creative work. The authors and others thus see the intersection of LM with classical AI approaches as an exciting direction for future research.

友安 昌幸 (Masayuki Tomoyasu) avatar
JDLA G certificate 2020#2, E certificate2021#1 Japan Society of Data Scientists, DS Certificate Japan Society for Innovation Fusion, DX Certification Expert Amiko Consulting LLC, CEO

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.

Contact Us