Contents
1. 서론
언어 모델은 기본적인 텍스트 생성에서 시작하여 점차 다양한 수준의 수학적, 기호적, 상식적, 지식 기반 인퍼런스 능력을 갖추게 되었습니다. 본 논문에서는 기존의 언어 모델을 활용한 문제 해결 방법이 가진 한계를 극복하고, 휴먼의 문제 해결 방식을 모사한 새로운 프레임워크인 Tree of Thoughts(Tree of Thoughts, ToT)를 제안합니다. ToT는 문제 해결 과정을 다양한 ‘사고’의 트리로 구성하여 효과적인 문제 해결을 도와줄 수 있습니다.
2. 배경
언어 모델, 특히 GPT와 같은 대규모 모델들은 $p_{\theta}(x) = \prod_{i=1}^{n} p_{\theta}(x[i] | x[1 \ldots i])$와 같은 자기회귀 방식을 사용하여 토큰 단위의 결정을 내립니다. 이런 기본 메커니즘은 입력 $x$를 출력 $y$로 변환하는 과정을 간단히 표현할 수 있도록 돕습니다. 예를 들어, 입력과 출력을 연결하는 체인 오브 쏘트(Chain of Thought, CoT) 방식은 중간 생각들을 연결하여 문제를 해결하도록 합니다.
\[[z_1 \ldots z_n, y] \sim p_{\text{CoT}}(z_1 \ldots z_n, y \\| x)\]이런 접근은 기존의 입력-출력 프롬프트 방식을 확장하여 보다 복잡한 문제에 적용될 수 있습니다.
3. Tree of Thoughts: 언어 모델을 이용한 심도 있는 문제 해결
ToT 프레임워크는 문제 해결 과정을 ‘사고’의 트리로 구성하여, 각 ‘사고’가 문제 해결을 위한 중간 단계로서 기능하도록 합니다. 각 노드는 현재까지의 입력과 사고의 시퀀스를 포함하며, 각 사고는 문제 해결을 향한 진전을 평가하는데 사용됩니다. 이는 다음과 같은 수학적 표현으로 설명할 수 있습니다.
\[s = [x, z_1 \ldots i]\]$s$는 각 사고의 상태를 나타내며, 다음 사고 후보들을 생성하고 평가하는 과정을 통해 문제 해결에 접근합니다. BFS나 DFS와 같은 탐색 알고리즘을 사용하여 트리를 체계적으로 탐색할 수 있습니다.
4. 실험
ToT 프레임워크는 ‘24 게임’, ‘창의적 글쓰기’, ‘미니 크로스워드’ 등 다양한 태스크에서 기존 방법보다 우수한 성능을 보였습니다. 이런 태스크들은 각각 수학적 인퍼런스, 창의적 사고, 그리고 계획적 탐색을 요구하며, ToT는 이런 복잡한 문제들을 효과적으로 해결할 수 있는 능력을 보이며 ToT는 다양한 사고의 생성과 평가를 통해 더 높은 성공률을 달성했다고 보고합니다.
5. 관련 연구
이 논문은 기존의 문제 해결 방식과 언어 모델을 결합하여 새로운 접근 방식을 제시합니다. 인공 지능의 초기 연구에서부터 최근의 대규모 언어모델까지, 다양한 연구가 이런 문제 해결 프레임워크의 기초를 마련하였습니다. ToT는 이런 기존 연구들을 통합하고 확장하여, 언어 모델을 이용한 보다 복잡한 문제 해결에 적합한 방법을 제안합니다.
6. 결론 및 미래 방향
ToT 프레임워크는 기존의 언어 모델이 갖는 한계를 극복하고, 보다 복잡한 문제 해결에 효과적으로 적용될 수 있는 새로운 방법을 제시합니다. 향후 연구에서는 더 다양한 문제 유형에 ToT를 적용하고, 그 효과와 한계를 더 깊이 탐구할 필요가 있습니다.
Originally designed to generate text, scaled-up versions of language models (LMs) such as GPT [25, 26, 1, 23] and PaLM [5] have been shown to be increasingly capable of performing an ever wider range of tasks requiring mathematical, symbolic, commonsense, and knowledge reasoning. It is perhaps surprising that underlying all this progress is still the original autoregressive mechanism for generating text, which makes token-level decisions one by one and in a left-to-right fashion. Is such a simple mechanism sufficient for a LM to be built toward a general problem solver? If not, what problems would challenge the current paradigm, and what should be alternative mechanisms?
The literature on human cognition provides some clues to answer these questions. Research on “dual process” models suggests that people have two modes in which they engage with decisions – a fast, automatic, unconscious mode (“System 1”) and a slow, deliberate, conscious mode (“System 2”) [30, 31, 16, 15]. These two modes have previously been connected to a variety of mathematical models used in machine learning. For example, research on reinforcement learning in humans and other animals has explored the circumstances under which they engage in associative “model free” learning or more deliberative “model based” planning [7]. The simple associative token-level choices of LMs are also reminiscent of “System 1”, and thus might benefit from augmentation by a more deliberate “System 2” planning process that (1) maintains and explores diverse alternatives for current choices instead of just picking one, and (2) evaluates its current status and actively looks ahead or backtracks to make more global decisions.
Figure 1: Schematic illustrating various approaches to problem solving with LLMs. Each rectangle box represents a thought, which is a coherent language sequence that serves as an intermediate step toward problem solving. See concrete examples of how thoughts are generated, evaluated, and searched in Figures 2,4,6.
To design such a planning process, we return to the origins of artificial intelligence (and cognitive science), drawing inspiration from the planning processes explored by Newell, Shaw, and Simon starting in the 1950s [21, 22]. Newell and colleagues characterized problem solving [21] as search through a combinatorial problem space, represented as a tree. We thus propose the Tree of Thoughts (ToT) framework for general problem solving with language models. As Figure 1 illustrates, while existing methods (detailed below) sample continuous language sequences for problem solving, ToT actively maintains a tree of thoughts, where each thought is a coherent language sequence that serves as an intermediate step toward problem solving (Table 1). Such a high-level semantic unit allows the LM to self-evaluate the progress different intermediate thoughts make towards solving the problem through a deliberate reasoning process that is also instantiated in language (Figures 2,4,6). This implementation of search heuristics via LM self-evaluation and deliberation is novel, as previous search heuristics are either programmed or learned. Finally, we combine this language-based capability to generate and evaluate diverse thoughts with search algorithms, such as breadth-first search (BFS) or depth-first search (DFS), which allow systematic exploration of the tree of thoughts with lookahead and backtracking.
Empirically, we propose three new problems that challenge existing LM inference methods even with the state-of-the-art language model, GPT-4 [23]: Game of 24, Creative Writing, and Crosswords (Table 1). These tasks require deductive, mathematical, commonsense, lexical reasoning abilities, and a way to incorporate systematic planning or search. We show ToT obtains superior results on all three tasks by being general and flexible enough to support different levels of thoughts, different ways to generate and evaluate thoughts, and different search algorithms that adapt to the nature of different problems. We also analyze how such choices affect model performances via systematic ablations and discuss future directions to better train and use LMs.
We first formalize some existing methods that use large language models for problem-solving, which our approach is inspired by and later compared with. We use $p_{\theta}$ to denote a pre-trained LM with parameters $\theta$, and lowercase letters $x, y, z, s, \ldots$ to denote a language sequence, i.e., $x = (x[1], \ldots, x[n])$ where each $x[i]$ is a token, so that $p_{\theta}(x) = \prod_{i=1}^{n} p_{\theta}(x[i] | x[1 \ldots i])$. We use uppercase letters $S, \ldots$ to denote a collection of language sequences.
Input-output (IO) prompting is the most common way to turn a problem input $x$ into output $y$ with LM:$y \sim p_{\theta}(y \mid \text{prompt}{\text{IO}}(x))$, where $\text{prompt}{\text{IO}}(x)$ wraps input $x$ with task instructions and/or few-shot input-output examples. For simplicity, let us denote \(f_{\text{IO}}(x) = \text{prompt}_{\text{IO}}(x)\) Then the IO prompting approach can be written as: \(y \sim p_{\theta}(y \mid f_{\text{IO}}(x))\)
\[p_{\text{prompt}} (\text{output} | \text{input}) = p_{\theta}(\text{output} | \text{prompt}(\text{input})),\]so that IO prompting can be formulated as:
\[y \sim p_{\theta}(y | \text{prompt}_{\text{IO}}(x)).\]Chain-of-thought (CoT) prompting was proposed to address cases where the mapping of input \(x\) to output \(y\) is non-trivial (e.g., when \(x\) is a math question and \(y\) is the final numerical answer). The key idea is to introduce a chain of thoughts \(z_1, \ldots, z_n\) to bridge \(x\) and \(y\), where each \(z_i\) is a coherent language sequence that serves as a meaningful intermediate step toward problem-solving (e.g., \(z_i\) could be an intermediate equation for math QA). To solve problems with CoT, each thought \(z_i \sim p_{\text{CoT}}(y \| x, z_1 \ldots z_n)\). In practice,
\[[z_1 \ldots z_n, y] \sim p_{\text{CoT}}(z_1 \ldots z_n, y | x)\]is sampled as a continuous language sequence, and the decomposition of thoughts (e.g., is each $z_i$ a phrase?).
Self-consistency with CoT (CoT-SC) is an ensemble approach that samples $k$ i.i.d. chains of thought:
\[[z^{(i)}] \sim p_{\text{CoT}}(z_1 \ldots z_n, y | x) \quad (i = 1 \ldots k),\]then returns the most frequent output:
\[\arg \max_y \#\{i \mid y^{(i)} = y\}.\]CoT-SC improves upon CoT, because there are generally different thought processes for the same problem (e.g., different ways to prove the same theorem), and the output decision can be more faithful by exploring a richer set of thoughts. However, within each chain, there is no local exploration of different thought steps, and the “most frequent” heuristic only applies when the output space is limited (e.g., multiple-choice QA).
A genuine problem-solving process involves the repeated use of available information to initiate exploration, which discloses, in turn, more information until a way to attain the solution is finally discovered.—— Newell et al. [21]
Research on human problem-solving suggests that people search through a combinatorial problem-space – a tree where the nodes represent partial solutions, and the branches correspond to operators that modify them [21, 22]. Which branch to take is determined by heuristics that help to navigate the problem-space and guide the problem-solver towards a solution. This perspective highlights two key shortcomings of existing approaches that use LMs to solve general problems: 1) Locally, they do not explore different continuations within a thought process – the branches of the tree. 2) Globally, they do not incorporate any type of planning, lookahead, or backtracking to help evaluate these different options – the kind of heuristic-guided search that seems characteristic of human problem-solving.
To address these shortcomings, we introduce Tree of Thoughts (ToT), a paradigm that allows LMs to explore multiple reasoning paths over thoughts (Figure 1(c)). ToT frames any problem as a search over a tree, where each node is a state s = [x, z1···i] representing a partial solution with the input and the sequence of thoughts so far. A specific instantiation of ToT involves answering four questions: 1. How to decompose the intermediate process into thought steps; 2. How to generate potential thoughts from each state; 3. How to heuristically evaluate states; 4. What search algorithm to use.
(a) Breadth-first search (BFS) (Algorithm 1) maintains a set of the b most promising states per step. This is used for Game of 24 and Creative Writing where the tree depth is limit (T ≤ 3), and initial thought steps can be evaluated and pruned to a small set (b ≤ 5). (b) Depth-first search (DFS) (Algorithm 2) explores the most promising state first, until the final output is reached (t > T ), or the state evaluator deems it impossible to solve the problem from the current s (V (pθ, {s})(s) ≤ vth for a value threshold vth). In the latter case, the subtree from s is pruned to trade exploration for exploitation. In both cases, DFS backtracks to the parent state of s to continue exploration.
Conceptually, ToT has several benefits as a method for general problem-solving with LMs: (1) Generality. IO, CoT, CoT-SC, and self-refinement can be seen as special cases of ToT (i.e. trees of limited depth and breadth; Figure 1). (2) Modularity. The base LM, as well as the thought decomposition, generation, evaluation, and search procedures can all be varied independently. (3) Adaptability. Different problem properties, LM capabilities, and resource constraints can be accommodated. (4) Convenience. No extra training is needed, just a pre-trained LM is sufficient. The next section will show how these conceptual benefits translate to strong empirical performance in different problems.
We propose three tasks that are hard even when sampling from the state-of-the-art language model, GPT-4 [23], using standard IO prompting or chain-of-thought (CoT) prompting. We show how deliberate search in trees of thoughts (ToT) produces better results, and more importantly, interesting and promising new ways to use language models to solve problems requiring search or planning. Unless otherwise stated, we perform experiments using a Chat Completion mode GPT-41 with a sampling temperature of 0.7.
Game of 24 is a mathematical reasoning challenge, where the goal is to use 4 numbers and basic arithmetic operations (+-*/) to obtain 24. For example, given input “4 9 10 13”, a solution output could be “(10 - 4) * (13 - 9) = 24”.
Figure 2: ToT in a game of 24. The LM is prompted for (a) thought generation and (b) valuation.
Task Setup. We scrape data from 4nums.com, which has 1,362 games that are sorted from easy to hard by human solving time, and use a subset of relatively hard games indexed 901-1,000 for testing. For each task, we consider the output as success if it is a valid equation that equals 24 and uses the input numbers each exactly once. We report the success rate across 100 games as the metric.
Baselines. We use a standard input-output (IO) prompt with 5 in-context examples. For chain-of-thought (CoT) prompting, we augment each input-output pair with 3 intermediate equations, each operating on two remaining numbers. For example, given input “4 9 10 13”, the thoughts could be “13 - 9 = 4 (left: 4 4 10); 10 - 4 = 6 (left: 4 6); 4 * 6 = 24 (left: 24)”. For each game, we sample IO and CoT prompting for 100 times for average performance. We also consider a CoT self-consistency baseline, which takes the majority output from 100 CoT samples, and an iterative-refine approach on top of an IO sample for at most 10 iterations. At each iteration, the LM is conditioned on all previous history to “reflect on your mistakes and generate a refined answer” if the output is incorrect. Note that it uses groundtruth feedback signals about equation correctness.
ToT Setup. To frame Game of 24 into ToT, it is natural to decompose the thoughts into 3 steps, each an intermediate equation. As shown in Figure 2(a), at each tree node, we exact the remaining numbers and prompt the LM to propose some possible next steps. The same “propose prompt” is used for all 3 thought steps, though it only has one example with 4 input numbers. We perform a breadth-first search (BFS) in ToT, where at each step we keep the best b = 5 candidates. To perform deliberate BFS in ToT, as shown in Figure 2(b), we prompt LM to evaluate each thought candidate as “sure/maybe/impossible” with regard to reaching 24. The aim is to promote correct partial solutions that can be verdicted within few lookahead trials, and eliminate impossible partial solutions based on “too big/small” commonsense, and keep the rest “maybe”. We sample values 3 times for each thought.
Table 2: Game of 24 Results.
Figure 3: Game of 24 (a) scale analysis & (b) error analysis.
Results. As shown in Table 2, IO, CoT, and CoT-SC prompting methods perform badly on the task, achieving only 7.3%, 4.0%, and 9.0% success rates. In contrast, ToT with a breadth of b = 1 already achieves a success rate of 45%, while b = 5 achieves 74%. We also consider an oracle setup for IO/CoT, by calculating the success rate using best of k samples (1 ≤ k ≤ 100). To compare IO/CoT (best of k) with ToT, we consider calculating the tree nodes visited per task in ToT across b = 1 · · · 5, and map the 5 success rates in Figure 3(a), treating IO/CoT (best of k) as visiting k nodes in a bandit. Not surprisingly, CoT scales better than IO, and best of 100 CoT samples achieve a success rate of 49%, but still much worse than exploring more nodes in ToT (b > 1).
Error analysis. Figure 3(b) breaks down at which step CoT and ToT samples fail the task, i.e. the thought (in CoT) or all b thoughts (in ToT) are invalid or impossible to reach 24. Notably, around 60% of CoT samples already failed the task after generating the first step, or equivalently, the first three words (e.g. “4 + 9”). This highlights the issues with direct left-to-right decoding.
Next, we invent a creative writing task where the input is 4 random sentences and the output should be a coherent passage with 4 paragraphs that end in the 4 input sentences respectively. Such a task is open-ended and exploratory, and challenges creative thinking as well as high-level planning.
Task setup. We sample random sentences from randomwordgenerator.com to form 100 inputs, and there is no groundtruth passage for each input constraint. As we find that GPT-4 can follow the input constraints most of the time, we focus on evaluating passage coherency in two ways: using a GPT-4 zero-shot prompt to provide a 1-10 scalar score, or using human judgments to compare pairs of outputs from different methods. For the former, we sample 5 scores and average them for each task output, and we find these 5 scores usually consistent, with a standard deviation of around 0.56 on average across outputs. For the latter, we employ a subset of the authors in a blind study to compare the coherency of CoT vs. ToT generated passage pairs, where the order of passages is random flipped over 100 inputs.
Baselines. Given the creative nature of the task, both IO and CoT prompts are zero-shot. While the former prompts the LM to directly generate a coherent passage given input constraints, the latter prompts the LM to first make a brief plan then write the passage, i.e. the plan serves as the intermediate thought step. We generate 10 IO and CoT samples per task. We also consider an iterative-refine (k ≤ 5) method on top of a random IO sample for each task, where the LM is conditioned on input constraints and the last generated passage to decide if the passage is already “perfectly coherent”, and if not generate a refined one.
ToT setup. We build a ToT with depth 2 (and only 1 intermediate thought step) — the LM first generates k = 5 plans and votes for the best one (Figure 4), then similarly generate k = 5 passages based on the best plan then vote for the best one. Here the breadth limit b = 1, as only one choice is kept per step. A simple zero-shot vote prompt (“analyze choices below, then conclude which is most promising for the instruction”) is used to sample 5 votes at both steps.
Results. Figure 5(a) shows average GPT-4 scores across 100 tasks, where ToT (7.56) is deemed to generate more coherent passages than IO (6.19) and CoT (6.93) on average. While such an automatic metric might be noisy, Figure 5(b) confirms the finding by showing that humans prefer ToT over CoT in 41 out of 100 passage pairs, while only prefer CoT over ToT in 21 (other 38 pairs are found “similarly coherent”). Lastly, iterative-refine is more effective on this natural language task, where it improves IO coherency score from 6.19 to 7.67, and ToT coherency score from 7.56 to 7.91. We believe it could be thought of as a third approach to thought generation in the ToT framework, where new thoughts can arise from refining old thoughts instead of i.i.d. or sequentially generated.
Table 3: Mini Crosswords results.
Figure 5: Creative Writing results.
In Game of 24 and Creative Writing, ToT is relatively shallow — at most 3 thought steps are needed to reach the final output. Here we explore 5 × 5 mini crosswords as a harder search problem involving natural language. Again, the goal is not just to solve the task, as more general crosswords can be readily solved with specialized NLP pipelines [34] that leverages large-scale retrieval instead of LM. Rather, we aim to explore the limit of LM as a general problem solver that explores its own thoughts and guides its own exploration with deliberate reasoning as heuristics.
Task setup. We scrape data from GooBix, which contains 156 games of 5 × 5 mini crosswords. As we observe adjacent games contain similar clues, we use 20 games with indices 1, 6, · · · , 91, 96 for testing, and games 136, 141, 146, 151, 156 for prompting. For each task, the input describes the 5 horizontal clues and 5 vertical clues, and the output should be a board of 5 × 5 = 25 letters to solve the crosswords. For evaluation, we consider three levels of success: the portion of correct letters (25 per game), words (10 per game), and games.
Baselines. We provide 5 example input-output pairs in the IO prompt, and in the CoT prompt additionally include intermediate words in the order h1..5 then v1..5. We run each prompt for 10 samples and average the results.
ToT setup. We leverage a depth-first search (Algorithm 2) that keeps exploring the most promising subsequent word clue until the state is no longer promising, then backtrack to the parent state to explore alternative thoughts. To make search tractable, subsequent thoughts are constrained not to change any filled words or letters, so that the ToT has at most 10 intermediate steps. For thought generation, at each state we translate all existing thoughts (e.g. “h2.motor; h1.tasks” for the state in Figure 6(a)) into letter constraints for remaining clues (e.g. “v1.To heap: tm ;…”) and prompt a proposal prompt 5 times to come up with candidates for where and what to fill in the next word. Importantly, we also prompt the LM to give a confidence level for different thoughts, and aggregate these across proposals to obtain a sorted list of next thoughts to explore (Figure 6(a)). For state evaluations, we similarly translate each state into letter constraints for remaining clues, then evaluate for each clue if it is possible to fill given the constraints. If any remaining clue is deemed “impossible” to fill in (e.g. “v1. To heap: tm s ”), then the exploration of the state’s subtree is pruned and DFS backtracks to its parent to explore the next promising thought. We limit DFS search steps to 100, and simply render the deepest explored state (the first explored one if multiple) into the final output.
Figure 6: In Mini Crosswords, (a) how thoughts are proposed and aggregated in a priority queue for depth-first search (DFS), and (b) how a state is evaluated based on the possibility of filling in each remaining word clue, and pruned if any remaining clue is deemed not possible to fill by the LM. Then DFS backtracks to the parent state and explore the next promising thought for clue.
Results. As shown in Table 3, IO and CoT prompting methods perform poorly with a word-level success rate less than 16%, while ToT significantly improves all metrics, achieving a word-level success rate of 60% and solving 4 out of 20 games. Such an improvement is not surprising, given IO and CoT lack mechanisms to try different clues, make changes to decisions, or backtrack.
Oracle and ablation studies. When outputting from the oracle best DFS state (instead of the heuristically determined best state) per task, ToT performance is even higher and actually solves 7/20 games (Table 3, “+best state”), indicating our simple output heuristics can be readily improved. Interestingly, sometimes when the crosswords game is actually solved, the state evaluator might still deem some words as “impossible” and prune — possibly because 5 × 5 crosswords by design have some rare or obselete words that GPT-4 cannot recognize2. Given the state evaluation as a pruning heuristic is imperfect, we also explore ablating the pruning, and find the performance generally worse (Table 3, “-prune”). However, it could actually find the correct solution for 4/20 games (though only outputting 1 via heuristic), 3 of which are games ToT+pruning cannot solve within 100 steps. Thus, better heuristics for DFS pruning are critical for problem solving in this case. Lastly, we confirm the importance of backtracking by running an ablation that keeps filling the most promising clue for at most 20 steps, allowing overwrites. This is similar to a “greedy” BFS search with breadth limit of b = 1, and performs poorly with a word level success of only 20% (Table 3, “-backtrack”).
Planning and decision making. Smart planning and decision making are critical to achieving predefined goals. As they are trained on vast amount of world knowledge and human examples, LMs are known to have already absorbed rich commonsense that makes it possible to propose reasonable plans conditioned on problem setting and environmental states [12, 42, 37, 13, 35, 41, 40]. Our proposed ToT approach extends existing planning formulations by considering multiple potentially feasible plans simultaneously at each problem-solving step, and proceeding with the most promising ones. The integration between thought sampling and value feedback organically integrates planning and decision-making mechanisms, enabling effective search inside a solution tree. On the other hand, traditional decision-making procedures usually require training dedicated reward and policy models as in reinforcement learning (for example CHAI [33]), whereas we use the LM itself to provide the value estimates for decision making. RAP [9] is a concurrent work that treats language model reasoning as planning with its internal world model, and proposes a MCTS-based method similar to ToT. However, its tasks are simpler than ours, and its framework lacks the modularity to incorporate different tree search algorithms.
2 For example, “agend” is an obsolete form of “agendum”, but GPT-4 deems it a typo for “agenda”. External retrieval or web interaction could augment LM for problem solving under knowledge uncertainty.
Self-reflection. Using LLMs to assess the viability of their own predictions is becoming an increasingly important procedure in problem solving. [28, 20, 24] introduced the “self-reflection” mechanism, in which LMs provide feedback to their generation candidates. [4] improves LMs code generation accuracy by injecting feedback messages generated by the LM itself based on its code execution results. Similarly, [17] also introduces “critic” or review steps over the actions and states, deciding the next action to take in solving computer operation tasks. Another recent work very relevant to ours is “self-eval guided decoding” [39]. Similar to our method, self-eval decoding also follows a tree-search procedure with leaves sampled from stochastic beam search decoding, which are then evaluated by LLM itself with carefully prepared self-eval prompts. Their approach however, uses the PAL formulation [8] which represents thoughts as codes, which makes it difficult to tackle challenging tasks like creative writing which we consider in this paper. Our Tree-of-Thought formulation is thus more versatile and handles challenging tasks on which GPT-4 only achieves very low accuracy with standard prompts.
Program-guided LLM generation. Our proposal is also related to recent advancements that organize LM’s behavior with systematic procedures [14, 44, 6, 43] or symbolic program guidance. For example, Schlag et al. [27] embeds LMs in an algorithmic search procedure to help solve problems like question answering step-by-step, in which the search trees are expanded by relevant paragraphs that might provide answers. This approach however differs from ours in that trees are expanded by sampling external paragraphs instead of the LM’s own thoughts, and there is no reflection or voting steps. Another approach, LLM+P [18], goes one step further and delegates the actual planning process to a classical planner.
Classical search methods. Last but not least, our approach can be treated as a modern rendition of classical search methods for problem solving. For example it can be considered as a heuristic search algorithm like A* [10], in which the heuristic at each search node is provided by the LM’s self-assessment. From this perspective, our method is also related to NeuroLogic Aesque decoding [19], which is inspired by A search but introduces look-ahead heuristics that are efficient for LMs to improve the beam-search or top-k sampling decoding. This method however is constrained to sentence generation tasks, whereas our framework are designed for complex, multi-step problem solving guarded by value feedback.
Limitations and future directions. Deliberate search such as ToT might not be necessary for many existing tasks that GPT-4 already excels at (see Appendix B.1), and as an initial step this work only explores three relatively simple tasks that challenges GPT-4 (see Appendix B.2 for some GPT-3.5 experiment results) and calls of better search and planning abilities incorporated with LMs. However, as we begin to deploy LMs for more real-world decision making applications (e.g. coding, data analysis, robotics, etc.), more complex tasks could emerge and present new opportunities to study these research questions. Also, search methods like ToT requires more resources (e.g. GPT-4 API cost) than sampling methods in order to improve task performances, but the modular flexibility of ToT allows users to customize such performance-cost tradeoffs, and ongoing open-source efforts [32] should readily reduce such costs in the near future. More details about cost and efficiency are in Appendix B.3. Lastly, this work focuses on using an off-the-shelf LM, and fine-tuning LMs using a ToT-style high-level counterfactual decision making (e.g. deliberating over potential choices for the next paragraph, instead of predicting the next token) might present opportunities to enhance the problem-solving capabilities of LMs.
Conclusion. The associative “System 1” of LMs can be beneficially augmented by a “System 2” based on searching a tree of possible paths to the solution to a problem. The Tree of Thoughts framework provides a way to translate classical insights about problem-solving into actionable methods for contemporary LMs. At the same time, LMs address a weakness of these classical methods, providing a way to solve complex problems that are not easily formalized, such as creative writing. We see this intersection of LMs with classical approaches to AI as an exciting direction.
Broader Impact
ToT is a framework that empowers LMs to more autonomously and intelligently make decisions and solve problems. While current tasks are limited to reasoning and search problems, future applications involving interaction with external environments or humans could bring potential danger, e.g. facilitating harmful uses of LMs. On the other hand, ToT also improves the interpretability of model decisions and the opportunity for human alignment, as the resulting representations are readable, high-level language reasoning instead of implicit, low-level token values.