Tree of Thoughts: Deliberate Problem Solving with Large Language Models — Detailed Technical Review
Paper: Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Authors: Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan
Affiliations: Princeton University, Google DeepMind
Published: NeurIPS 2023 (arXiv: 2305.10601)
Reviewer: Zhongzhu Zhou
Review Date: February 16, 2026
I. Prerequisites: What You Need to Know
Before diving into the Tree of Thoughts framework, let us establish the foundational concepts that make this paper accessible, even if you are new to LLM reasoning research.
1.1 Autoregressive Language Models and Left-to-Right Commitment
A large language model (LLM) like GPT-4 generates text one token at a time, left to right. Given a sequence of tokens x = (x_1, ..., x_n), the model computes:
p_θ(x) = ∏_{i=1}^{n} p_θ(x_i | x_1, ..., x_{i−1})
This autoregressive mechanism means each token is conditioned on all previous tokens. The critical implication is commitment: once a token is generated, there is no built-in mechanism to reconsider or undo that choice. For straightforward text generation this works fine, but for tasks requiring exploration — where an early choice might lead to a dead end — this left-to-right commitment becomes a severe structural limitation.
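The chain-rule factorization above can be made concrete with a toy model. The sketch below uses a hypothetical hand-written bigram table (each token conditions on only the previous one) in place of a real LLM; the table entries are illustrative, not from the paper.

```python
# Toy illustration of the autoregressive factorization: the probability of a
# sequence is the product of per-token conditionals. A bigram model is the
# simplest case, conditioning each token on just its predecessor.
bigram = {
    ("<s>", "the"): 0.5,  # p("the" | start)
    ("the", "cat"): 0.4,  # p("cat" | "the")
    ("cat", "sat"): 0.6,  # p("sat" | "cat")
}

def sequence_prob(tokens):
    """p(x) = prod_i p(x_i | x_{<i}); unseen transitions get probability 0."""
    prob, prev = 1.0, "<s>"
    for tok in tokens:
        prob *= bigram.get((prev, tok), 0.0)
        prev = tok
    return prob

print(sequence_prob(["the", "cat", "sat"]))  # 0.5 * 0.4 * 0.6 ≈ 0.12
```

Note that nothing in this computation allows revisiting an earlier token: once "the" is emitted, every later conditional is anchored on it, which is exactly the commitment the section describes.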
1.2 Chain-of-Thought (CoT) Prompting
Chain-of-thought prompting (Wei et al., 2022) encourages LLMs to produce intermediate reasoning steps before giving a final answer. The mechanism is simple: include worked examples in the prompt that show step-by-step reasoning. Formally, CoT introduces a chain of thoughts z_1, ..., z_n to bridge input x and output y:
z_i ~ p_θ(z_i | x, z_1, ..., z_{i−1}),   y ~ p_θ(y | x, z_1, ..., z_n)
CoT is effective on many reasoning benchmarks but has a fundamental limitation: it generates a single, linear chain of reasoning. If the model makes a bad early decision, the entire chain is contaminated — there is no mechanism to reconsider or try a different path. This is exactly the limitation ToT addresses.
1.3 Self-Consistency (CoT-SC)
Self-consistency (Wang et al., 2022) is the first attempt to work around CoT's single-path limitation. The idea: sample k independent chains of thought for the same problem, then majority-vote on the final answer:
y* = argmax_y #{ i : y^(i) = y }
It helps when different reasoning paths converge on the same answer. However, CoT-SC has two blind spots: (1) there is no local exploration within each chain — each chain is still a single committed path; (2) majority voting only works when the output space is small and discrete. For open-ended generation, CoT-SC offers little benefit.
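The aggregation step of CoT-SC is just a majority vote over extracted final answers. A minimal sketch, assuming the chains have already been sampled and their answers parsed out (the `self_consistency` helper and the example answers are illustrative, not from the paper):

```python
# CoT-SC aggregation: majority vote over the final answers of k independently
# sampled reasoning chains. Only works when answers live in a small discrete
# space, which is exactly the blind spot noted in the text.
from collections import Counter

def self_consistency(answers):
    """Return the most common final answer across sampled chains."""
    return Counter(answers).most_common(1)[0][0]

# Five hypothetical chains: three converge on "24", so it wins the vote,
# even though two chains reasoned their way to wrong answers.
print(self_consistency(["24", "25", "24", "24", "22"]))  # prints 24
```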
1.4 Dual Process Theory (System 1 / System 2)
Kahneman's dual process framework (2011) distinguishes between fast, automatic "System 1" thinking and slow, deliberate "System 2" thinking. Standard LLM autoregressive generation maps to System 1 — fast, pattern-matching, token-by-token, no explicit planning. ToT's ambition is to approximate System 2 — maintain multiple hypotheses, evaluate progress, decide whether to continue or backtrack, and plan ahead. The connection is conceptual, not neuroscientific, but it provides a useful lens for understanding what ToT is trying to achieve.
1.5 Classical Search Algorithms: BFS and DFS
ToT plugs in standard search algorithms to explore the thought tree:
- Breadth-first search (BFS): Explore all nodes at the current depth before moving deeper. Keep only the top-b most promising nodes per level. Good for shallow trees where you want to compare multiple options before committing.
- Depth-first search (DFS): Explore one branch as deeply as possible, backtracking when the state evaluator says the branch is hopeless. Good for deeper problems where you need to explore full solutions before deciding.
Understanding BFS and DFS is essential to understanding ToT's design choices for different tasks.
1.6 Heuristic Evaluation in Search
In classical AI, search algorithms use heuristic functions to estimate how promising a partial solution is. In A* search, the heuristic estimates the remaining cost to a goal. In game-playing (e.g., chess), evaluation functions score board positions. The key insight of ToT is to use the LLM itself as this heuristic evaluator — asking the model "how promising is this partial solution?" Previously, such heuristics were either hand-coded (like Deep Blue) or learned from domain-specific data (like AlphaGo). Using the LLM's own reasoning capabilities as a general-purpose evaluator is novel.
1.7 The Problem: When Left-to-Right Fails
Every existing LLM inference method at the time of this paper (IO prompting, CoT, CoT-SC, self-refinement) fundamentally operates by committing to a left-to-right sequence. Even CoT-SC's multiple chains are each individually committed. For tasks requiring exploration, strategic lookahead, or where early decisions are pivotal (combinatorial math, planning, puzzle-solving), this left-to-right commitment is a severe constraint. ToT breaks it by introducing tree-structured reasoning with evaluation and backtracking.
II. What This Paper Does: Core Contributions
2.1 The Key Insight
The central insight is reframing LLM problem-solving as search through a tree of thoughts. Instead of generating one continuous text sequence, the LM:
- Decomposes the problem into intermediate "thought steps" (a thought = a coherent text chunk that represents partial progress toward a solution)
- Generates multiple candidate thoughts at each step
- Evaluates which partial solutions look most promising (using the LM itself as a heuristic evaluator)
- Searches through the tree using classical algorithms (BFS or DFS), keeping good branches and pruning bad ones
This is directly inspired by Newell, Shaw, and Simon's foundational work on problem-solving as search through combinatorial spaces (1959, 1972). The novelty is using the LM both as the generator of search states and as the heuristic evaluator.
2.2 The Generalization Hierarchy
ToT subsumes existing methods as special cases:
- IO prompting = a tree of depth 1, breadth 1 (single shot)
- CoT = a tree of depth T, breadth 1 per node (single chain)
- CoT-SC = k independent trees of depth T, breadth 1, with majority voting (ensemble of chains, no cross-pollination)
- ToT = a tree of depth T, breadth b, with evaluation-guided pruning and backtracking (full search)
This is elegant because it shows these methods are not fundamentally different paradigms — they are points on a spectrum of search depth and breadth. ToT simply moves to a region of this spectrum that allows genuine exploration.
2.3 How It Differs from Prior Work
The key differentiation from self-refinement or Reflexion is structural:
- Self-refinement iteratively improves a single solution. It is linear — no branching, no parallel alternatives.
- Reflexion adds episodic memory across attempts but still follows one path per episode.
- ToT maintains multiple active partial solutions simultaneously and uses evaluation to decide which to expand. It explores width and depth within a single episode.
III. Methodology in Depth
3.1 Thought Decomposition
The first design decision in applying ToT to any problem is: what is a "thought"? A thought must satisfy two constraints:
- Small enough that the LM can generate diverse, high-quality candidates. If a thought is "write an entire essay," diversity will be low and quality unpredictable.
- Big enough that the LM can meaningfully evaluate whether it makes progress toward the goal. If a thought is a single token, evaluation is meaningless.
The paper demonstrates this with three different granularities:
| Task | Thought Granularity | Tree Depth |
|---|---|---|
| Game of 24 | One arithmetic equation using two numbers | 3 |
| Creative Writing | A writing plan (paragraph-level outline) | 2 |
| Mini Crosswords | Filling in one word | Up to 10 |
This design choice is problem-specific and is one of the most important practical decisions when deploying ToT. Getting the granularity wrong can make or break performance: too fine-grained wastes evaluation budget; too coarse-grained loses the exploration benefit.
3.2 Thought Generation
Two strategies for generating candidate thoughts z from a given state s:
Strategy A: i.i.d. Sampling — Sample k independent thoughts z^(1), ..., z^(k) ~ p_θ(z | s) from a CoT-style prompt conditioned on the current state. Each sample is drawn independently, so diversity comes from sampling temperature.
This works best when the thought space is rich and open-ended (e.g., Creative Writing: generate different paragraph plans).
Strategy B: Sequential Proposal — Use a "propose prompt" that asks the LM to enumerate k different next steps in a single generation: [z^(1), ..., z^(k)] ~ p_θ(z^(1), ..., z^(k) | s).
This is better when the thought space is constrained (Game of 24: propose possible equations using remaining numbers; Crosswords: propose candidate words for unfilled clues). Sequential proposal avoids duplication and gives the LM context of what it has already proposed.
The choice between strategies has real cost implications: i.i.d. sampling requires k separate LM calls; sequential proposal requires one. But sequential proposal can suffer from positional bias.
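The two strategies can be sketched against a stand-in LM interface. Here `lm` is a hypothetical callable (prompt string in, text out) — any real API client would slot into that position; the prompts and the fake LM below are illustrative, not the paper's actual prompts:

```python
# Sketch of the two thought-generation strategies. `lm` is an assumed
# prompt -> text callable standing in for a real model API.

def sample_thoughts(lm, state, k):
    """Strategy A: k independent LM calls; diversity comes from temperature."""
    return [lm(f"Continue this partial solution:\n{state}") for _ in range(k)]

def propose_thoughts(lm, state, k):
    """Strategy B: one LM call that enumerates k distinct next steps."""
    out = lm(f"List {k} distinct possible next steps, one per line:\n{state}")
    return out.splitlines()[:k]

# Fake deterministic LM for demonstration: returns canned proposals.
fake_lm = lambda prompt: "4 + 5 = 9\n10 - 4 = 6\n6 - 4 = 2"
print(propose_thoughts(fake_lm, "numbers: 4 5 6 10", 3))
```

With a real model, Strategy A would pay k calls but get genuinely independent samples, while Strategy B pays one call and relies on the prompt to force distinctness — the tradeoff described above.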
3.3 State Evaluation
This is where the framework derives its power. ToT uses the LM itself as a heuristic evaluator — essentially asking "how promising is this partial solution?"
Strategy A: Independent Value — Evaluate each state independently by prompting the LM to assess whether the current partial solution can lead to success. For Game of 24, this is a classification prompt:
"Can these remaining numbers reach 24?" → "sure / maybe / impossible"
The evaluator uses both forward reasoning (e.g., confirming that the remaining numbers 10 and 14 can reach 24 via 10 + 14 = 24) and commonsense pruning (e.g., "1, 2, 3 are too small to reach 24"). Values are sampled multiple times (3 times in the paper) and aggregated.
Strategy B: Deliberative Voting — Present all candidate states to the LM simultaneously and ask it to vote for the most promising one. This is better when absolute valuation is difficult but relative comparison is feasible (Creative Writing: "which of these 5 plans is most coherent?"). Votes are sampled multiple times and aggregated by count.
Both strategies trade compute for evaluation quality — more samples mean more reliable heuristics but higher cost. The evaluation does not need to be perfect; it only needs to be approximately correct often enough to guide the search in the right direction. This is analogous to admissible heuristics in A* search — imperfect but directionally helpful.
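The aggregation step for both evaluation strategies is simple enough to sketch. The score mapping below (sure/maybe/impossible to numbers) follows the spirit of the paper's setup but the exact values are assumptions for illustration, as is the vote-index interface:

```python
# Aggregating repeated LM judgments into a usable heuristic value.
from collections import Counter

# Assumed mapping from the 3-way classification to scalar scores.
VALUE_MAP = {"sure": 20.0, "maybe": 1.0, "impossible": 0.001}

def aggregate_value(judgments):
    """Strategy A: average mapped scores over repeated value samples."""
    return sum(VALUE_MAP[j] for j in judgments) / len(judgments)

def aggregate_votes(vote_indices, candidates):
    """Strategy B: the candidate whose index won the most votes is kept."""
    best_idx = Counter(vote_indices).most_common(1)[0][0]
    return candidates[best_idx]

print(aggregate_value(["sure", "maybe", "sure"]))           # (20 + 1 + 20) / 3
print(aggregate_votes([0, 1, 1, 2, 1], ["A", "B", "C"]))    # plan B wins
```

Sampling judgments three or five times and averaging, as the paper does, is exactly the "more samples, more reliable heuristic, higher cost" trade described above.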
3.4 Search Algorithm: BFS
BFS (Algorithm 1 in the paper) is used for Game of 24 and Creative Writing (shallow trees, T = 3 and T = 2 respectively):
- Start with the input x as the root state
- At each depth level t: generate k candidates from each of the b active states, evaluate all candidates, keep the top-b
- After T steps, generate the final output from the best state
The beam width b controls the exploration-exploitation tradeoff. b = 1 is greedy (no branching); b = 5 allows significant exploration. The paper shows that increasing b from 1 to 5 on Game of 24 improves success from 45% to 74%.
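The BFS loop can be sketched on a toy version of Game of 24. States are tuples of remaining numbers; as a deliberate simplification, an exact brute-force reachability oracle stands in for the LM's value prompt (the real framework queries the LM for both generation and evaluation, so this is an idealized illustration, not the paper's implementation):

```python
# Toy ToT-style BFS over a Game-of-24 state space.
from itertools import combinations

def expand(state):
    """All states reachable by combining two numbers with one operation."""
    out = []
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [state[n] for n in range(len(state)) if n not in (i, j)]
        results = {a + b, a * b, a - b, b - a}
        if b: results.add(a / b)
        if a: results.add(b / a)
        out.extend(tuple(rest + [r]) for r in results)
    return out

def value(state):
    """Oracle heuristic standing in for the LM: 1.0 iff 24 is reachable."""
    if len(state) == 1:
        return 1.0 if abs(state[0] - 24) < 1e-6 else 0.0
    return max((value(s) for s in expand(state)), default=0.0)

def tot_bfs(start, b=5, T=3):
    frontier = [tuple(start)]
    for _ in range(T):  # one equation consumed per depth level
        candidates = [s for st in frontier for s in expand(st)]
        frontier = sorted(candidates, key=value, reverse=True)[:b]  # keep top-b
    return frontier[0]

print(tot_bfs([4, 5, 6, 10]))  # a single-number state; ≈ 24 if solvable
```

The structure mirrors Algorithm 1: expand every active state, score all candidates, keep the b best, repeat for T levels. Swapping the oracle for a noisy LM judgment is what makes beam width b matter in practice.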
3.5 Search Algorithm: DFS
DFS (Algorithm 2 in the paper) is used for Mini Crosswords (deeper trees, up to 10 steps):
- Expand the most promising child of the current state
- If the evaluator deems the current state "impossible," prune the subtree and backtrack to the parent
- Continue until a solution is found, the step limit is reached, or all branches are pruned
DFS with pruning is crucial for crosswords because the search space is much larger. Without pruning, budget is wasted on dead-end branches. Without backtracking, the solver gets stuck with bad early word choices that make later clues impossible.
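The DFS variant can be sketched as a generic skeleton with pluggable, problem-specific callables (`children`, `score`, `is_goal` are hypothetical names, not the paper's API); the toy demo below searches digit sequences rather than crosswords:

```python
# Generic DFS with pruning and backtracking, matching the crosswords setup:
# expand the most promising child first, prune subtrees the evaluator rejects,
# and backtrack when every branch is pruned or exhausted.
def tot_dfs(state, children, score, is_goal, threshold=0.0):
    if is_goal(state):
        return state
    for child in sorted(children(state), key=score, reverse=True):
        if score(child) <= threshold:
            continue  # evaluator says "impossible": prune this subtree
        found = tot_dfs(child, children, score, is_goal, threshold)
        if found is not None:
            return found
    return None  # dead end: backtrack to the caller

# Tiny demo: find digits (1-6) whose sum is exactly 7, in at most 3 steps.
children = lambda s: [s + [d] for d in range(1, 7)] if len(s) < 3 else []
score = lambda s: 1.0 if sum(s) <= 7 else 0.0   # prune overshoots early
is_goal = lambda s: sum(s) == 7
print(tot_dfs([], children, score, is_goal))     # prints [1, 1, 5]
```

The `continue` branch is the pruning ablation's target, and the `return None` propagation is the backtracking ablation's target — removing either reproduces the failure modes described above.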
3.6 The Full Framework as a Modular Composition
The beauty of ToT is its modularity. For any new problem, you independently choose:
- Thought decomposition — what granularity?
- Thought generator — i.i.d. sampling or sequential proposal?
- State evaluator — independent value or deliberative voting?
- Search algorithm — BFS or DFS? What depth T, breadth b, and sample count k?
These choices are orthogonal and can be mixed-and-matched. This modularity means ToT is not a single algorithm but a framework, and practitioners can adapt it to their specific problem structure.
IV. Experimental Setup and Results
4.1 Game of 24: Setup
- Dataset: 1,362 games from 4nums.com, sorted by difficulty. Test set: 100 hardest games (indices 901–1000).
- Metric: Success rate (valid equation = 24, using each input number exactly once).
- Baselines: IO prompting (5-shot), CoT (5-shot with 3 intermediate equations), CoT-SC (majority of 100 samples), iterative refinement (up to 10 rounds with ground-truth correctness feedback).
- ToT config: Depth T = 3 (three equations), BFS with b = 5, sequential proposal for generation, 3-way value evaluation ("sure/maybe/impossible"), 3 value samples per state.
- Model: GPT-4 (Chat Completion mode, temperature 0.7).
4.2 Game of 24: Results
| Method | Success Rate |
|---|---|
| IO prompting | 7.3% |
| CoT prompting | 4.0% |
| CoT-SC (k = 100) | 9.0% |
| ToT (b = 1) | 45% |
| ToT (b = 5) | 74% |
| IO + Refine (k = 10) | 27% |
| IO best-of-100 | 33% |
| CoT best-of-100 | 49% |
The improvement from CoT (4%) to ToT (74%) is not incremental — it is an 18.5× relative improvement. Even the oracle CoT best-of-100 only reaches 49%, while ToT with b = 5 (visiting far fewer total nodes) achieves 74%. This demonstrates that ToT's structured exploration is fundamentally more efficient than random sampling with selection.
Error analysis is particularly revealing: approximately 60% of CoT samples fail at the very first step (the first arithmetic operation). Once you commit to a bad first operation, the remaining numbers often cannot reach 24. ToT avoids this by evaluating multiple first-step candidates and pruning unpromising ones before committing. This is exactly the "lookahead" capability that left-to-right decoding lacks.
Scaling analysis (Figure 3a in the paper) shows that ToT achieves better performance per node visited than either IO or CoT sampling. At equivalent computational budgets, tree search beats random sampling.
4.3 Creative Writing: Setup and Results
- Dataset: 100 tasks, each requiring a 4-paragraph passage where each paragraph ends with a given random sentence.
- Metrics: GPT-4 coherency score (1–10, average of 5 scores per output), human pairwise preference.
- ToT config: Depth 2 (plan → passage), k = 5 candidates per step, vote-based evaluation (5 votes per step), b = 1 (keep only the best plan/passage).
| Method | GPT-4 Score (avg) |
|---|---|
| IO prompting | 6.19 |
| CoT prompting | 6.93 |
| ToT | 7.56 |
| IO + Refine | 7.67 |
| ToT + Refine | 7.91 |
The gains are smaller (6.19 → 7.56) because creative writing is inherently more subjective and less prone to catastrophic early-decision failures. But human evaluation confirms the pattern: in pairwise comparison, annotators prefer ToT over CoT in 41/100 cases vs. only 21/100 for CoT over ToT (38 tied).
The combination result is notable: ToT + iterative refinement achieves 7.91, better than either alone. This suggests that tree search (exploring multiple plans) and iterative refinement (polishing a single output) are complementary capabilities, not substitutes.
4.4 Mini Crosswords: Setup and Results
- Dataset: 20 games from GooBix (5×5 crosswords), 5 games for prompting.
- Metrics: Letter accuracy (25 per game), word accuracy (10 per game), games solved.
- ToT config: DFS with pruning, up to 10 intermediate steps (one word per step), LM-proposed candidate words for the unfilled clues at each step, state evaluation via per-clue feasibility check ("impossible" → prune), maximum 100 DFS steps per game.
| Method | Letters | Words | Games |
|---|---|---|---|
| IO (10 avg) | 38.7% | 14% | 0/20 |
| CoT (10 avg) | 40.6% | 15.6% | 0/20 |
| ToT | 78% | 60% | 4/20 |
| ToT + best state | 85.4% | 67% | 7/20 |
| ToT − prune | 65.4% | 41% | 1/20 |
| ToT − backtrack | 47.2% | 20% | 0/20 |
The crosswords results are the most instructive for understanding ToT's mechanics:
- Backtracking is essential: Removing backtracking collapses word-level success from 60% to 20%. This directly validates the claim that left-to-right commitment is fundamentally limiting for constraint-satisfaction problems.
- Pruning helps but is imperfect: Removing pruning drops performance substantially (60% → 41% word accuracy). The heuristic has blind spots, though: the evaluator sometimes declares valid words "impossible" because GPT-4 does not know obscure or archaic words (e.g., "agend" as an obsolete form of "agendum"), so pruning can discard branches that contain correct solutions.
- Output heuristics matter: The oracle "best state" setting improves game solutions from 4/20 to 7/20, indicating that the search often finds the right solution but the selection heuristic sometimes picks the wrong final state.
4.5 Computational Budget Analysis
The paper provides a revealing analysis of computation vs. performance:
Game of 24 — Nodes visited vs. success rate (Figure 3a):
- IO best-of-100 visits 100 nodes → 33% success
- CoT best-of-100 visits 100 nodes → 49% success
- ToT (b = 5) visits ~75 nodes → 74% success
ToT achieves higher success with fewer total nodes visited. This is because informed search (evaluating and pruning) is more efficient than uninformed sampling. The evaluation heuristic directs search effort toward promising regions of the solution space, while random sampling wastes budget on unpromising paths.
Cost decomposition per task:
| Task | Generation Calls | Evaluation Calls | Total LM Calls | Relative Cost vs. CoT |
|---|---|---|---|---|
| Game of 24 | ~75 | ~225 | ~300 | ~60× |
| Creative Writing | 10 | 10 | ~20 | ~4× |
| Crosswords | up to 500 | up to 500 | ~1000 | ~200× |
The cost increase is substantial but task-dependent. For Game of 24, the 60× cost increase buys an 18.5× improvement in success rate. For Creative Writing, 4× cost buys a ~22% quality improvement. Whether this is worthwhile depends entirely on the value of solving the task correctly.
4.6 Ablation: Effect of Beam Width
On Game of 24, the paper systematically varies beam width b:
| Beam Width (b) | Success Rate |
|---|---|
| 1 (greedy) | 45% |
| 3 | 65% |
| 5 | 74% |
The improvement from b = 1 to b = 5 is dramatic: exploration matters. But note that even b = 1 (greedy tree search with evaluation but no branching) already reaches 45% — far above CoT's 4%. This suggests that the evaluation component alone provides significant value, even without breadth.
4.7 Detailed Error Analysis for Game of 24
The paper provides error breakdowns that reveal why CoT fails:
- ~60% of CoT failures occur at the very first arithmetic step. The model commits to an operation (e.g., 4 + 5 = 9 on the input {4, 5, 6, 10}) that makes it impossible to reach 24 with the remaining numbers.
- ~20% of failures involve correct intermediate steps but an error in the final step (arithmetic mistakes or failing to use all numbers).
- ~20% of failures involve the model outputting an invalid expression (wrong format, repeated numbers, etc.).
ToT addresses the first category directly: by generating multiple candidates for each step and evaluating them, the model can avoid the catastrophic first-step commitments that doom most CoT attempts.
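The "doomed first step" phenomenon can be checked exactly with a brute-force reachability oracle — an illustration standing in for the LM's sure/maybe/impossible judgment, not the paper's evaluator:

```python
# Exact check: can a multiset of numbers reach 24 with +, -, *, /?
from itertools import combinations

def reachable(nums):
    if len(nums) == 1:
        return abs(nums[0] - 24) < 1e-6
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
        results = {a + b, a * b, a - b, b - a}
        if b: results.add(a / b)
        if a: results.add(b / a)
        if any(reachable(rest + [r]) for r in results):
            return True
    return False

# From {4, 5, 6, 10}: the first step 4 + 5 = 9 leaves {9, 6, 10},
# while 10 - 4 = 6 leaves {6, 5, 6}.
print(reachable([9, 6, 10]))  # False: the first step already doomed this run
print(reachable([6, 5, 6]))   # True: 5 * 6 - 6 = 24
```

A perfect evaluator would label the first state "impossible" and the second "sure"; ToT only needs the LM's judgments to approximate this oracle often enough to steer the beam away from doomed first steps.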
4.8 Concrete Worked Example: Game of 24
To make the framework concrete, consider the input numbers 4, 5, 6, 10.
CoT approach (single chain):
- Step 1: 4 + 5 = 9 (commits immediately)
- Step 2: 9 + 6 = 15 (continues linearly)
- Step 3: 15 + 10 = 25 ≠ 24 → Failure. No way to recover: no combination of {9, 6, 10} reaches 24.
ToT approach (tree search with b = 5):
- Step 1 candidates:
- 5 − 4 = 1 → remaining: {1, 6, 10} → Evaluator: "maybe"
- 10 − 6 = 4 → remaining: {4, 4, 5} → Evaluator: "maybe"
- 6 − 4 = 2 → remaining: {2, 5, 10} → Evaluator: "maybe"
- 10 − 4 = 6 → remaining: {5, 6, 6} → Evaluator: "sure" (5 × 6 − 6 = 24 is reachable)
- Step 2 (expanding the "sure" candidate): 5 × 6 = 30 → remaining: {6, 30} → Evaluator: "sure"
- Step 3: 30 − 6 = 24 ✓ → Success!
The key difference: ToT can evaluate multiple first steps, identify that 10 − 4 = 6 leaves a promising state (three numbers that can reach 24), and commit to that path. CoT's linear commitment to 4 + 5 = 9 is a dead end from which there is no recovery.
V. Discussion and Insights
5.1 When ToT Helps Most
ToT adds value specifically when:
- The task requires exploration of multiple alternatives
- Early decisions significantly constrain later options
- Intermediate progress can be meaningfully evaluated
- The task is combinatorial or constraint-based
For straightforward QA, classification, summarization, or fluent text generation, standard prompting or CoT is both cheaper and sufficient. The paper itself notes (Appendix B.1) that ToT is unnecessary for tasks where GPT-4 already performs well with standard prompting.
5.2 Evaluation Quality as the Ceiling
The entire framework depends on the LM's ability to evaluate partial solutions. If evaluation is poor, the search degenerates into random exploration. The crosswords results demonstrate this directly — the evaluator sometimes prunes correct solutions because it does not recognize obscure words. For domains where intermediate-state evaluation is inherently difficult (e.g., long-horizon planning, creative endeavors with unclear success criteria), ToT's advantage may diminish. The quality of the evaluation heuristic is the ceiling for ToT's search effectiveness.
5.3 The Evaluation-Generation Asymmetry
A subtle but important observation: for ToT to work, the LM must be better at evaluating partial solutions than at generating correct ones directly. If the model could generate the right answer in one shot, there would be no need for search. If the model cannot evaluate partial solutions better than chance, search provides no benefit.
The Game of 24 results suggest this asymmetry exists: GPT-4 solves only 4% of problems via direct generation (CoT), but its 3-way value evaluations are accurate enough to guide search to 74% success. This means GPT-4's "recognition" of good intermediate states significantly exceeds its "generation" of correct solution chains.
This evaluation-generation asymmetry is likely task-dependent. For tasks where evaluation is as hard as generation (e.g., some creative tasks where "is this good?" is as hard as "make something good"), the ToT advantage would diminish.
5.4 Relation to Modern Inference-Time Compute Scaling
Since this paper's publication, several systems have incorporated ToT-style ideas:
- OpenAI o1/o3 uses internal chain-of-thought with what appears to be search-like exploration during inference
- Anthropic's extended thinking similarly allocates more inference compute for hard problems
- DeepSeek-R1 and other reasoning models use reinforcement learning to train models that naturally explore multiple reasoning paths
- LATS (Language Agent Tree Search) combines ToT with Monte Carlo Tree Search for agentic tasks
ToT can be seen as an important conceptual precursor to the "inference-time compute scaling" paradigm — the idea that harder problems should receive more computation at inference time, not just more parameters.
5.5 Connection to Planning in AI
ToT directly connects the classical AI planning tradition with modern LLM capabilities. Classical planning systems (STRIPS, HTN planners) explicitly search through state spaces with backtracking. ToT achieves something analogous by using the LLM as both the state-transition function and the heuristic evaluator. The framework demonstrates that structured search remains valuable even when the underlying "reasoning engine" is a neural network rather than a symbolic system.
5.6 Comparison with Monte Carlo Tree Search (MCTS)
ToT bears a structural resemblance to Monte Carlo Tree Search (MCTS), the algorithm behind AlphaGo. Both involve:
- Tree-structured exploration of possible moves/thoughts
- Evaluation of intermediate states
- Backtracking and branch pruning
However, there are key differences:
| Aspect | MCTS (AlphaGo) | ToT |
|---|---|---|
| State evaluator | Learned neural network + rollouts | LLM self-evaluation via prompting |
| Search policy | UCB1 balancing exploration/exploitation | BFS with top-b beam or DFS with pruning |
| Training | Value/policy networks trained on games | No training, purely inference-time |
| State space | Well-defined (board positions) | Open-ended (natural language) |
| Rollout depth | Full game simulation | Limited by thought depth |
The LATS paper (Zhou et al., 2024) later bridged this gap by combining ToT with MCTS-style value estimation for agentic tasks, demonstrating that the two approaches are complementary.
5.7 The Role of Temperature and Sampling
ToT's effectiveness depends on generating diverse candidate thoughts. For i.i.d. sampling, this diversity comes from the sampling temperature. Higher temperature produces more diverse but potentially lower-quality candidates. The paper uses temperature 0.7 for Game of 24, which balances diversity against coherence.
For sequential proposal, diversity is controlled by the prompt design (asking the model to enumerate multiple distinct options). The quality of the prompt directly affects the diversity and quality of proposals. This highlights that ToT's performance is a joint function of the search algorithm, the evaluation quality, and the generation diversity.
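The effect of temperature on next-token diversity is easy to show directly. The sketch below applies temperature-scaled softmax to a toy logit vector (the logits are illustrative; a real model produces one such vector per generation step):

```python
# Temperature reshapes the next-token distribution: dividing logits by
# T > 1 flattens it (more diverse samples), T < 1 sharpens it.
import math

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.7))  # sharper than T = 1
print(softmax_with_temperature(logits, 2.0))  # flatter: more diversity
```

At the paper's setting of 0.7, the distribution is mildly sharpened relative to T = 1 — enough coherence to keep candidates valid while still leaving probability mass on alternatives for i.i.d. sampling to explore.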
VI. Limitations and Boundary Conditions
6.1 Computational Cost
ToT is significantly more expensive than standard prompting. Each step involves generating candidates and evaluating each (potentially multiple times). For Game of 24 with b = 5: roughly 5 proposals × 3 value evaluations × 3 depth levels × 5 breadth ≈ 225 LM calls per game. For crosswords: up to 100 DFS steps, each with 5 proposals + evaluations = potentially 1000+ LM calls per game. In production settings with latency and cost constraints, this overhead is nontrivial.
6.2 Task Scope
The three evaluation tasks, while well-chosen, are relatively small-scale and self-contained. Game of 24 has only 4 numbers, Creative Writing is 4 paragraphs, and Crosswords is 5×5. The paper does not demonstrate ToT on tasks requiring dozens of reasoning steps, integration with external tools, or real-time interaction. Scaling ToT to open-ended, multi-step tasks with large state spaces remains an open question.
6.3 No Learning or Adaptation
ToT uses a frozen pretrained LM with no training or adaptation. The thought generator, evaluator, and search procedure do not improve from experience. Each problem is solved from scratch. This is both a strength (no training needed, purely inference-time) and a limitation (no ability to learn from solved problems, improve heuristics, or adapt search strategies).
6.4 Prompt Engineering Overhead
Each task requires careful prompt engineering for both thought generation and state evaluation. The "propose prompt" for Game of 24, the "vote prompt" for Creative Writing, and the confidence-based proposal prompt for Crosswords are all task-specific designs. Generalizing these prompts to new tasks without per-task tuning is an open challenge.
6.5 Evaluation Heuristic Failures
The crosswords ablation reveals that the LM evaluator can incorrectly prune valid solutions. This is a fundamental limitation: the search is only as good as its heuristic. For domains with vocabulary or knowledge gaps (e.g., specialized scientific reasoning, rare word puzzles), the evaluator's blind spots directly translate into missed solutions.
VII. Broader Research Context
7.1 Position in the LLM Reasoning Landscape
ToT arrived at a critical moment in LLM reasoning research. Chain-of-thought prompting had shown that intermediate reasoning steps improve performance, but the community was beginning to recognize that a single reasoning chain is insufficient for hard problems. ToT formalized the next step: structured search over multiple reasoning paths. It opened the door to viewing LLM inference as a planning problem amenable to classical AI techniques.
7.2 The Inference-Time Compute Paradigm
Perhaps the most lasting impact of ToT is its contribution to the idea that more computation at inference time can substitute for more parameters or more training data. By spending more LM calls on a hard problem (through search), ToT achieves dramatic improvements without changing the model. This principle has been validated and extended by subsequent work on reasoning models (o1, DeepSeek-R1) that internalize search-like behavior through training.
7.3 From Prompting to Architecture
ToT operates entirely through prompting — it requires no model modifications. This is both its accessibility strength and a potential limitation. Later work has explored whether ToT-style search can be "compiled" into model weights through training, potentially achieving similar exploration benefits at lower inference cost.
VIII. Reproducibility Assessment
8.1 Code and Data
The official code repository (https://github.com/princeton-nlp/tree-of-thought-llm) is publicly available and includes all prompts used in the experiments. This is a significant strength for reproducibility.
8.2 Challenges
Exact reproduction is challenging because the experiments use the GPT-4 API (May 2023 version), and OpenAI's models have been updated since. Results with current GPT-4 or GPT-4-turbo may differ. The paper does not report variance across runs, making stability assessment difficult.
8.3 Cost Estimates
- Game of 24: ~225 LM calls per game → ~$675 for the 100-game test set at 2023 GPT-4 pricing
- Crosswords: Up to 1000+ LM calls per game → significantly more expensive
- Creative Writing: ~20 LM calls per task → cheapest of the three
8.4 Practical Deployment Guidance
When to use ToT:
- Math/logic puzzles, constraint satisfaction, planning problems
- Code generation where multiple approaches should be explored
- Creative tasks where plan-level exploration beats output-level refinement
How to adapt ToT to a new task:
- Define the thought granularity (what constitutes one step?)
- Design a generation prompt (propose or sample?)
- Design an evaluation prompt (value or vote?)
- Choose search algorithm (shallow → BFS; deep → DFS)
- Tune hyperparameters (T, b, k) on a small validation set
- Monitor evaluation accuracy — if the evaluator is wrong >50% of the time, ToT will not help
IX. Summary and Significance
Tree of Thoughts is one of those papers where the core idea is deceptively simple but the implications run deep. The contribution: instead of forcing a language model to commit to a single left-to-right reasoning chain, let it explore multiple reasoning paths organized as a tree, evaluate partial progress via self-assessment, and backtrack when things go wrong.
Core Value Propositions:
- A general framework that subsumes IO, CoT, and CoT-SC as special cases within a unified search paradigm
- Dramatic performance improvements on tasks requiring exploration: 4% → 74% on Game of 24
- Demonstration that structured search is fundamentally more efficient than random sampling with selection
- The LM itself serves as both state generator and heuristic evaluator, eliminating the need for task-specific search infrastructure
Key Lessons for the Research Community:
- Left-to-right commitment is a genuine structural limitation for combinatorial and constraint-satisfaction tasks
- The quality of intermediate-state evaluation is the ceiling for any search-based reasoning framework
- Structured exploration (tree search) and iterative refinement (polishing) are complementary, not substitutes
- Inference-time compute scaling via search is a viable alternative to scaling model parameters
The practical payoff is dramatic: on the Game of 24 benchmark, GPT-4 with chain-of-thought prompting solves only 4% of problems, while ToT pushes that to 74%. ToT gives LLMs something closer to deliberate "System 2" thinking — the slow, conscious problem-solving mode that humans use for hard tasks — and in doing so, it established the conceptual foundation for the inference-time compute scaling paradigm that has since become central to LLM reasoning research.
References
- Wei, J. et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022.
- Wang, X. et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR 2023.
- Yao, S. et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023.
- Kahneman, D. (2011). Thinking, Fast and Slow.
- Newell, A., Shaw, J.C., & Simon, H.A. (1959). "Report on a General Problem-Solving Program."
- Hao, S. et al. (2023). "Reasoning with Language Model is Planning with World Model." EMNLP 2023.
- Shinn, N. et al. (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023.
- Silver, D. et al. (2017). "Mastering the Game of Go without Human Knowledge." Nature.
- Zhou, A. et al. (2024). "Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models." ICML 2024.
- Yao, S. et al. (2023). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." NeurIPS 2023.
This review was completed by Zhongzhu Zhou on February 16, 2026.