Tree of Thoughts: What happens when you let a model explore before committing


Reflexion gave agents memory — the ability to learn from failure across attempts. But it could only learn backward: try, fail, reflect, try again. It never asked “what if I explored multiple paths before committing to one?” Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023.

Tree of Thoughts does exactly that. Instead of generating one reasoning chain and hoping it’s right, ToT treats problem-solving as search — generating multiple candidate thoughts at each step, evaluating which ones are worth pursuing, and pruning the rest. The model doesn’t just think. It deliberates. Same research group again — Shunyu Yao and Karthik Narasimhan co-author ToT, ReAct, and Reflexion. The papers form a deliberate progression: CoT (reason) → ReAct (act) → Reflexion (learn from failure) → ToT (search before committing).


What Tree of Thoughts actually does

ToT frames every problem as search over a tree of states, where each state is the input plus all reasoning accumulated so far. At each step, two things happen: a generator proposes candidate next thoughts, and an evaluator — the model itself — scores how promising each candidate is. Then a search algorithm (BFS or DFS) decides which branches to explore and which to prune.

The key architectural move is that the same model plays all three roles: generator, evaluator, and final answer producer. This is what separates ToT from CoT, where the model generates left-to-right in a single pass with no ability to explore alternatives or recover from early mistakes. The framing draws on cognitive science — Daniel Kahneman’s “System 1” (fast, associative) vs. “System 2” (slow, deliberate) from Thinking, Fast and Slow (2011). Standard LLM decoding is System 1. ToT augments it with System 2 deliberation: generate options, evaluate them, choose strategically.

The visualization below walks through a complete BFS search on the Game of 24 — watch how candidates are generated, evaluated as “sure,” “maybe,” or “impossible,” and pruned at each level.

[Interactive visualization: BFS on the Game of 24, input 4 9 10 13. At each level the generator proposes candidate equations (13 − 9 = 4 → [4, 4, 10], 10 − 4 = 6 → [6, 9, 13], 4 + 9 = 13 → [10, 13, 13], …), the evaluator scores them, and all level-1 survivors are expanded simultaneously until 4 × 6 = 24 completes a solution. Step 0 of 5: starting position — four numbers, reach 24 using each number once.]
The formal intuition — from IO to ToT

The paper formalizes each prompting method as a sampling distribution over the same model p_θ. Each method generalizes the previous one:

IO prompting — sample one output directly: y ∼ p_θ(y | x).

Chain-of-thought — sample a reasoning chain, then the answer: z_1, …, z_n ∼ p_θ(z_i | x, z_1, …, z_{i−1}), then y ∼ p_θ(y | x, z_1, …, z_n).

CoT-SC — sample k chains independently, then take a majority vote over the final answers: y = arg max_y #{i : y^(i) = y}.

Tree of Thoughts — search over a tree of states with generation + evaluation + pruning. A state s = [x, z_1, …, z_i] is the input plus thoughts so far. At each step, the generator G(p_θ, s, k) proposes k candidate next thoughts, and the evaluator V(p_θ, S) scores each resulting state.

BFS keeps the b best states per level. DFS goes deep, backtracking when the evaluator scores a state below a threshold v_th. IO, CoT, and CoT-SC are all special cases of ToT — trees of limited depth and breadth, with no evaluation.
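The BFS variant fits in a few lines of Python. This is a minimal sketch, not the paper's implementation: `propose` and `value` are hypothetical stand-ins for the generator and evaluator model calls, and the toy demo swaps in plain functions so the search skeleton runs on its own.

```python
# Minimal ToT-BFS skeleton. A state is the input plus the thoughts
# chosen so far; `propose` and `value` are hypothetical stand-ins
# for the generator and evaluator model calls.

def tot_bfs(root, propose, value, steps, breadth):
    """Expand every surviving state, score the candidates, and keep
    only the `breadth` best states at each level."""
    frontier = [root]
    for _ in range(steps):
        candidates = [state + [thought]
                      for state in frontier
                      for thought in propose(state)]
        candidates.sort(key=value, reverse=True)
        frontier = candidates[:breadth]          # prune the rest
    return max(frontier, key=value)

# Toy demo: pick 4 digits from {1, 2, 3} whose sum is 10. The
# evaluator rewards states whose running sum is close to the target.
best = tot_bfs(root=["input"],
               propose=lambda s: [1, 2, 3],
               value=lambda s: -abs(10 - sum(s[1:])),
               steps=4, breadth=5)
print(sum(best[1:]))  # → 10
```

The same skeleton covers Game of 24 by making `propose` the one-shot propose prompt and `value` the sampled sure/likely/impossible scorer.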


Generator, evaluator, search

The generator produces candidate next thoughts. Two strategies: sample i.i.d. from a CoT prompt (good when the thought space is rich, like creative writing), or propose sequentially in a single context (good when thoughts are constrained, like one equation at a time in Game of 24). The propose strategy uses a remarkably minimal prompt — one example showing the format, then the current state. All search logic lives in Python, not in the prompt. The model just generates candidates; the orchestrator decides what to do with them.

the generator · propose prompt · Game of 24
One-shot prompt — candidates generated together in one pass to avoid duplicates
Propose prompt

Input: 2 8 8 14
Possible next steps:
2 + 8 = 10 (left: 8 10 14)
8 / 2 = 4 (left: 4 8 14)
14 + 2 = 16 (left: 8 8 16)
2 * 8 = 16 (left: 8 14 16)
8 - 2 = 6 (left: 6 8 14)
14 - 8 = 6 (left: 2 6 8)
14 / 2 = 7 (left: 7 8 8)
14 - 2 = 12 (left: 8 8 12)

Input: 4 9 10 13
Possible next steps:
Candidates
4+9=13 (left: 10 13 13)
13−9=4 (left: 4 4 10)
10−4=6 (left: 6 9 13)
9−4=5 (left: 5 10 13)
4×9=36 (left: 10 13 36)
13−10=3 (left: 3 4 9)
10+9=19 (left: 4 13 19)
10+4=14 (left: 9 13 14)

The propose prompt is one-shot — a single worked example followed by the current input. All candidates are generated together in one pass, which avoids the duplicate proposals that independent sampling would produce.
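In code, the propose step is just string assembly plus one model call. A sketch, assuming a hypothetical `call_model` client (not defined here); the example text mirrors the prompt shown above.

```python
# Sketch of the propose step. The one-shot example mirrors the prompt
# shown above; `call_model` (not defined here) is a hypothetical LLM
# client that returns a single completion string.

PROPOSE_EXAMPLE = """\
Input: 2 8 8 14
Possible next steps:
2 + 8 = 10 (left: 8 10 14)
8 / 2 = 4 (left: 4 8 14)
14 + 2 = 16 (left: 8 8 16)
2 * 8 = 16 (left: 8 14 16)
8 - 2 = 6 (left: 6 8 14)
14 - 8 = 6 (left: 2 6 8)
14 / 2 = 7 (left: 7 8 8)
14 - 2 = 12 (left: 8 8 12)
"""

def propose_prompt(numbers):
    """One-shot: the worked example, then the current state. All
    candidates come back in one completion, avoiding the duplicates
    that independent sampling would produce."""
    current = " ".join(str(n) for n in numbers)
    return f"{PROPOSE_EXAMPLE}\nInput: {current}\nPossible next steps:\n"

def parse_candidates(completion):
    """Each non-empty line of the completion is one candidate move."""
    return [line.strip() for line in completion.splitlines() if line.strip()]

prompt = propose_prompt([4, 9, 10, 13])
# candidates = parse_candidates(call_model(prompt))  # one model call
```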

The evaluator scores how promising each state is. Again, two strategies: a value prompt that asks the model to rate each state independently (“sure / likely / impossible”), or a vote prompt that shows all candidates side-by-side and asks the model to pick the best one. Value works when you can judge states in isolation. Voting works when comparison is easier than absolute scoring.

the evaluator · value prompt · Game of 24
Few-shot prompt — LM reasons then outputs sure / likely / impossible — sampled 3× per node
Value prompt
Evaluate if given numbers can reach 24 (sure / likely / impossible)

10 14
10 + 14 = 24
sure

11 12
11 + 12 = 23
12 - 11 = 1
11 * 12 = 132
11 / 12 = 0.91
impossible

4 4 10
4 + 4 + 10 = 18
4 * 10 - 4 = 36
(10 - 4) * 4 = 24
sure

5 7 8
5 + 7 + 8 = 20
(8 - 5) * 7 = 21
I cannot obtain 24 now, but numbers are within a reasonable range
likely

1 3 3
1 * 3 * 3 = 9
(1 + 3) * 3 = 12
1 3 3 are all too small
impossible

... (3 more examples)
Current state: 4 6

LM output: 4 * 6 = 24 → sure

Sampled 3×: sure, sure, sure → 20 + 20 + 20 = 60 → verdict: sure. On the log score scale from 0.003 (impossible) through 20 (one sure) to 60, that score sits at the very top.

With only two numbers left, the evaluator finds 4 × 6 = 24 immediately. Three “sure” votes is the highest possible confidence, so this branch is prioritized.

The evaluator samples three times per state and aggregates the scores (sure = 20, likely = 1, impossible = 0.001). This is a crude but effective proxy for confidence — a state scored “sure” by all three samples is overwhelmingly preferred over one scored “likely” twice and “impossible” once. The scoring weights are hand-tuned. The paper doesn’t explore how sensitive results are to these specific values — whether a different weighting (say, sure = 10, likely = 2) would change the outcome is left as an implicit assumption.
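The aggregation step itself is tiny. A sketch using the hand-tuned weights above; `verdicts` would come from three samples of the value prompt.

```python
# Score aggregation: the value prompt is sampled three times per
# state, verdicts map to hand-tuned weights, and the sums are
# compared. The weights make the scale effectively logarithmic.

WEIGHTS = {"sure": 20.0, "likely": 1.0, "impossible": 0.001}

def state_score(verdicts):
    """Sum the weights of the verdicts sampled for one state
    (three samples per state in the paper's setup)."""
    return sum(WEIGHTS[v] for v in verdicts)

print(state_score(["sure", "sure", "sure"]))  # → 60.0
# A single "sure" outweighs any mix of weaker verdicts:
assert state_score(["sure", "impossible", "impossible"]) > \
       state_score(["likely", "likely", "likely"])
```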

score aggregation · sure = 20 · likely = 1 · impossible = 0.001

[Interactive scale: aggregated scores fall into prune / keep (low priority) / keep (high priority) bands. On the log scale from 0.003 to 60, one “sure” vote (20) outweighs 20,000 “impossible” votes (0.001 each).]

Three tasks, three search shapes

ToT’s modularity means the same framework adapts to very different problems by changing what a “thought” is, how the evaluator scores, and which search algorithm navigates the tree. Game of 24 — which the visualizations above already walk through — uses short, structured thoughts (one equation), BFS to explore wide, and value scoring to rate each state independently. But the paper tests two more tasks that push the framework in very different directions.

Mini Crosswords requires DFS instead of BFS. Each thought is one word to fill in (5–10 per game), and the tree is too deep for breadth-first exploration. DFS commits to the most promising word, goes deeper, and backtracks when the evaluator deems a remaining clue “impossible” given the current letter constraints. The key insight: filling one row constrains every column it crosses — and the evaluator catches those conflicts before the search wastes compute going deeper. DFS with backtracking is what makes Crosswords work at all. Without it, word accuracy drops from 60% to 20%. The model can’t see 5 steps ahead, so it inevitably picks words that conflict with later clues. Backtracking lets it recover from those mistakes — the same principle as Reflexion, but within a single episode rather than across trials.
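The depth-first configuration is the same skeleton with recursion in place of a frontier. Again a sketch with hypothetical `propose` and `value` model wrappers; the threshold plays the role of the paper's “impossible” verdict on a remaining clue.

```python
# Sketch of ToT-DFS with value-based pruning and backtracking, the
# configuration used for Mini Crosswords. `propose` and `value` are
# hypothetical model wrappers.

def tot_dfs(state, propose, value, is_solution, threshold,
            depth=0, max_depth=10):
    """Commit to the most promising candidate, go deeper, and
    backtrack when a branch is pruned or dead-ends."""
    if is_solution(state):
        return state
    if depth >= max_depth:
        return None
    for cand in sorted(propose(state), key=value, reverse=True):
        if value(cand) < threshold:
            continue                 # pruned: evaluator says "impossible"
        found = tot_dfs(cand, propose, value, is_solution,
                        threshold, depth + 1, max_depth)
        if found is not None:
            return found
        # fall through: backtrack and try the next candidate
    return None

# Toy demo: build the string "aba" one letter at a time. The dead-end
# "aa…" branch is explored first, forcing the backtracking path.
found = tot_dfs("", propose=lambda s: [s + "a", s + "b"],
                value=len, is_solution=lambda s: s == "aba",
                threshold=0, max_depth=3)
print(found)  # → aba
```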

Creative Writing is the surprising case. The input is 4 random sentences; the output must be a coherent 4-paragraph passage ending in those sentences. You wouldn’t use CoT for this — there’s no “reasoning chain” that leads to a coherent essay. But ToT reframes it: a thought is an entire writing plan, not a single step. The tree has depth 2 — generate 5 plans, vote on the best one, then generate 5 passages from the winning plan, vote again. The evaluator uses voting instead of value scoring because coherence is easier to judge comparatively (“which plan is better?”) than absolutely (“rate this plan 1–10”). This is the first time in the series where the evaluator compares candidates side by side rather than scoring them independently. The voting prompt shows all 5 plans together and asks the model to pick the best one — a “step-wise self-consistency” strategy, as the paper calls it.
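The vote evaluator is mostly glue: show all candidates in one prompt, sample the vote several times, tally. A sketch — the prompt wording is a paraphrase rather than the paper's exact text, and `call_model` would be the actual LLM client.

```python
# Sketch of the vote evaluator (step-wise self-consistency). The
# prompt wording is a paraphrase, not the paper's exact text.

import re
from collections import Counter

def vote_prompt(candidates):
    """Show all candidates side by side and ask for the best one."""
    body = "\n\n".join(f"Choice {i + 1}:\n{c}"
                       for i, c in enumerate(candidates))
    return (f"Given the {len(candidates)} writing plans below, decide "
            f'which is most promising. Conclude with "The best choice '
            f'is {{number}}."\n\n{body}')

def tally_votes(completions, n_candidates):
    """Count which candidate each sampled vote picked (0-indexed)."""
    counts = Counter()
    for text in completions:
        match = re.search(r"best choice is (\d+)", text, re.IGNORECASE)
        if match and 1 <= int(match.group(1)) <= n_candidates:
            counts[int(match.group(1)) - 1] += 1
    return counts

# Five sampled votes over five plans; plan 2 (index 1) wins 3-1-1.
votes = ["...so the best choice is 2"] * 3 + \
        ["The best choice is 1", "The best choice is 5"]
print(tally_votes(votes, 5).most_common(1))  # → [(1, 3)]
```

The winning plan then seeds the next level of the tree — 5 passages, voted on the same way.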

[Interactive visualization: DFS on a 5×5 Mini Crossword. Step 0 of 6: an empty 5×5 grid with 10 clues (5 across, 5 down). Each word you fill constrains the letters available for crossing words; DFS proposes candidates for h1 first.]

Across
h1. Duties; things to do
h2. Engine; drives action
h3. Imposing; magnificent
h4. Beauty parlor
h5. Daring feat

Down
v1. To heap; accumulate
v2. Place to practice
v3. Sly; crafty
v4. Fasteners
v5. Heavenly bodies

The pattern: shallow trees with clear evaluation → BFS + value scoring. Shallow trees with subjective evaluation → BFS + voting. Deep trees with constraint satisfaction → DFS + backtracking. The framework is the same; the configuration changes everything.


4% to 74% — what search unlocks

The paper tests ToT on three tasks designed to challenge GPT-4 — not borrowed from existing benchmarks but built specifically to require planning and search. Game of 24: use four numbers and basic arithmetic (+, −, ×, ÷) to reach 24. Example: input 4 9 10 13, solution (10 − 4) × (13 − 9) = 24. The paper tests on 100 hard games (indices 901–1000 from a set of 1,362 sorted by human solve time). Success = a valid equation that equals 24.
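That success criterion is mechanical enough to check in a few lines: a valid arithmetic expression over exactly those four numbers that evaluates to 24. An illustrative checker, not the paper's evaluation code:

```python
# Illustrative checker for the Game of 24 success criterion: the
# expression must use each input number exactly once and equal 24.

import ast
import re

def is_valid_24(expr, numbers):
    """True if `expr` uses each input number exactly once and equals 24."""
    if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
        return False                       # arithmetic characters only
    if sorted(int(n) for n in re.findall(r"\d+", expr)) != sorted(numbers):
        return False                       # each number used exactly once
    try:
        value = eval(compile(ast.parse(expr, mode="eval"), "<expr>", "eval"),
                     {"__builtins__": {}})
    except (SyntaxError, ZeroDivisionError):
        return False
    return abs(value - 24) < 1e-6

print(is_valid_24("(10 - 4) * (13 - 9)", [4, 9, 10, 13]))  # → True
print(is_valid_24("10 + 4 + 13 - 9", [4, 9, 10, 13]))      # → False (equals 18)
```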

On Game of 24, the results are not marginal — they’re a different capability regime. CoT manages 4%. CoT with self-consistency (100 samples, majority vote) reaches 9%. ToT with b=5 achieves 74%. CoT-SC (Wang et al., 2022) samples k independent CoT chains and takes a majority vote on the final answer. It’s the strongest baseline because it explores multiple paths. But it explores them blindly — no evaluation, no pruning, no backtracking. ToT uses fewer completion tokens than CoT-SC (k=100), though it costs more overall due to repeated prompt context at each search step.

The visualization below puts CoT-SC and ToT side by side on the same input to show why brute-force sampling can’t match guided search.

[Interactive visualization: same input, different strategies — CoT-SC and ToT both receive the same four numbers and are stepped through side by side.]

On Creative Writing — write a coherent passage where four random sentences must appear as paragraph endings — ToT scores 7.56/10 on GPT-4 coherence ratings versus 6.93 for CoT and 6.19 for IO. In blind human comparisons, readers preferred ToT over CoT in 41 of 100 pairs versus 21 that preferred CoT. Creative Writing uses a different ToT shape: depth 2 (plan → passage), breadth 5 at the plan level but b=1 at the passage level. The model generates 5 plans, votes for the best one, then generates 5 passages from the winning plan and votes again. The “tree” is more of a tournament bracket than a deep search.

On 5×5 Mini Crosswords — the deepest search problem, with 5–10 thought steps per game — CoT achieves 15.6% word accuracy. ToT reaches 60%, solving 4 out of 20 full games. The ablations here are revealing: removing backtracking collapses word accuracy from 60% to 20%. Removing pruning drops it to 41.5%. Recovery from mistakes is the mechanism doing the real work, not just the extra LLM calls. The ablation structure here mirrors a controlled experiment: ToT with backtracking (60%), ToT without backtracking (20%), ToT without pruning (41.5%). The 40-point drop from removing backtracking alone tells you where the real leverage is — recovery from mistakes, not just evaluation quality.


The cost of deliberation

Search is not free. ToT uses roughly 5–100× more tokens than CoT depending on the task. On Game of 24, that’s ~$0.74 per task versus ~$0.003 for a single CoT chain. The paper acknowledges this but frames it carefully: ToT uses comparable tokens to CoT-SC with 100 samples (~$0.47), yet dramatically outperforms it. The compute isn’t wasted — it’s spent more intelligently. The paper reports average token counts per task (Table 1) but not dollar costs. We derived costs using 2023 GPT-4 8K pricing ($0.03/1K prompt tokens, $0.06/1K completion tokens). ToT costs more per token than CoT-SC despite using fewer completion tokens because each search step re-sends the full context — prompt tokens dominate and they accumulate with every generator and evaluator call.
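The arithmetic behind those dollar figures is easy to reproduce. A sketch using 2023 GPT-4 8K pricing; the token counts below are illustrative placeholders, not the paper's Table 1 values.

```python
# Cost model under 2023 GPT-4 8K pricing: $0.03 per 1K prompt tokens,
# $0.06 per 1K completion tokens. Token counts are illustrative
# placeholders, NOT the paper's Table 1 numbers.

def gpt4_cost(prompt_tokens, completion_tokens):
    return prompt_tokens / 1000 * 0.03 + completion_tokens / 1000 * 0.06

# CoT-SC's contexts stay short; ToT re-sends the accumulated context
# on every generator and evaluator call, so its prompt tokens balloon
# even when its completion tokens are fewer.
cot_sc = gpt4_cost(prompt_tokens=1_000, completion_tokens=10_000)
tot    = gpt4_cost(prompt_tokens=22_000, completion_tokens=4_000)
assert tot > cot_sc            # fewer completion tokens, higher cost
```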

[Interactive chart: success rate (0–100%) vs. cost per task ($0.001–$1.00, log scale) on Game of 24 — higher and cheaper is better. Points: IO, CoT, CoT-SC (k=100), oracle baselines IO best@100 and CoT best@100, ToT (b=1), and the main result ToT (b=5).]

Even with an oracle that picks the single correct chain from 100 CoT samples, you only reach 49% — and at the same token cost as ToT. ToT b=5 reaches 74% with no oracle needed. The question isn’t whether search costs more. It’s whether the problems you’re solving are worth spending tokens on. The GPT-3.5 results (buried in the appendix) are telling: ToT with GPT-3.5 on Game of 24 drops to just 4% — no better than CoT with GPT-4. The evaluator’s ability to distinguish “sure” from “impossible” depends on model capability in ways the paper doesn’t fully explore.


The bigger picture

Looking back from 2026, ToT’s most important contribution isn’t the search algorithm — BFS and DFS are textbook techniques. It’s the demonstration that a single language model can play multiple functional roles simultaneously: generator, evaluator, and critic. Once you externalize those roles into separate agents that communicate, you’re essentially building a multi-agent system. ToT is the single-agent precursor to that decomposition. The connection to inference-time compute is direct. OpenAI’s o1 and o3 models use a similar principle — spending more compute at inference time to improve reasoning — but shift the search from explicit prompt-level orchestration (ToT’s approach) to implicit search trained via reinforcement learning. The insight that “more thinking time = better answers” is the same; the implementation moved inside the model.

ToT showed that a single model can generate, evaluate, and search over its own reasoning. That pattern — one system playing multiple deliberative roles — became a key precursor to the multi-agent architectures that followed.

The arc from CoT to ReAct to Reflexion to ToT traces a progression of capabilities: scratch space → action → memory → search. Each paper adds one dimension of problem-solving. And each one surfaces the same underlying question: how much can you trust the model to evaluate its own reasoning?


Open question

The results are impressive — but they share a pattern. Every task has a clear success criterion: does the equation equal 24? Does the word fit the crossword constraints? Even Creative Writing, the most subjective task, gets evaluated through side-by-side comparison where “better” is relatively judgeable. The paper never ventures into territory where the evaluator has no ground to stand on — tasks where “good” is genuinely ambiguous, where thoughts resist clean decomposition, or where the search space is too open-ended for scoring to mean anything. Those are exactly the conditions that broke Reflexion on WebShop. The evaluator is the most novel and most fragile part of ToT. On Creative Writing — the closest thing to an open-ended task in the paper — the gap between ToT and baselines is smallest. That’s not a coincidence: when “good” becomes subjective, the evaluator loses its leverage.

That blind spot matters more in hindsight. The GPT-3.5 results buried in the appendix show that ToT’s evaluator needs frontier-model capability to work at all — ToT with GPT-3.5 on Game of 24 scores 4%, no better than single-chain CoT with GPT-4. If the evaluator requires that level of capability, the question becomes whether the model can simply internalize the search process itself. OpenAI’s o1 and o3 suggest it can — spending more inference-time compute on internal deliberation rather than external orchestration. If that’s the direction, ToT becomes a historical proof of concept: it showed search works, and then the search moved inside the model. If explicit, auditable search retains value — for interpretability, for debugging, for domains where you need to see the reasoning — it becomes the foundation.

With ToT, the foundational arc is complete: scratch space (CoT), action (ReAct), memory (Reflexion), search (ToT). What comes next is decomposition — splitting these roles across multiple agents that coordinate, debate, and build on each other’s reasoning.

References
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023. arXiv:2305.10601 (submitted 17 May 2023).

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366 (submitted 20 Mar 2023).

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629 (submitted 6 Oct 2022).

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903 (submitted 28 Jan 2022).

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023. arXiv:2203.11171 (submitted 21 Mar 2022).

Yao, S., Chen, H., Yang, J., & Narasimhan, K. (2022). WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. NeurIPS 2022. arXiv:2207.01206 (submitted 4 Jul 2022).