Tree of Thoughts: What happens when you let a model explore before committing


Reflexion gave agents memory — the ability to learn from failure across attempts. But it could only learn backward: try, fail, reflect, try again. It never asked “what if I explored multiple paths before committing to one?” Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023.

Tree of Thoughts does exactly that. Instead of generating one reasoning chain and hoping it’s right, ToT treats problem-solving as search — generating multiple candidate thoughts at each step, evaluating which ones are worth pursuing, and pruning the rest. The model doesn’t just think. It deliberates. Same research group again — Shunyu Yao and Karthik Narasimhan co-author ToT, ReAct, and Reflexion. The papers form a deliberate progression: CoT (reason) → ReAct (act) → Reflexion (learn from failure) → ToT (search before committing).


What Tree of Thoughts actually does

ToT frames every problem as search over a tree of states, where each state is the input plus all reasoning accumulated so far. At each step, two things happen: a generator proposes candidate next thoughts, and an evaluator — the model itself — scores how promising each candidate is. Then a search algorithm (BFS or DFS) decides which branches to explore and which to prune.

The key architectural move is that the same model plays all three roles: generator, evaluator, and final answer producer. This is what separates ToT from CoT, where the model generates left-to-right in a single pass with no ability to explore alternatives or recover from early mistakes. The framing draws on cognitive science — Daniel Kahneman’s “System 1” (fast, associative) vs. “System 2” (slow, deliberate) from Thinking, Fast and Slow (2011). Standard LLM decoding is System 1. ToT augments it with System 2 deliberation: generate options, evaluate them, choose strategically.

The visualization below walks through a complete BFS search on the Game of 24 — watch how candidates are generated, evaluated as “sure,” “maybe,” or “impossible,” and pruned at each level.

[Interactive visualization: BFS on the Game of 24, input 4 9 10 13. At each level the generator proposes candidate equations (13 − 9 = 4 → [4, 4, 10], 10 − 4 = 6 → [6, 9, 13], 4 + 9 = 13 → [10, 13, 13], …), the evaluator scores them, and all level-1 survivors are expanded simultaneously until 4 × 6 = 24 completes a solution. Step 0 of 5: starting position — four numbers, reach 24 using each number once.]
The formal intuition — from IO to ToT

The paper formalizes each prompting method as a sampling distribution over the same model p_θ. Each method generalizes the previous one:

IO prompting — sample one output directly: y ∼ p_θ(y | x).

Chain-of-thought — sample a reasoning chain, then the answer: z_1, …, z_n ∼ p_θ(z_i | x, z_1, …, z_{i−1}), then y ∼ p_θ(y | x, z_1, …, z_n).

CoT-SC — sample k chains independently, then take a majority vote over the final answers: y = arg max_y #{i : y^(i) = y}.

Tree of Thoughts — search over a tree of states with generation + evaluation + pruning. A state s = [x, z_1, …, z_i] is the input plus thoughts so far. At each step, the generator G(p_θ, s, k) proposes k candidate next thoughts, and the evaluator V(p_θ, S) scores each resulting state.

BFS keeps the b best states per level. DFS goes deep, backtracking when the evaluator scores a state below a threshold v_th. IO, CoT, and CoT-SC are all special cases of ToT — trees of limited depth and breadth, with no evaluation.
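The BFS variant fits in a few lines of Python. This is a minimal sketch, not the paper's implementation: `propose` and `value` are hypothetical stand-ins for the generator and evaluator model calls, and the toy demo swaps in plain functions so the search skeleton runs on its own.

```python
# Minimal ToT-BFS skeleton. A state is the input plus the thoughts
# chosen so far; `propose` and `value` are hypothetical stand-ins
# for the generator and evaluator model calls.

def tot_bfs(root, propose, value, steps, breadth):
    """Expand every surviving state, score the candidates, and keep
    only the `breadth` best states at each level."""
    frontier = [root]
    for _ in range(steps):
        candidates = [state + [thought]
                      for state in frontier
                      for thought in propose(state)]
        candidates.sort(key=value, reverse=True)
        frontier = candidates[:breadth]          # prune the rest
    return max(frontier, key=value)

# Toy demo: pick 4 digits from {1, 2, 3} whose sum is 10. The
# evaluator rewards states whose running sum is close to the target.
best = tot_bfs(root=["input"],
               propose=lambda s: [1, 2, 3],
               value=lambda s: -abs(10 - sum(s[1:])),
               steps=4, breadth=5)
print(sum(best[1:]))  # → 10
```

The same skeleton covers Game of 24 by making `propose` the one-shot propose prompt and `value` the sampled sure/likely/impossible scorer.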


Generator, evaluator, search

The generator produces candidate next thoughts. Two strategies: sample i.i.d. from a CoT prompt (good when the thought space is rich, like creative writing), or propose sequentially in a single context (good when thoughts are constrained, like one equation at a time in Game of 24). The propose strategy uses a remarkably minimal prompt — one example showing the format, then the current state. All search logic lives in Python, not in the prompt. The model just generates candidates; the orchestrator decides what to do with them.

the generator · propose prompt · Game of 24
One-shot prompt — candidates generated together in one pass to avoid duplicates
Propose prompt

Input: 2 8 8 14
Possible next steps:
2 + 8 = 10 (left: 8 10 14)
8 / 2 = 4 (left: 4 8 14)
14 + 2 = 16 (left: 8 8 16)
2 * 8 = 16 (left: 8 14 16)
8 - 2 = 6 (left: 6 8 14)
14 - 8 = 6 (left: 2 6 8)
14 / 2 = 7 (left: 7 8 8)
14 - 2 = 12 (left: 8 8 12)

Input: 4 9 10 13
Possible next steps:
Candidates
4+9=13 (left: 10 13 13)
13−9=4 (left: 4 4 10)
10−4=6 (left: 6 9 13)
9−4=5 (left: 5 10 13)
4×9=36 (left: 10 13 36)
13−10=3 (left: 3 4 9)
10+9=19 (left: 4 13 19)
10+4=14 (left: 9 13 14)

The propose prompt is one-shot — a single worked example followed by the current input. All candidates are generated together in one pass, which avoids the duplicate proposals that independent sampling would produce.
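In code, the propose step is just string assembly plus one model call. A sketch, assuming a hypothetical `call_model` client (not defined here); the example text mirrors the prompt shown above.

```python
# Sketch of the propose step. The one-shot example mirrors the prompt
# shown above; `call_model` (not defined here) is a hypothetical LLM
# client that returns a single completion string.

PROPOSE_EXAMPLE = """\
Input: 2 8 8 14
Possible next steps:
2 + 8 = 10 (left: 8 10 14)
8 / 2 = 4 (left: 4 8 14)
14 + 2 = 16 (left: 8 8 16)
2 * 8 = 16 (left: 8 14 16)
8 - 2 = 6 (left: 6 8 14)
14 - 8 = 6 (left: 2 6 8)
14 / 2 = 7 (left: 7 8 8)
14 - 2 = 12 (left: 8 8 12)
"""

def propose_prompt(numbers):
    """One-shot: the worked example, then the current state. All
    candidates come back in one completion, avoiding the duplicates
    that independent sampling would produce."""
    current = " ".join(str(n) for n in numbers)
    return f"{PROPOSE_EXAMPLE}\nInput: {current}\nPossible next steps:\n"

def parse_candidates(completion):
    """Each non-empty line of the completion is one candidate move."""
    return [line.strip() for line in completion.splitlines() if line.strip()]

prompt = propose_prompt([4, 9, 10, 13])
# candidates = parse_candidates(call_model(prompt))  # one model call
```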

The evaluator scores how promising each state is. Again, two strategies: a value prompt that asks the model to rate each state independently (“sure / likely / impossible”), or a vote prompt that shows all candidates side-by-side and asks the model to pick the best one. Value works when you can judge states in isolation. Voting works when comparison is easier than absolute scoring.

the evaluator · value prompt · Game of 24
Few-shot prompt — LM reasons then outputs sure / likely / impossible — sampled 3× per node
Value prompt
Evaluate if given numbers can reach 24 (sure / likely / impossible)

10 14
10 + 14 = 24
sure

11 12
11 + 12 = 23
12 - 11 = 1
11 * 12 = 132
11 / 12 = 0.91
impossible

4 4 10
4 + 4 + 10 = 18
4 * 10 - 4 = 36
(10 - 4) * 4 = 24
sure

5 7 8
5 + 7 + 8 = 20
(8 - 5) * 7 = 21
I cannot obtain 24 now, but numbers are within a reasonable range
likely

1 3 3
1 * 3 * 3 = 9
(1 + 3) * 3 = 12
1 3 3 are all too small
impossible

... (3 more examples)
Current state: 4 6

LM output: 4 * 6 = 24 → sure

Sampled 3×: sure, sure, sure → 20 + 20 + 20 = 60 → verdict: sure. On the log score scale from 0.003 (impossible) through 20 (one sure) to 60, that score sits at the very top.

With only two numbers left, the evaluator finds 4 × 6 = 24 immediately. Three “sure” votes is the highest possible confidence, so this branch is prioritized.

The evaluator samples three times per state and aggregates the scores (sure = 20, likely = 1, impossible = 0.001). This is a crude but effective proxy for confidence — a state scored “sure” by all three samples is overwhelmingly preferred over one scored “likely” twice and “impossible” once. The scoring weights are hand-tuned. The paper doesn’t explore how sensitive results are to these specific values — whether a different weighting (say, sure = 10, likely = 2) would change the outcome is left as an implicit assumption.
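The aggregation step itself is tiny. A sketch using the hand-tuned weights above; `verdicts` would come from three samples of the value prompt.

```python
# Score aggregation: the value prompt is sampled three times per
# state, verdicts map to hand-tuned weights, and the sums are
# compared. The weights make the scale effectively logarithmic.

WEIGHTS = {"sure": 20.0, "likely": 1.0, "impossible": 0.001}

def state_score(verdicts):
    """Sum the weights of the verdicts sampled for one state
    (three samples per state in the paper's setup)."""
    return sum(WEIGHTS[v] for v in verdicts)

print(state_score(["sure", "sure", "sure"]))  # → 60.0
# A single "sure" outweighs any mix of weaker verdicts:
assert state_score(["sure", "impossible", "impossible"]) > \
       state_score(["likely", "likely", "likely"])
```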

score aggregation · sure = 20 · likely = 1 · impossible = 0.001

[Interactive scale: aggregated scores fall into prune / keep (low priority) / keep (high priority) bands. On the log scale from 0.003 to 60, one “sure” vote (20) outweighs 20,000 “impossible” votes (0.001 each).]

Three tasks, three search shapes

ToT’s modularity means the same framework adapts to very different problems by changing what a “thought” is, how the evaluator scores, and which search algorithm navigates the tree. Game of 24 — which the visualizations above already walk through — uses short, structured thoughts (one equation), BFS to explore wide, and value scoring to rate each state independently. But the paper tests two more tasks that push the framework in very different directions.

Mini Crosswords requires DFS instead of BFS. Each thought is one word to fill in (5–10 per game), and the tree is too deep for breadth-first exploration. DFS commits to the most promising word, goes deeper, and backtracks when the evaluator deems a remaining clue “impossible” given the current letter constraints. The key insight: filling one row constrains every column it crosses — and the evaluator catches those conflicts before the search wastes compute going deeper. DFS with backtracking is what makes Crosswords work at all. Without it, word accuracy drops from 60% to 20%. The model can’t see 5 steps ahead, so it inevitably picks words that conflict with later clues. Backtracking lets it recover from those mistakes — the same principle as Reflexion, but within a single episode rather than across trials.
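The depth-first configuration is the same skeleton with recursion in place of a frontier. Again a sketch with hypothetical `propose` and `value` model wrappers; the threshold plays the role of the paper's “impossible” verdict on a remaining clue.

```python
# Sketch of ToT-DFS with value-based pruning and backtracking, the
# configuration used for Mini Crosswords. `propose` and `value` are
# hypothetical model wrappers.

def tot_dfs(state, propose, value, is_solution, threshold,
            depth=0, max_depth=10):
    """Commit to the most promising candidate, go deeper, and
    backtrack when a branch is pruned or dead-ends."""
    if is_solution(state):
        return state
    if depth >= max_depth:
        return None
    for cand in sorted(propose(state), key=value, reverse=True):
        if value(cand) < threshold:
            continue                 # pruned: evaluator says "impossible"
        found = tot_dfs(cand, propose, value, is_solution,
                        threshold, depth + 1, max_depth)
        if found is not None:
            return found
        # fall through: backtrack and try the next candidate
    return None

# Toy demo: build the string "aba" one letter at a time. The dead-end
# "aa…" branch is explored first, forcing the backtracking path.
found = tot_dfs("", propose=lambda s: [s + "a", s + "b"],
                value=len, is_solution=lambda s: s == "aba",
                threshold=0, max_depth=3)
print(found)  # → aba
```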

Creative Writing is the surprising case. The input is 4 random sentences; the output must be a coherent 4-paragraph passage ending in those sentences. You wouldn’t use CoT for this — there’s no “reasoning chain” that leads to a coherent essay. But ToT reframes it: a thought is an entire writing plan, not a single step. The tree has depth 2 — generate 5 plans, vote on the best one, then generate 5 passages from the winning plan, vote again. The evaluator uses voting instead of value scoring because coherence is easier to judge comparatively (“which plan is better?”) than absolutely (“rate this plan 1–10”). This is the first time in the series where the evaluator compares candidates side by side rather than scoring them independently. The voting prompt shows all 5 plans together and asks the model to pick the best one — a “step-wise self-consistency” strategy, as the paper calls it.
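The vote evaluator is mostly glue: show all candidates in one prompt, sample the vote several times, tally. A sketch — the prompt wording is a paraphrase rather than the paper's exact text, and `call_model` would be the actual LLM client.

```python
# Sketch of the vote evaluator (step-wise self-consistency). The
# prompt wording is a paraphrase, not the paper's exact text.

import re
from collections import Counter

def vote_prompt(candidates):
    """Show all candidates side by side and ask for the best one."""
    body = "\n\n".join(f"Choice {i + 1}:\n{c}"
                       for i, c in enumerate(candidates))
    return (f"Given the {len(candidates)} writing plans below, decide "
            f'which is most promising. Conclude with "The best choice '
            f'is {{number}}."\n\n{body}')

def tally_votes(completions, n_candidates):
    """Count which candidate each sampled vote picked (0-indexed)."""
    counts = Counter()
    for text in completions:
        match = re.search(r"best choice is (\d+)", text, re.IGNORECASE)
        if match and 1 <= int(match.group(1)) <= n_candidates:
            counts[int(match.group(1)) - 1] += 1
    return counts

# Five sampled votes over five plans; plan 2 (index 1) wins 3-1-1.
votes = ["...so the best choice is 2"] * 3 + \
        ["The best choice is 1", "The best choice is 5"]
print(tally_votes(votes, 5).most_common(1))  # → [(1, 3)]
```

The winning plan then seeds the next level of the tree — 5 passages, voted on the same way.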

[Interactive visualization: DFS on a 5×5 Mini Crossword. Step 0 of 6: an empty 5×5 grid with 10 clues (5 across, 5 down). Each word you fill constrains the letters available for crossing words; DFS proposes candidates for h1 first.]

Across
h1. Duties; things to do
h2. Engine; drives action
h3. Imposing; magnificent
h4. Beauty parlor
h5. Daring feat

Down
v1. To heap; accumulate
v2. Place to practice
v3. Sly; crafty
v4. Fasteners
v5. Heavenly bodies

The pattern: shallow trees with clear evaluation → BFS + value scoring. Shallow trees with subjective evaluation → BFS + voting. Deep trees with constraint satisfaction → DFS + backtracking. The framework is the same; the configuration changes everything.


4% to 74% — what search unlocks

The paper tests ToT on three tasks designed to challenge GPT-4 — not borrowed from existing benchmarks but built specifically to require planning and search. Game of 24: use four numbers and basic arithmetic (+, −, ×, ÷) to reach 24. Example: input 4 9 10 13, solution (10 − 4) × (13 − 9) = 24. The paper tests on 100 hard games (indices 901–1000 from a set of 1,362 sorted by human solve time). Success = a valid equation that equals 24.
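That success criterion is mechanical enough to check in a few lines: a valid arithmetic expression over exactly those four numbers that evaluates to 24. An illustrative checker, not the paper's evaluation code:

```python
# Illustrative checker for the Game of 24 success criterion: the
# expression must use each input number exactly once and equal 24.

import ast
import re

def is_valid_24(expr, numbers):
    """True if `expr` uses each input number exactly once and equals 24."""
    if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
        return False                       # arithmetic characters only
    if sorted(int(n) for n in re.findall(r"\d+", expr)) != sorted(numbers):
        return False                       # each number used exactly once
    try:
        value = eval(compile(ast.parse(expr, mode="eval"), "<expr>", "eval"),
                     {"__builtins__": {}})
    except (SyntaxError, ZeroDivisionError):
        return False
    return abs(value - 24) < 1e-6

print(is_valid_24("(10 - 4) * (13 - 9)", [4, 9, 10, 13]))  # → True
print(is_valid_24("10 + 4 + 13 - 9", [4, 9, 10, 13]))      # → False (equals 18)
```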

On Game of 24, the results are not marginal — they’re a different capability regime. CoT manages 4%. CoT with self-consistency (100 samples, majority vote) reaches 9%. ToT with b=5 achieves 74%. CoT-SC (Wang et al., 2022) samples k independent CoT chains and takes a majority vote on the final answer. It’s the strongest baseline because it explores multiple paths. But it explores them blindly — no evaluation, no pruning, no backtracking. ToT uses fewer completion tokens than CoT-SC (k=100), though it costs more overall due to repeated prompt context at each search step.

The visualization below puts CoT-SC and ToT side by side on the same input to show why brute-force sampling can’t match guided search.

[Interactive visualization: same input, different strategies — CoT-SC and ToT both receive the same four numbers and are stepped through side by side.]

On Creative Writing — write a coherent passage where four random sentences must appear as paragraph endings — ToT scores 7.56/10 on GPT-4 coherence ratings versus 6.93 for CoT and 6.19 for IO. In blind human comparisons, readers preferred ToT over CoT in 41 of 100 pairs versus 21 that preferred CoT. Creative Writing uses a different ToT shape: depth 2 (plan → passage), breadth 5 at the plan level but b=1 at the passage level. The model generates 5 plans, votes for the best one, then generates 5 passages from the winning plan and votes again. The “tree” is more of a tournament bracket than a deep search.

On 5×5 Mini Crosswords — the deepest search problem, with 5–10 thought steps per game — CoT achieves 15.6% word accuracy. ToT reaches 60%, solving 4 out of 20 full games. The ablations here are revealing: removing backtracking collapses word accuracy from 60% to 20%. Removing pruning drops it to 41.5%. Recovery from mistakes is the mechanism doing the real work, not just the extra LLM calls. The ablation structure here mirrors a controlled experiment: ToT with backtracking (60%), ToT without backtracking (20%), ToT without pruning (41.5%). The 40-point drop from removing backtracking alone tells you where the real leverage is — recovery from mistakes, not just evaluation quality.


The cost of deliberation

Search is not free. ToT uses roughly 5–100× more tokens than CoT depending on the task. On Game of 24, that’s ~$0.74 per task versus ~$0.003 for a single CoT chain. The paper acknowledges this but frames it carefully: ToT uses comparable tokens to CoT-SC with 100 samples (~$0.47), yet dramatically outperforms it. The compute isn’t wasted — it’s spent more intelligently. The paper reports average token counts per task (Table 1) but not dollar costs. We derived costs using 2023 GPT-4 8K pricing ($0.03/1K prompt tokens, $0.06/1K completion tokens). ToT costs more per token than CoT-SC despite using fewer completion tokens because each search step re-sends the full context — prompt tokens dominate and they accumulate with every generator and evaluator call.
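The arithmetic behind those dollar figures is easy to reproduce. A sketch using 2023 GPT-4 8K pricing; the token counts below are illustrative placeholders, not the paper's Table 1 values.

```python
# Cost model under 2023 GPT-4 8K pricing: $0.03 per 1K prompt tokens,
# $0.06 per 1K completion tokens. Token counts are illustrative
# placeholders, NOT the paper's Table 1 numbers.

def gpt4_cost(prompt_tokens, completion_tokens):
    return prompt_tokens / 1000 * 0.03 + completion_tokens / 1000 * 0.06

# CoT-SC's contexts stay short; ToT re-sends the accumulated context
# on every generator and evaluator call, so its prompt tokens balloon
# even when its completion tokens are fewer.
cot_sc = gpt4_cost(prompt_tokens=1_000, completion_tokens=10_000)
tot    = gpt4_cost(prompt_tokens=22_000, completion_tokens=4_000)
assert tot > cot_sc            # fewer completion tokens, higher cost
```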

[Interactive chart: success rate (0–100%) vs. cost per task ($0.001–$1.00, log scale) on Game of 24 — higher and cheaper is better. Points: IO, CoT, CoT-SC (k=100), oracle baselines IO best@100 and CoT best@100, ToT (b=1), and the main result ToT (b=5).]

Even with an oracle that picks the single correct chain from 100 CoT samples, you only reach 49% — and at the same token cost as ToT. ToT b=5 reaches 74% with no oracle needed. The question isn’t whether search costs more. It’s whether the problems you’re solving are worth spending tokens on. The GPT-3.5 results (buried in the appendix) are telling: ToT with GPT-3.5 on Game of 24 drops to just 4% — no better than CoT with GPT-4. The evaluator’s ability to distinguish “sure” from “impossible” depends on model capability in ways the paper doesn’t fully explore.


The bigger picture

Looking back from 2026, ToT’s most important contribution isn’t the search algorithm — BFS and DFS are textbook techniques. It’s the demonstration that a single language model can play multiple functional roles simultaneously: generator, evaluator, and critic. Once you externalize those roles into separate agents that communicate, you’re essentially building a multi-agent system. ToT is the single-agent precursor to that decomposition. The connection to inference-time compute is direct. OpenAI’s o1 and o3 models use a similar principle — spending more compute at inference time to improve reasoning — but shift the search from explicit prompt-level orchestration (ToT’s approach) to implicit search trained via reinforcement learning. The insight that “more thinking time = better answers” is the same; the implementation moved inside the model.

ToT showed that a single model can generate, evaluate, and search over its own reasoning. That pattern — one system playing multiple deliberative roles — became a key precursor to the multi-agent architectures that followed.

The arc from CoT to ReAct to Reflexion to ToT traces a progression of capabilities: scratch space → action → memory → search. Each paper adds one dimension of problem-solving. And each one surfaces the same underlying question: how much can you trust the model to evaluate its own reasoning?


Open question

The results are impressive — but they share a pattern. Every task has a clear success criterion: does the equation equal 24? Does the word fit the crossword constraints? Even Creative Writing, the most subjective task, gets evaluated through side-by-side comparison where “better” is relatively judgeable. The paper never ventures into territory where the evaluator has no ground to stand on — tasks where “good” is genuinely ambiguous, where thoughts resist clean decomposition, or where the search space is too open-ended for scoring to mean anything. Those are exactly the conditions that broke Reflexion on WebShop. The evaluator is the most novel and most fragile part of ToT. On Creative Writing — the closest thing to an open-ended task in the paper — the gap between ToT and baselines is smallest. That’s not a coincidence: when “good” becomes subjective, the evaluator loses its leverage.

That blind spot matters more in hindsight. The GPT-3.5 results buried in the appendix show that ToT’s evaluator needs frontier-model capability to work at all — ToT with GPT-3.5 on Game of 24 scores 4%, no better than single-chain CoT with GPT-4. If the evaluator requires that level of capability, the question becomes whether the model can simply internalize the search process itself. OpenAI’s o1 and o3 suggest it can — spending more inference-time compute on internal deliberation rather than external orchestration. If that’s the direction, ToT becomes a historical proof of concept: it showed search works, and then the search moved inside the model. If explicit, auditable search retains value — for interpretability, for debugging, for domains where you need to see the reasoning — it becomes the foundation.

With ToT, the foundational arc is complete: scratch space (CoT), action (ReAct), memory (Reflexion), search (ToT). What comes next is decomposition — splitting these roles across multiple agents that coordinate, debate, and build on each other’s reasoning.

References
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023. arXiv:2305.10601 (submitted 17 May 2023).

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366 (submitted 20 Mar 2023).

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629 (submitted 6 Oct 2022).

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903 (submitted 28 Jan 2022).

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023. arXiv:2203.11171 (submitted 21 Mar 2022).

Yao, S., Chen, H., Yang, J., & Narasimhan, K. (2022). WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. NeurIPS 2022. arXiv:2207.01206 (submitted 4 Jul 2022).