Chain-of-Thought: The prompting trick that unlocked reasoning in language models


Before models could act in the world, they had to learn to reason about it. This paper introduced chain-of-thought prompting — a technique that made that possible not by changing the model, but by changing what you put in the prompt. It’s a surprisingly simple idea, and it became the foundation for nearly everything that followed. (Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.)


The problem standard prompting couldn’t solve

In 2022, scaling language models had hit a ceiling on reasoning tasks. More parameters helped with language. They didn’t help with arithmetic, commonsense reasoning, or symbolic manipulation. The scaling curves were flat.

Before CoT, models were failing at questions a ten-year-old could answer: “Mike plays ping pong for 40 minutes. In the first 20 minutes he scores 4 points. In the second 20 minutes he scores 25% more. How many total?” Or: “A coin is heads up. Maybelle flips it. Shalonda doesn’t. Is it still heads up?” Not because they lacked knowledge — but because answering requires holding a chain of reasoning across multiple steps.

The standard approach — give the model a few input-output examples and ask it to continue the pattern — works well when the answer can be retrieved or pattern-matched. It fails when reaching the answer requires multiple steps of reasoning. The model sees the question and jumps directly to an answer. When the question is hard enough, that jump produces confident nonsense.

The cafeteria question below has a correct answer of 9. Standard prompting returns 27 — a plausible-looking number produced by mangling the arithmetic instead of tracking what actually happened to the apples.

Standard prompting
Example in prompt
Q: Roger has 5 tennis balls. He buys 2 cans of 3 each. How many does he have now?
A: The answer is 11.
New question
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Chain-of-thought prompting
Example in prompt
Q: Roger has 5 tennis balls. He buys 2 cans of 3 each. How many does he have now?
Roger started with 5 balls. 2 cans of 3 each is 6 balls. 5 + 6 = 11.
A: The answer is 11.
New question
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Same question. The only difference is the few-shot example: standard shows Q→A, CoT shows Q→reasoning→A. The model applies the same pattern to the new question.
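To make the difference concrete, here is a minimal sketch of the two prompt styles as plain strings, built from the example above. The `extract_answer` helper and its “The answer is N” regex are assumptions about the evaluation harness, not the paper’s exact code.

```python
import re

# Both prompts share the same exemplar question and the same new question.
# The ONLY difference is whether the exemplar answer shows reasoning.
exemplar_q = ("Q: Roger has 5 tennis balls. He buys 2 cans of 3 each. "
              "How many does he have now?")
new_q = ("Q: The cafeteria had 23 apples. If they used 20 to make lunch "
         "and bought 6 more, how many apples do they have?")

# Standard: Q -> A. The model must jump straight to a number.
standard_prompt = f"{exemplar_q}\nA: The answer is 11.\n\n{new_q}\nA:"

# CoT: Q -> reasoning -> A. The exemplar demonstrates the intermediate steps.
cot_prompt = (f"{exemplar_q}\n"
              f"A: Roger started with 5 balls. 2 cans of 3 each is "
              f"6 balls. 5 + 6 = 11. The answer is 11.\n\n"
              f"{new_q}\nA:")

def extract_answer(completion: str):
    """Pull the final integer after 'The answer is' (assumed format)."""
    m = re.search(r"The answer is (-?\d+)", completion)
    return int(m.group(1)) if m else None

# A CoT completion for the new question would parse like this:
extract_answer("They had 23 apples. 23 - 20 = 3. 3 + 6 = 9. "
               "The answer is 9.")  # 9
```

Feed either prompt to any completion-style LLM API; the model continues after the trailing `A:`, and the exemplar’s shape determines whether it reasons first or answers immediately.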

The model isn’t wrong because it lacks knowledge. It’s wrong because it has no scratch space — no way to hold intermediate results while working toward a final answer. Every token it generates is a commitment. Without room to reason, it compresses multi-step problems into a single intuitive leap. Sometimes that works. For anything requiring bookkeeping, it doesn’t.


Chain of thought: show the reasoning, not just the answer

Instead of showing the model question-answer pairs, you show it question-reasoning-answer triples. The model sees how to work through a problem step by step, and applies the same pattern to new questions.

Standard
y ~ P(y | x)
x: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
y: The answer is 9.

↓ add r to exemplars

Chain-of-thought
r, y ~ P(r, y | x)
x: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
r: They had 23 apples originally. They used 20 to make lunch: 23 − 20 = 3. They bought 6 more: 3 + 6 = 9.
y: The answer is 9.

Standard prompting shows the model (xi, yi) pairs as exemplars. Given a new x, it samples y ~ P(y | x) — jumping directly to an answer. CoT adds reasoning traces to the exemplars: (xi, ri, yi). Now given x, the model samples r, y ~ P(r, y | x) — reasoning before answering. Ordering is load-bearing: placing r after y collapses performance to baseline.

The key shift is what gets generated. Standard prompting produces a single token sequence that jumps to an answer. CoT prompting produces a longer sequence where intermediate conclusions become context for subsequent steps. The model isn’t smarter — it’s using its own output as working memory.
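A toy illustration of what that working memory buys (mine, not from the paper): because the chain externalizes each intermediate result, every arithmetic step in a trace can be checked mechanically.

```python
import re

# A sample chain-of-thought completion for the cafeteria question.
trace = ("They had 23 apples originally. They used 20 to make lunch: "
         "23 - 20 = 3. They bought 6 more: 3 + 6 = 9. The answer is 9.")

def verify_steps(text: str) -> bool:
    """Check every 'a + b = c' / 'a - b = c' claim in a reasoning trace."""
    for a, op, b, c in re.findall(r"(\d+) ([+-]) (\d+) = (\d+)", text):
        expected = int(a) + int(b) if op == "+" else int(a) - int(b)
        if expected != int(c):
            return False  # an intermediate step is arithmetically wrong
    return True

verify_steps(trace)                            # True
verify_steps("They used 20: 23 - 20 = 5.")     # False
```

With standard prompting there is nothing to check: the model emits only a final number. The chain makes the bookkeeping inspectable, which is also what later self-consistency and verifier approaches exploit.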

The examples were manually written by the authors and reused across all benchmarks within a task category. Robustness experiments confirmed that different annotators, different writing styles, and different exemplar sets all produce similar gains. What matters is that the reasoning is shown — not how it is written.

This is a key finding: CoT is robust to prompt variation. The mechanism doesn’t depend on specific wording or formatting — it depends on the presence of sequential reasoning steps.


Three hypotheses, three eliminations

The performance gain from CoT prompting requires an explanation. Three alternatives are tested and eliminated.

An ablation study removes or modifies one component of a system at a time to test whether it’s responsible for an observed effect. If removing it collapses performance, it was load-bearing. If nothing changes, it wasn’t.

[Chart: GSM8K accuracy on LaMDA 137B — the original standard-prompting baseline, each ablation, and full CoT.]

Exemplar structure:
x: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?
r: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11.
y: The answer is 11.

Sequential natural language reasoning precedes the answer. This is the only condition that meaningfully improves over baseline.

Every ablation collapses back to roughly the standard baseline (~6%). Only the full chain of thought breaks out — the only condition that gives the model actual scratch space. Why each alternative fails:

Extra compute tokens are not the mechanism. Replacing reasoning steps with dots matched to the same token length produces no gain over baseline. The scratch space is there, but it’s blank — there’s nothing to reason with.

Externalizing the equation is not sufficient. Showing the mathematical equation without natural language reasoning helps on simple datasets where problems translate directly to equations. On GSM8K, it fails — the problems are semantically complex enough that identifying which equation to write requires the reasoning steps themselves.

GSM8K (Cobbe et al., 2021) is a benchmark of grade school math word problems requiring 2–8 steps of arithmetic (1,319 in the test set). Example: “James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?” The paper’s headline benchmark — hardest of the five, and the one with the flattest standard-prompting scaling curve.

Knowledge activation is not sufficient. Placing the reasoning chain after the answer — so the model has already committed — collapses performance to baseline. If CoT helped merely by priming relevant knowledge, the order would not matter. It does.

This is the most important ablation. It proves that the model needs to reason before committing to an answer — not just have reasoning-shaped text nearby.

What survives: sequential natural language reasoning must precede the answer. The scratch space has to contain real work, and the model has to use it before committing. That is the mechanism.
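The ablation conditions can be sketched as prompt templates. The variable names and exact wording below are illustrative, not the paper’s verbatim prompts; only the structure of each condition matches.

```python
# One few-shot exemplar, assembled four ways to mirror the ablations.
question = ("Q: Roger has 5 tennis balls. He buys 2 cans of 3 each. "
            "How many does he have now?")
reasoning = ("Roger started with 5 balls. 2 cans of 3 each is 6 balls. "
             "5 + 6 = 11.")
answer = "The answer is 11."

# Full CoT: the only condition that beats the baseline.
full_cot = f"{question}\nA: {reasoning} {answer}"

# Ablation 1 — variable compute: dots matched to the reasoning's length.
# The scratch space exists but is blank; no gain over baseline.
dots_only = f"{question}\nA: {'.' * len(reasoning)} {answer}"

# Ablation 2 — equation only: externalizes the math but not the language.
# Fails on GSM8K, where finding the equation is the hard part.
equation_only = f"{question}\nA: 5 + 2 * 3 = 11. {answer}"

# Ablation 3 — reasoning after the answer: the model has already
# committed by the time the chain appears; collapses to baseline.
reasoning_after = f"{question}\nA: {answer} {reasoning}"
```

Only `full_cot` puts sequential natural language reasoning before the answer, which is exactly the condition the ablations isolate as the mechanism.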


Below 100B parameters, it backfired

There’s a catch. CoT only works if the model is large enough to use it. Below ~10B parameters, models generated fluent but incoherent reasoning chains and committed to wrong answers more confidently than if they had answered directly.

This was counterintuitive: showing a small model how to reason made it worse. The model generated text that looked like reasoning but wasn’t — and then trusted it.

[Chart: GSM8K accuracy vs. model size (0.3B–540B) for GPT, LaMDA, and PaLM, standard vs. chain-of-thought prompting. Standard prompting stays nearly flat; CoT gains appear past ~100B, with PaLM 540B reaching 56.9% vs. 17.9% standard.]

The gains only appear reliably above ~100B parameters. On GSM8K, PaLM 540B jumps from 17.9% to 56.9% — surpassing a model explicitly fine-tuned for this benchmark, using 8 prompt examples and no weight updates.

Fine-tuned models are trained on thousands of labeled examples per task — their weights change to specialize. CoT prompting changes nothing in the model. The same weights, prompted differently, beat the specialist.

The paper interpreted this as an emergent ability — a capability that doesn’t improve gradually with scale but appears suddenly once a threshold is crossed. Think less like a volume dial being turned up and more like water being heated: nothing dramatic happens between 0°C and 99°C, then at 100°C it changes state entirely.

The paper uses “reasoning” throughout, but the authors are explicit: showing that a model produces better answers via intermediate steps doesn’t answer whether it’s actually reasoning in any meaningful sense. The chain of thought is an output — not a window into what the model is computing internally.


What we’ve learned since

The original CoT paper left a sharp question open: are the weights of a 10B model fundamentally incapable of reasoning, or just incapable of being prompted into it?

Subsequent work answered this decisively: it’s the latter. Small models can reason — you just can’t get there through prompting alone. Fine-tuning on reasoning traces, distillation from larger models, and curated training data all unlock multi-step reasoning well below the ~100B threshold.

The evidence accumulated quickly. Zelikman et al. (2022) showed models can bootstrap reasoning via iterative self-training (STaR). Mukherjee et al. (2023) fine-tuned a 13B model on GPT-4 explanation traces and matched much larger prompted models. DeepSeek-AI (2025) distilled reasoning into 1.5B–7B parameter models that perform multi-step reasoning the original paper said required 100B+.

This reframes the scale threshold. The capacity for reasoning doesn’t emerge at 100B parameters — it’s latent well below that. What scales is the model’s ability to activate that capacity from a few prompt examples alone, without any weight updates. CoT discovered something about the limits of prompting, not about reasoning itself.

The phase transition framing has also been challenged directly. Schaeffer et al. (2023) argued that many apparent emergent abilities are artifacts of discontinuous evaluation metrics like exact-match accuracy. Measured with continuous metrics, the scaling curves are smooth — no sudden jump, just gradual improvement that crosses a visibility threshold.

If the “phase transition” is partly a measurement artifact, the dramatic story — reasoning appears suddenly at scale — needs qualifying. The capability may build gradually, invisible to coarse metrics until it crosses a threshold of usefulness.

What remains genuinely open is the mechanistic question. The ablations ruled out extra compute, but later work found that even logically invalid chains help, as long as they contain the right bridge entities connecting question to answer. Wang et al. (2023) found that what matters in a chain is the presence of relevant intermediate entities — not logical validity. Chains with wrong reasoning but correct bridge objects still improve performance, suggesting CoT works partly by keeping relevant information in context, not only by performing step-by-step deduction.

So the deeper question is no longer “can small models reason?” — we know they can. It’s: what exactly changes during training that makes reasoning activatable through prompting? And why does that activation mechanism require scale when the underlying capacity does not? Those questions remain unanswered.


The bigger picture

Looking back from 2026, CoT’s contribution is easy to understate because the idea is so simple. But it established something fundamental: language isn’t just the medium for answers — it’s a substrate the model can reason through. Every technique that followed builds on that insight.

At sufficient scale, you don’t need to retrain a model to unlock reasoning. You just need to show it how to think. But CoT couldn’t solve the closed-world problem. The reasoning is powerful but sealed — no new information enters. When the model doesn’t know a fact, it invents one.

ReAct (Yao et al., 2022) is the direct response. It keeps CoT’s reasoning loop but opens it to the outside world: the model can pause mid-chain to search, look something up, or verify a fact, then fold what it finds back into its reasoning. Where CoT gave models scratch space, ReAct gave them hands.

References

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., Schulman, J. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023. arXiv:2203.11171.

Zelikman, E., Wu, Y., Mu, J., Goodman, N. D. (2022). STaR: Bootstrapping Reasoning With Reasoning. NeurIPS 2022. arXiv:2203.14465.

Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H., Awadallah, A. (2023). Orca: Progressive Learning from Complex Explanation Traces of GPT-4. arXiv:2306.02707.

DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.

Schaeffer, R., Miranda, B., Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 2023. arXiv:2304.15004.

Wang, B., Min, S., Deng, X., Shen, J., Wu, Y., Zettlemoyer, L., Sun, H. (2022). Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters. ACL 2023. arXiv:2212.10001.