ReAct: How giving LLMs the ability to think and act changed everything


Every agent system you interact with today runs some version of the same loop: think, act, observe, think again. ReAct is the paper that established why that loop works, and what breaks when you remove either half of it. [Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.]

The paper’s own framing is precise: reasoning and acting in language models had “primarily been studied as separate topics.” ReAct’s contribution was testing what happens when you combine them — not just in one domain, but across fundamentally different task types — and analyzing the failure modes of each approach carefully enough to explain why the combination wins.


The two approaches it combined

Chain-of-thought prompting taught models to reason step by step before answering. It works — but it reasons in a closed loop. No new information enters. The model uses what’s already in its weights. When it doesn’t know something, it doesn’t stop — it generates whatever comes next. The result is confident reasoning from potentially wrong premises. [Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Published the same year as ReAct. The two papers are in direct conversation.]

Action-only approaches gave models tools to interact with the world — search engines, APIs, browsers. The information problem is addressed: the model can look things up. But without reasoning to guide those actions, the model can’t synthesize what it finds, can’t track what it’s already tried, can’t maintain a plan across many steps. [Nakano, R. et al. (2021). WebGPT: Browser-Assisted Question-Answering with Human Feedback. Fine-tuned GPT-3 to use a text-based browser — search, click, scroll — but without explicit reasoning traces between actions.]

ReAct — the name is the idea: Reasoning + Acting, interleaved.

Reason only (chain-of-thought): thinks step by step from memory, never looks anything up. Good reasoning structure; hallucinates facts; can’t verify.

+

Act only (action-only): takes actions and observes results, with no internal reasoning to guide them. Can look things up; gets lost; loops without recovery.

=

Reasoning + Acting (ReAct): thinks, then acts, then updates its reasoning from what it observes. Grounded reasoning; tracks state; 0% hallucination.

The ReAct loop

ReAct augments the model’s action space with language — a “thought” step that doesn’t affect the environment but updates the model’s context before the next action. In CoT, all reasoning happens up front from memory. In ReAct, each thought is informed by what the agent just observed through its actions.

The formal intuition

At each timestep $t$, an agent takes an action $a_t$ conditioned on its context $c_t$ — the full history of observations and actions so far:

$$a_t \sim \pi(a_t \mid c_t), \qquad c_t = (o_1, a_1, \ldots, o_{t-1}, a_{t-1}, o_t)$$

ReAct augments the action space from $\mathcal{A}$ to $\hat{\mathcal{A}} = \mathcal{A} \cup \mathcal{L}$, where $\mathcal{L}$ is the space of natural language. A thought $\hat{a}_t \in \mathcal{L}$ differs from a regular action in one key way: it produces no observation from the environment. Instead it only updates the context:

$$c_{t+1} = (c_t, \hat{a}_t)$$

This means thoughts are free — they cost nothing externally — but they change what the model conditions on for every subsequent action. CoT has no $\mathcal{A}$. Act has no $\mathcal{L}$. ReAct is $\mathcal{A} \cup \mathcal{L}$.

The model generates a thought, takes an action, receives an observation, generates another thought informed by that observation. Repeat.

Each step feeds the next. The model never acts blindly — it always knows why it’s doing what it’s doing.
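In code, the loop is only a few lines. Here is a minimal sketch, assuming a hypothetical `llm` completion function and a `tools` dict mapping action names to callables; the `search[...]` and `finish[...]` action syntax follows the paper’s prompts, everything else is illustrative.

```python
def react_loop(question, llm, tools, max_steps=7):
    """Minimal ReAct loop: thought -> action -> observation, repeated."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        # A thought costs nothing externally; it only grows the context
        # that every subsequent action is conditioned on.
        thought = llm(context + "Thought:")
        context += f"Thought: {thought}\n"

        action = llm(context + "Action:")  # e.g. "search[Apple Remote]"
        context += f"Action: {action}\n"

        if action.startswith("finish["):
            return action[len("finish["):-1]  # final answer

        # Real actions produce observations, which ground the next thought.
        name, arg = action.split("[", 1)
        observation = tools[name](arg.rstrip("]"))
        context += f"Observation: {observation}\n"

    return None  # step budget exhausted; the paper falls back to CoT-SC here
```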

What this changes is what the model knows at each step. CoT’s context is static — training knowledge only. ReAct’s context grows with every observation. Each verified fact from the environment replaces a potential assumption from memory. [This is the key insight: it’s not just that ReAct can look things up. It’s that every observation displaces an assumption, making subsequent reasoning more reliable.]

The comparison below is a real HotpotQA example from the paper — the same knowledge benchmark that appears in the results. Note what each side actually knows at this first step.

Question: Apart from the Apple Remote, what other device can control the program Apple Remote was originally designed to interact with?

Chain-of-Thought, step 1 (its only step):
Thought: “Let’s think step by step. Apple Remote was originally designed to interact with Apple TV. Apple TV can be controlled by iPhone, iPad, and iPod Touch.”
What CoT knows: assumed: Apple Remote → designed for Apple TV; assumed: Apple TV → controlled by iPhone/iPad/iPod.

ReAct, step 1 of 7:
Thought: “I need to search Apple Remote and find what device it was originally designed for, then find what else can control that device.”
What ReAct knows: plan: need to look up Apple Remote.

CoT immediately generates an answer from memory. ReAct forms a plan to look it up.

CoT assumes the Apple Remote was designed for Apple TV — plausible, wrong — and answers in one step. ReAct’s first search returns the actual answer: it was designed for Front Row, a discontinued media center program. That observation changes every subsequent step. The model isn’t reasoning from memory anymore. It’s reasoning from what it found. [This example is drawn directly from the paper’s HotpotQA evaluation. The trajectories are real model outputs, not hand-crafted demonstrations.]
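For concreteness, here is the shape of the full trace, condensed and paraphrased from the paper’s Figure 1 (the actual trajectory includes an extra search miss). The HotpotQA prompts expose exactly three actions: search, lookup, and finish.

```python
# Illustrative ReAct trace for the Apple Remote question, paraphrased
# from the paper. Actions: search[entity], lookup[keyword], finish[answer].
TRACE = """\
Thought 1: I need to search Apple Remote and find the program it was
  originally designed to interact with.
Action 1: search[Apple Remote]
Observation 1: The Apple Remote is a remote control ... originally designed
  to control the Front Row media center program ...
Thought 2: Apple Remote was originally designed for Front Row. I need to
  search Front Row next and find what other device can control it.
Action 2: search[Front Row (software)]
Observation 2: Front Row is a discontinued media center software ... can be
  controlled by an Apple Remote or the keyboard function keys ...
Thought 3: So the answer is the keyboard function keys.
Action 3: finish[keyboard function keys]
"""
```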


What they tested it on

The paper tests ReAct on two task types that make structurally different demands on an agent.

The first is a navigation problem — multi-hop question answering (HotpotQA) where the answer requires chaining information you can’t know in advance. Each search opens new branches. The challenge is knowing which thread to follow. You can’t plan the full path before you start.

The second is a state problem — long-horizon decision making in a simulated household environment (ALFWorld) where every action changes the world. The challenge isn’t finding information. It’s not losing track of what you’ve done and what’s left across 20, 30, 50 steps.

The trajectories below come from the paper’s appendix for the ALFWorld state-tracking task. Both agents have the same goal: find a knife, clean it, place it on the countertop.

ALFWorld task: Put a clean knife in/on the countertop.

Act-only, step 1 of 9:
Action: go to countertop 1

ReAct, step 1 of 9:
Thought: “I need to find a knife, clean it at sinkbasin, then put it on countertop.”
Action: go to cabinet 1

Act immediately moves to the first location. ReAct starts by articulating a full plan: find → clean → place.

Act finds the knife, then tries to clean it without going to the sinkbasin first. Nothing happens. It has no record of why it failed or what comes next. It loops — back to the countertops, attempting to pick up knives that are no longer there.

ReAct’s thought 4 says verbatim: “Now I take a knife (1). Next, I need to go to sinkbasin (1) and clean it.” That sentence is in the context window. The model goes to the sinkbasin, cleans the knife, finishes in 7 actions. [The loop isn’t a quirk. It’s what happens when a model has to reconstruct its current state by reading back through a long sequence of raw observations — and can’t.]
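The decision-making variant differs slightly from the QA loop above: thoughts appear sparsely, with the model deciding at each step whether to think or act. A minimal sketch, assuming a hypothetical `env` with a text interface and the “think:” prefix convention the paper’s ALFWorld prompts use (thoughts get a no-op “OK.” reply):

```python
def react_alfworld(goal, llm, env, max_steps=50):
    """Sparse-thought ReAct loop for a text environment like ALFWorld."""
    context = f"Your task is to: {goal}\n"
    for _ in range(max_steps):
        step = llm(context + "> ")
        context += f"> {step}\n"
        if step.startswith("think:"):
            # Thoughts change nothing in the world, but the plan and state
            # they restate stay in the context window for later steps.
            context += "OK.\n"
        else:
            obs, done = env.step(step)  # real actions mutate world state
            context += f"{obs}\n"
            if done:
                return True
    return False  # ran out of steps without completing the task
```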


Results

Decision making

The trajectory above is one example. The full picture is more striking. Across 134 unseen ALFWorld games, the breakdown below shows overall success rates and per-task-type results.

Task       ReAct   Act
Overall     71%    45%
Pick        92%    88%
Clean       58%    42%
Heat        96%    74%
Cool        86%    67%
Examine     78%    72%
Pick 2      41%    41%
BUTLER, an imitation learning agent trained on thousands of expert demonstrations, achieves 37%. ReAct beats it with 1–2 prompt examples.

On WebShop — a simulated online store with 1.18 million products — ReAct hits 40% success versus 28.7% for the previous best approach (imitation learning + reinforcement learning).

The scale of the gap is worth sitting with. The previous best approaches required thousands of expert demonstrations and dedicated training pipelines. ReAct matches them with a couple of handwritten examples in a prompt. [Shridhar, M. et al. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. ICLR 2021. BUTLER is the imitation learning agent introduced in this paper alongside the ALFWorld benchmark. It learns by watching thousands of expert demonstrations, then tries to replicate the behavior — the standard approach before LLM-based agents.]

Before — BUTLER: 1,000s of expert demonstrations. Months of data collection. Separate training pipeline. 37% success.

After — ReAct: 1–2 prompt examples, written by hand in minutes. No training. No data pipeline. 71% success.

Knowledge tasks

On knowledge tasks the picture is more nuanced. ReAct does not cleanly beat CoT on its own — it wins on FEVER (60.9 vs 56.3) and loses on HotpotQA (27.4 vs 29.4). When they succeed, both mostly succeed for the right reasons (in the paper’s analysis, 86% of CoT’s successes and 94% of ReAct’s are genuinely grounded rather than lucky) — but when they fail, they fail in opposite ways.


56% of CoT’s failures on HotpotQA are hallucinations — the model asserted facts it didn’t have. ReAct’s hallucination rate in failure cases? Zero. Every factual claim came from an observation it actually received. But ReAct’s constrained thought structure — having to fit reasoning into interleaved steps — produces more reasoning errors than CoT’s unconstrained chains. They fail in opposite directions. [This is the core tradeoff: CoT has better reasoning flexibility but hallucinates. ReAct is grounded but constrained. Combining them recovers both strengths.]

Which is why combining them recovers both. When ReAct fails to return an answer within its step budget, fall back to CoT-SC. When CoT-SC can’t reach a confident majority answer, fall back to ReAct. The combination reaches 35.1 on HotpotQA versus 29.4 for CoT alone. [Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023. CoT-SC samples multiple reasoning chains and takes the majority answer. More robust than a single chain, but still can’t verify facts against the world.]
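A hedged sketch of the first fallback direction, reusing the `react_loop` sketch from earlier; `cot_sc` here is a hypothetical helper that samples several chains and majority-votes (the paper samples 21):

```python
from collections import Counter

def cot_sc(question, llm, n_samples=21):
    """Chain-of-thought with self-consistency: sample chains, majority-vote."""
    answers = [llm(f"Q: {question}\nLet's think step by step. Answer:")
               for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

def react_then_cot_sc(question, llm, tools, max_steps=7):
    """Ground the answer when possible; reason from memory only when the
    ReAct step budget runs out without producing an answer."""
    answer = react_loop(question, llm, tools, max_steps)
    return answer if answer is not None else cot_sc(question, llm)
```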


The bigger picture

Looking back from 2026, the lasting contribution isn’t any single benchmark number. It’s the pattern: give a language model the ability to think between actions, and it solves problems that neither reasoning nor acting can handle alone. Every agent framework built since — AutoGPT, LangChain agents, Claude’s tool use — runs some version of this loop.

ReAct didn’t just combine two techniques. It established the interface between language models and the world.

What it didn’t solve is what happens when the loop goes wrong.


Open question

ReAct works well in bounded environments — a Wikipedia API with three actions, a simulated store with a defined product space. The reasoning loop holds up. But scale the action space and a new problem emerges: when the agent goes wrong, there’s no recovery mechanism within the run. It keeps going, accumulating errors, with no way to step back and say “that approach isn’t working, let me try something different.” The loop is powerful but it’s memoryless across attempts.

Reflexion (Shinn et al., 2023) is the direct response to that. What if the agent could reflect on what went wrong in natural language, store that reflection, and use it to do better on the next attempt — without any retraining?

References
ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao. ICLR 2023. arXiv:2210.03629.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou. NeurIPS 2022. arXiv:2201.11903.

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht. ICLR 2021. arXiv:2010.03768.

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, Christopher D. Manning. EMNLP 2018. arXiv:1809.09600.

Self-Consistency Improves Chain of Thought Reasoning in Language Models. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou. ICLR 2023. arXiv:2203.11171.

WebGPT: Browser-Assisted Question-Answering with Human Feedback. Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, John Schulman. arXiv preprint, 2021. arXiv:2112.09332.