1. What This Paper Does
Imagine you ask a very smart assistant to solve a math word problem. If you just say "answer this," the assistant might blurt out a number without thinking it through—and often get it wrong. But if you first show the assistant a few examples of how to think step by step, suddenly it can solve much harder problems. That is the core insight of this paper.
Wei et al. introduce chain-of-thought (CoT) prompting, a remarkably simple technique: instead of giving a language model plain input-output examples in a few-shot prompt, you include intermediate reasoning steps—a "chain of thought"—in each example. The model then learns to produce its own chain of thought before arriving at an answer. No fine-tuning, no new training data, no architectural changes—just a different way of writing your prompt.
The results are striking. On the GSM8K math word problem benchmark, a 540-billion-parameter PaLM model with just eight chain-of-thought exemplars achieves state-of-the-art accuracy of 56.9%, surpassing even a fine-tuned GPT-3 with a verifier (which scored 55%). The technique works across arithmetic reasoning, commonsense reasoning, and symbolic reasoning tasks.
This paper is one of the most influential works in the modern LLM era. It fundamentally changed how practitioners interact with large language models and opened the door to an entire family of prompting-based reasoning methods (Tree of Thoughts, Self-Consistency, Least-to-Most, etc.). Understanding it deeply is essential for anyone working with or building on top of LLMs.
2. Prerequisites: What You Need to Know First
Before diving into the technical details, let us build up the background knowledge you need. If you are already familiar with language models and prompting, feel free to skim this section—but we include it for completeness because the ideas here are foundational.
2.1 What Is a Language Model?
A language model is a system that has been trained to predict the next word (or token) in a sequence of text. Modern large language models (LLMs) like GPT-3, PaLM, and LaMDA are built on the Transformer architecture (Vaswani et al., 2017) and trained on massive corpora of text from the internet, books, code, and other sources.
The key insight is that by learning to predict "what comes next," these models absorb a vast amount of knowledge about language, facts, reasoning patterns, and even some degree of logical structure. The larger the model (more parameters) and the more data it sees, the more capable it generally becomes.
Parameters are the internal numbers (weights) that the model learns during training. When we say "PaLM 540B," we mean a model with 540 billion parameters—an extraordinarily large neural network.
2.2 What Is Prompting?
Prompting is a way to use a language model without additional training. Instead of fine-tuning the model on new data, you provide it with a carefully crafted input (the "prompt") that steers it toward producing the output you want.
Few-shot prompting, popularized by Brown et al. (2020) in the GPT-3 paper, works as follows: you include a few examples of input-output pairs at the beginning of your prompt, and then present a new input. The model, having seen the pattern, produces an output in the same format. For instance:
```
Q: What is 2 + 3?
A: 5

Q: What is 10 - 4?
A: 6

Q: What is 15 + 16?
A:
```
The model sees the pattern and fills in "31." This works surprisingly well for simple tasks, but breaks down for complex reasoning—the model tends to just guess an answer without working through the logic.
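The few-shot prompt above can be assembled programmatically. A minimal sketch—`build_few_shot_prompt` is a hypothetical helper, not from the paper, and the resulting string would be passed to whatever LLM completion API you use:

```python
# Build a few-shot prompt from (question, answer) exemplars,
# ending with an unanswered question for the model to complete.

def build_few_shot_prompt(exemplars, new_question):
    """Concatenate Q/A exemplars, then pose the new question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)

exemplars = [
    ("What is 2 + 3?", "5"),
    ("What is 10 - 4?", "6"),
]
prompt = build_few_shot_prompt(exemplars, "What is 15 + 16?")
print(prompt)  # ends with "A:", inviting the model to answer
```

The trailing "A:" is what cues the model to continue the pattern rather than start a new question.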
2.3 Why Do Language Models Struggle with Reasoning?
Despite their impressive capabilities, language models have a fundamental limitation when it comes to multi-step reasoning: they generate output token by token, with each token conditioned on all previous tokens. When a model is asked to directly produce a final answer, it must effectively compress all the intermediate reasoning into a single forward pass through the network.
Consider a problem like: "A store had 23 apples. They used 20 to make lunch and bought 6 more. How many do they have?" A human would naturally think: "23 minus 20 is 3, then 3 plus 6 is 9." But a model trained to produce answers directly might output "29" (simply computing 23 + 6 and skipping the subtraction) or take some other incorrect shortcut, because it never explicitly performed the intermediate steps.
This is related to a deeper theoretical point: Transformer models process all tokens in parallel within a layer, and the number of "reasoning steps" they can perform is bounded by the number of layers. For problems requiring many sequential reasoning steps, the model's fixed depth becomes a bottleneck.
2.4 What Are Rationales and Intermediate Steps?
The idea of producing intermediate reasoning steps is not new to this paper. Prior work explored training models to generate rationales—natural language explanations that lead to the final answer:
- Ling et al. (2017) trained models from scratch on math word problems with step-by-step solutions.
- Cobbe et al. (2021) created the GSM8K dataset and fine-tuned GPT-3 to produce solution steps, then trained a separate "verifier" to check the solutions.
- Nye et al. (2021) showed that predicting intermediate computation states ("scratchpads") improves program execution prediction.
The limitation of these approaches is that they require training data with rationales, which is expensive to create. You need humans to write out step-by-step solutions for thousands or millions of examples.
2.5 Scaling Laws and Emergent Abilities
A crucial concept for understanding this paper is emergent abilities of language models. An emergent ability is one that is not present in smaller models but appears in larger ones—often quite suddenly.
Research has shown that many capabilities (following instructions, few-shot learning, basic reasoning) only emerge once models reach a certain scale threshold. Below that threshold, performance is essentially random; above it, the model suddenly "gets it." Wei et al. (2022b) documented this phenomenon extensively.
Chain-of-thought prompting turns out to be one of these emergent abilities: it only works with models of approximately 100 billion parameters or more. Smaller models, when prompted with chain-of-thought examples, produce fluent but logically incorrect reasoning chains, actually performing worse than standard prompting.
2.6 The Benchmarks
To understand the experiments, you need to know about the key benchmarks used:
- GSM8K (Cobbe et al., 2021): 8,500 grade school math word problems requiring 2-8 steps of arithmetic reasoning. This is the paper's flagship benchmark. Example: "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends with four. She sells the rest at $2 each. How much does she make every day?"
- SVAMP (Patel et al., 2021): Math word problems with structural variations designed to test robustness.
- ASDiv (Miao et al., 2020): Diverse math word problems across different difficulty levels.
- AQuA (Ling et al., 2017): Multiple-choice algebraic word problems.
- MAWPS (Koncel-Kedziorski et al., 2016): A benchmark of easier math word problems, useful for baseline comparisons.
- CSQA (Talmor et al., 2019): Commonsense questions about the world.
- StrategyQA (Geva et al., 2021): Questions requiring multi-hop reasoning strategies ("Was Aristotle's nationality the same as Plato's?").
- BIG-bench tasks: Date Understanding and Sports Understanding—specialized reasoning tasks from Google's Beyond the Imitation Game benchmark.
- SayCan (Ahn et al., 2022): Robot planning tasks mapping natural language instructions to action sequences.
3. Core Method: Chain-of-Thought Prompting
3.1 The Basic Idea
The method is deceptively simple. In standard few-shot prompting, each exemplar is an (input, output) pair:
```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.
```
In chain-of-thought prompting, each exemplar becomes an (input, chain of thought, output) triple:
```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.
```
That is it. The "chain of thought" is a series of natural language reasoning steps inserted between the question and the final answer. When the model sees several such examples in its prompt, it learns to produce its own chain of thought for the new question, and this intermediate reasoning dramatically improves accuracy.
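The (input, chain of thought, output) structure can be sketched in a few lines of code. This is a hypothetical helper (the paper writes prompts by hand); the Roger exemplar is the paper's own, and `build_cot_prompt` is an assumed name:

```python
# Build a chain-of-thought prompt: each exemplar is a
# (question, chain_of_thought, answer) triple. The chain of thought
# is inserted between the question and the final answer.

def build_cot_prompt(triples, new_question):
    blocks = [
        f"Q: {q}\nA: {cot} The answer is {a}."
        for q, cot, a in triples
    ]
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)

triples = [(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11.",
    "11",
)]
prompt = build_cot_prompt(
    triples,
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
)
```

Because each exemplar ends with "The answer is …", a fixed phrase like this also makes the model's final answer easy to extract programmatically.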
3.2 Why This Works: Four Key Properties
The authors identify four properties that make chain-of-thought prompting attractive:
Property 1: Decomposition. Chain of thought allows the model to break multi-step problems into intermediate steps. This effectively allocates more computation to harder problems—the model generates more tokens (and thus performs more "thinking") for complex questions.
This is a profound point. In standard prompting, every problem gets the same amount of computation regardless of difficulty. Chain-of-thought prompting naturally adapts: a simple question might get a one-sentence chain of thought, while a complex question gets a paragraph of reasoning.
Property 2: Interpretability. The generated chain of thought provides a window into the model's reasoning process. If the model gets a wrong answer, you can examine the chain of thought to identify exactly where the reasoning went wrong—was it a calculation error? A misunderstanding of the problem? A missing step?
Property 3: Generality. Because chains of thought are expressed in natural language, the method is applicable to any task that humans can solve by thinking through steps: math, commonsense reasoning, symbolic manipulation, planning, and more.
Property 4: Simplicity. No fine-tuning, no new model training, no architectural changes. You just need to write a few examples with chains of thought and prepend them to your prompt. This makes the method immediately usable with any off-the-shelf large language model.
3.3 The Prompt Design
The authors manually wrote eight chain-of-thought exemplars for arithmetic reasoning tasks. These are actual step-by-step solutions written in natural language. A critical point is that no prompt engineering was performed—the authors simply wrote natural solutions without optimizing them for performance. This makes the results more impressive, as better-engineered prompts might yield even higher accuracy.
For multiple-choice tasks (AQuA), four exemplars with solutions from the training set were used. For commonsense reasoning tasks, exemplars were written manually with appropriate chains of thought. For symbolic reasoning tasks, chains of thought described the step-by-step procedure.
The authors used the same set of eight exemplars across all free-response arithmetic benchmarks, demonstrating that a single prompt can generalize across different datasets—you do not need task-specific prompt design.
3.4 Decoding Strategy
All experiments use greedy decoding—the model simply picks the most likely next token at each step. The authors note that follow-up work (Wang et al., 2022a, on Self-Consistency) shows that sampling multiple chains of thought and taking the majority vote on the final answer further improves performance. But the results in this paper use the simplest possible decoding strategy.
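The self-consistency idea mentioned above reduces to a majority vote over sampled answers. A minimal sketch—here the five sampled answers are hard-coded stand-ins for what a temperature > 0 LLM call would return:

```python
from collections import Counter

# Self-consistency (Wang et al., 2022a), sketched: sample several
# chains of thought, extract each final answer, take the majority vote.

def majority_vote(answers):
    """Return the most common final answer among the sampled chains."""
    return Counter(answers).most_common(1)[0][0]

# Pretend we sampled 5 chains of thought and extracted these answers:
sampled_answers = ["9", "9", "8", "9", "11"]
print(majority_vote(sampled_answers))  # -> "9"
```

The intuition: a wrong chain of thought can land on many different wrong answers, but correct chains tend to converge on the same one, so the mode is more reliable than any single greedy decode.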
4. Experimental Results: Arithmetic Reasoning
4.1 Main Results
The arithmetic reasoning experiments are the heart of the paper. The authors evaluate five model families:
- GPT-3 (InstructGPT variants): 350M, 1.3B, 6.7B, 175B parameters
- LaMDA: 422M, 2B, 8B, 68B, 137B parameters
- PaLM: 8B, 62B, 540B parameters
- UL2: 20B parameters
- Codex (code-davinci-002)
The results, presented in Figure 4 and Appendix Table 2, tell a compelling story:
| Model | GSM8K (Standard) | GSM8K (CoT) | Improvement |
|---|---|---|---|
| LaMDA 137B | 6.5% | 17.1% | +10.6pp |
| GPT-3 175B | 15.6% | 46.9% | +31.3pp |
| PaLM 62B | 18.1% | 33.7% | +15.6pp |
| PaLM 540B | 17.9% | 56.9% | +39.0pp |
| Codex | 19.7% | 63.1% | +43.4pp |
On GSM8K, chain-of-thought prompting more than triples PaLM 540B's accuracy. On SVAMP, PaLM 540B achieves 79.0% with CoT versus 63.1% with standard prompting. On MAWPS, the gap narrows because the problems are simpler—but CoT still helps.
Crucially, PaLM 540B with CoT prompting achieves 56.9% on GSM8K, surpassing the prior best of 55% (fine-tuned GPT-3 with a verifier trained on 8,000+ solutions). This is remarkable because CoT prompting uses no training data—just eight manually written examples.
4.2 The Emergence Phenomenon
The most important finding is that chain-of-thought prompting is an emergent ability of model scale:
- For models below ~10B parameters, CoT prompting hurts performance. Small models produce fluent but logically incorrect chains of thought, which leads them astray.
- The benefit appears only at around 100B parameters.
- The effect strengthens with scale: PaLM 540B benefits much more than PaLM 62B.
This emergence pattern, shown in Figure 4, has profound implications. It suggests that the ability to follow logical reasoning steps in a chain of thought is not something that gradually improves with scale—it appears relatively suddenly once the model is large enough to reliably execute individual reasoning steps.
The authors qualitatively examined why small models fail: they found that smaller models produce chains of thought that are grammatically correct but logically nonsensical. The model can mimic the format of step-by-step reasoning without actually performing the reasoning.
4.3 Error Analysis
The authors performed a detailed manual error analysis on LaMDA 137B's outputs on GSM8K:
Correct answers (50 samples): Of 50 cases where the model got the right answer, 48 had completely correct chains of thought. Only 2 arrived at the correct answer coincidentally through an incorrect chain of thought. This suggests that when CoT prompting works, it works for the right reasons—the model is genuinely performing step-by-step reasoning.
Incorrect answers (50 samples):
- 46% had minor errors: Calculator mistakes (8%), symbol mapping errors (16%), or one reasoning step missing (22%). The chain of thought was largely correct but had a small flaw.
- 54% had major errors: Semantic misunderstanding (28%) or incoherent chain of thought (26%).
Scaling fixes errors: When the authors compared errors from PaLM 62B to PaLM 540B, they found that scaling to 540B fixed a large proportion of the "one-step missing" and "semantic understanding" errors. This suggests that continued scaling could further reduce errors.
5. Experimental Results: Commonsense Reasoning
5.1 Setup and Benchmarks
To show that chain-of-thought prompting is not limited to math, the authors evaluate on five commonsense reasoning benchmarks: CSQA, StrategyQA, Date Understanding, Sports Understanding, and SayCan (robot planning).
For each benchmark, they manually wrote chain-of-thought exemplars. For example, for StrategyQA ("Yes or no: Would a pear sink in water?"), the chain of thought is: "The density of a pear is about 0.6 g/cm³, which is less than water. Thus, a pear would float. So the answer is no."
5.2 Results
The commonsense results (Figure 7, Table 4) show:
| Task | PaLM 540B Standard | PaLM 540B CoT | Improvement |
|---|---|---|---|
| CSQA | 77.5% | 79.9% | +2.4pp |
| StrategyQA | 65.4% | 75.6% | +10.2pp |
| Date Understanding | 62.5% | 77.0% | +14.5pp |
| Sports Understanding | 90.4% | 95.4% | +5.0pp |
| SayCan (robot planning) | — | 97.0% | — |
Key observations:
- StrategyQA sees the largest gain, reaching 75.6%—surpassing the prior supervised state of the art of 69.4%. This is impressive because StrategyQA requires multi-hop reasoning.
- Sports Understanding reaches 95.4%, exceeding the performance of "an unaided sports enthusiast" (84%).
- CSQA shows minimal improvement (+2.4pp), likely because the questions are more about knowledge retrieval than multi-step reasoning.
- SayCan demonstrates that CoT can be used for robot action planning—a different modality than pure language reasoning.
The commonsense results confirm that chain-of-thought prompting is broadly applicable wherever multi-step reasoning is needed, not just for mathematical computation.
6. Experimental Results: Symbolic Reasoning
6.1 Two Toy Tasks
The authors design two symbolic reasoning tasks:
Last Letter Concatenation: Given a name like "Amy Brown," concatenate the last letters of each word → "yn." This requires the model to identify each word, extract its last letter, and concatenate them.
Coin Flip: Track the state of a coin through a series of flip/no-flip operations. "A coin is heads up. Maybelle flips the coin. Shalonda does not flip the coin. Is the coin still heads up?" → "no."
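Both toy tasks have trivially computable ground truth, which is what makes them useful for generating evaluation data at arbitrary lengths. A sketch of reference solvers (my own code, not the paper's):

```python
# Reference solutions for the two symbolic tasks. Ground truth is
# computable, so test sets of any name length or flip count can be
# generated for out-of-domain (length generalization) evaluation.

def last_letter_concat(name: str) -> str:
    """Concatenate the last letter of each word in the name."""
    return "".join(word[-1] for word in name.split())

def coin_still_heads(flips: list[bool]) -> str:
    """Coin starts heads up; each True entry flips it once.
    Heads iff the total number of flips is even."""
    return "yes" if sum(flips) % 2 == 0 else "no"

print(last_letter_concat("Amy Brown"))  # -> "yn"
print(coin_still_heads([True, False]))  # one flip -> "no"
```

Note that Coin Flip is just parity, and Last Letter Concatenation is a fixed per-word procedure—simple for a program, but requiring the model to execute the procedure step by step rather than recall an answer.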
6.2 In-Domain and Out-of-Domain Results
These tasks are interesting because they allow testing length generalization: the few-shot exemplars use 2-word names and 2-person coin flips, but evaluation includes 3-word and 4-word names, and 3-person and 4-person coin flips.
In-domain (same length as exemplars):
- PaLM 540B with CoT achieves near-100% on both tasks.
- Standard prompting also performs well on coin flip for PaLM 540B, but poorly for smaller models and for letter concatenation.
Out-of-domain (longer sequences):
- Standard prompting fails completely on longer sequences.
- CoT prompting shows upward scaling curves—larger models generalize better to longer sequences, though performance is lower than in-domain.
This is significant: it suggests that chain-of-thought prompting does not just memorize specific patterns but actually enables some degree of algorithmic generalization—the model learns a procedure and can apply it to longer inputs than it has seen in the prompt.
7. Ablation Studies: What Makes CoT Work?
The ablation study (Figure 5) is one of the most insightful parts of the paper. The authors test three variations to understand why chain of thought helps:
7.1 Equation Only
Setup: Instead of a full natural language chain of thought, the model is prompted to output only the mathematical equation before the answer.
Result: On GSM8K, equation-only prompting shows minimal improvement. But on simpler datasets (SingleOp, which requires only one arithmetic step), equation-only does help.
Interpretation: For complex multi-step problems, the semantic understanding expressed in natural language reasoning steps is essential—you cannot skip straight to the equation. The natural language serves as a bridge between the problem statement and the mathematical computation.
7.2 Variable Computation Only
Setup: The model is prompted to output a sequence of dots ("...") equal in length to the equation that would solve the problem. This tests whether simply generating more tokens (and thus performing more computation) is what helps.
Result: Performance is approximately the same as standard prompting—no improvement.
Interpretation: The benefit of chain-of-thought prompting is not simply about allocating more compute time. The content of the intermediate steps matters. The model needs to perform meaningful reasoning, not just burn tokens.
7.3 Chain of Thought After Answer
Setup: The chain of thought is provided after the answer in the exemplars, so the model learns to give the answer first and then explain it.
Result: Performance drops back to baseline levels.
Interpretation: The chain of thought must come before the answer to be useful. This confirms that the model actually uses the intermediate reasoning steps to arrive at the answer, rather than the chain of thought simply "activating relevant knowledge" that helps the model guess better.
These ablation results collectively demonstrate that chain-of-thought prompting works because it enables the model to perform genuine sequential reasoning, decomposing complex problems into manageable steps.
8. Robustness Analysis
8.1 Different Annotators
Three different annotators (A, B, C) independently wrote chains of thought for the same exemplars. Despite significant stylistic differences, all three sets of chains of thought substantially outperformed standard prompting. The variance between annotators was smaller than the gap between CoT and standard prompting.
8.2 Concise vs. Verbose
A deliberately concise version of the chain of thought (shorter sentences, fewer details) was tested. It still outperformed standard prompting, though not quite as much as the fuller version. This suggests that even minimal step-by-step reasoning helps.
8.3 Different Exemplars
Three sets of eight exemplars randomly sampled from the GSM8K training set (which includes solution steps) were tested. These performed comparably to the manually written exemplars, showing that the technique is not dependent on specifically crafted examples.
8.4 Exemplar Order and Count
Additional experiments (Appendix A.2) show that CoT prompting is robust to different orderings of exemplars and works with as few as 1-2 exemplars (though performance improves with more). This is noteworthy because standard few-shot prompting is known to be highly sensitive to exemplar ordering.
9. Limitations and Boundary Conditions
The authors are admirably transparent about limitations:
9.1 Scale Requirement
Chain-of-thought prompting only works with very large models (~100B+ parameters). This is a significant practical limitation because running 100B+ parameter models is expensive and requires specialized hardware. At the time of publication, such models were only available through API access from a few companies.
9.2 No Guarantee of Correct Reasoning
The model can produce plausible-sounding but incorrect chains of thought. The error analysis shows that 54% of incorrect answers on GSM8K involved major reasoning errors (not just calculation mistakes). A confident, well-formatted chain of thought can be completely wrong.
9.3 Not Actually "Reasoning"
The authors explicitly note that while chain of thought "emulates the thought processes of human reasoners, this does not answer whether the neural network is actually 'reasoning.'" The model might be performing sophisticated pattern matching rather than true logical reasoning. This remains an open philosophical and empirical question.
9.4 Cost of Longer Outputs
Chain-of-thought prompting increases the number of generated tokens, which increases inference cost and latency. For applications requiring real-time responses, this overhead may be significant.
9.5 Annotation Cost for Fine-Tuning
While the few-shot setting requires minimal annotation (just a handful of examples), using chain-of-thought for fine-tuning would require large-scale rationale datasets, which are expensive to create.
9.6 Task Dependency
CoT prompting helps most on multi-step reasoning tasks. For tasks that primarily require knowledge retrieval or pattern matching (like CSQA), the improvement is minimal. The technique is not a universal performance booster.
10. Reproducibility
10.1 Prompt Availability
All exemplar prompts are provided in the appendix (Appendix G, Tables 20-21), making exact reproduction possible if you have access to the same models.
10.2 Model Access
The main results use GPT-3 (available via OpenAI API), LaMDA (Google internal), PaLM (Google internal), and Codex (available via OpenAI API at the time). Today, capable open-weight models exist (e.g., Llama 70B, Mixtral, Qwen), making reproduction of the qualitative findings more accessible, even if the exact models remain inaccessible.
10.3 Evaluation
The evaluation uses standard benchmarks with publicly available datasets. The evaluation metric is simple: exact match accuracy on the final answer. The authors use greedy decoding throughout, making results deterministic for a given prompt and model.
10.4 Key Hyperparameters
- Number of few-shot exemplars: 8 for most tasks, 4 for AQuA
- Decoding: greedy (temperature = 0)
- For LaMDA: results averaged over 5 random seeds (different exemplar orderings)
11. Impact and Legacy
Chain-of-Thought Prompting is arguably the single most influential paper in the prompt engineering / LLM reasoning space. Its legacy includes:
Direct follow-ups:
- Self-Consistency (Wang et al., 2022a): Sample multiple chains of thought, take majority vote → significantly improves over greedy CoT.
- Zero-Shot CoT (Kojima et al., 2022): Simply appending "Let's think step by step" to a prompt, without any exemplars, also elicits chain-of-thought reasoning.
- Least-to-Most Prompting (Zhou et al., 2022): Decompose complex problems into sub-problems.
- Tree of Thoughts (Yao et al., 2023): Explore multiple reasoning branches.
- Faithful CoT (Lyu et al., 2023): Generate code alongside natural language to ensure computational correctness.
Broader impact:
- CoT prompting became standard practice in virtually all LLM applications requiring reasoning.
- OpenAI's o1/o3 models and Anthropic's Claude incorporate chain-of-thought reasoning natively.
- The idea that models should "show their work" influenced model training (RLHF on reasoning traces), evaluation (process reward models), and deployment practices.
- The observation about emergent abilities at scale helped shape the "scaling laws" narrative that drove investment in larger models.
Paradigm shift: Before this paper, the dominant view was that language models needed task-specific fine-tuning to perform reasoning. After this paper, the field shifted toward understanding what reasoning abilities are already latent in large models and how to elicit them through clever prompting.
12. My Assessment
Strengths
- Elegant simplicity. The method is trivial to implement and immediately applicable. This is rare in ML research, where methods often require complex infrastructure.
- Comprehensive evaluation. The paper tests across three reasoning categories (arithmetic, commonsense, symbolic), five model families, and multiple model scales. The ablation study is particularly well-designed.
- Honest analysis. The error analysis, limitations section, and ablation studies show genuine scientific rigor. The authors do not overclaim.
- Massive practical impact. This is one of those rare papers that immediately changed how practitioners work.
Weaknesses
- Scale requirement. The technique only works with the largest (and most expensive) models, limiting accessibility, though this has been partially addressed as open-source models have grown larger.
- No mechanistic understanding. The paper demonstrates that CoT works but offers limited insight into why at a mechanistic level. Why does scale matter? What internal representations enable chain-of-thought reasoning? These questions remain open.
- Prompt sensitivity. While the robustness analysis is encouraging, CoT prompting performance still varies meaningfully across different prompt designs, which introduces unpredictability.
- Benchmark limitations. The benchmarks, while diverse, are relatively constrained. GSM8K problems are grade-school level. Real-world reasoning tasks can be far more complex and ill-defined.
Overall
This is a landmark paper that deserves its enormous citation count and influence. It demonstrated, with clean experiments and minimal machinery, that large language models contain latent reasoning abilities that can be unlocked through the simple act of asking them to show their work. The paper's contribution is less a technical advance and more a conceptual revelation—one that reshaped the entire field of LLM research and applications.
Rating: 9/10 — Near-perfect as a scientific contribution. The simplicity of the method, rigor of evaluation, and magnitude of impact are exceptional. The only deductions are for the lack of mechanistic insight and the scale limitation.
Reviewed by Zhongzhu Zhou, March 30, 2026.