
Speculative Decoding — In-Depth Technical Review (English)

Author: zhongzhu zhou
Paper: Fast Inference from Transformers via Speculative Decoding (ICML 2023 Oral)
ArXiv: https://arxiv.org/abs/2211.17192


TL;DR (1-minute version)

Speculative decoding is a way to make large language model (LLM) inference faster without changing the final output distribution of the target model. The key trick is simple but powerful: use a much smaller, faster draft model to “guess” several future tokens, then ask the large target model to verify those guesses in one parallel pass. If a guessed token is valid under a mathematically correct acceptance rule, keep it; if not, correct at the first mismatch and continue.

In practice, this can significantly reduce wall-clock latency and increase throughput because we replace many expensive sequential target-model forward passes with fewer batched checks. The paper reports meaningful speedups (often around 2× and sometimes higher) while preserving exactness.

Why this matters for beginners and practitioners:

  1. It is one of the cleanest examples of an algorithm that improves serving performance without retraining the big model.
  2. It shows how probability theory (rejection sampling style correction) can unlock systems-level gains.
  3. It became a foundation for many later “draft-and-verify” generation systems.

Estimated reading time: 25–35 minutes.


Abstract

I see this paper as a bridge between theory and production inference engineering. The authors ask a practical question: “Can we decode with fewer expensive steps on the big model, but still sample exactly from that big model?” Their answer is yes, through speculative decoding.

The method combines two ideas: (1) a small model quickly proposes a short token block, and (2) the target model evaluates that block in parallel and uses an acceptance-correction rule to preserve exactness. This makes the algorithm fast while avoiding approximation drift in the output distribution.

To me, the most important contribution is not just speedup numbers; it is the exactness guarantee under acceleration. That property makes speculative decoding attractive in production where teams care deeply about both latency and behavior consistency.


1. Prerequisites: What to Know Before Reading This Paper

1.1 Autoregressive decoding and why it is slow

Most decoder-only LLMs generate text one token at a time:

P(x_{1:T}) = \prod_{t=1}^{T} P(x_t \mid x_{<t})

At each step, we run a forward pass, get a distribution over the vocabulary, sample/select one token, append it, and repeat. This is inherently sequential because token x_t is needed before computing x_{t+1}.

For beginners, here is the practical implication:

  • Training can be highly parallelized.
  • Inference (decoding) is often bottlenecked by sequential dependence.
  • Even on a powerful GPU, each next-token step forces synchronization and memory movement.

So when users complain that “the model is slow,” the root cause is often the serial nature of decoding, not a raw shortage of FLOPs.
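As a concrete illustration of this serial loop, here is a minimal sketch; `toy_next_token_probs` and `VOCAB` are hypothetical stand-ins for a real model forward pass and vocabulary, not anything from the paper.

```python
import random

VOCAB = ["the", "cat", "sat", "<eos>"]

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for one expensive model forward pass:
    returns a distribution over VOCAB conditioned on the prefix."""
    idx = len(prefix) % len(VOCAB)            # toy rule, not a real model
    probs = [0.1 / (len(VOCAB) - 1)] * len(VOCAB)
    probs[idx] = 0.9                          # 0.9 mass on one token; sums to 1
    return probs

def decode(max_tokens, seed=0):
    """Autoregressive loop: one forward pass per emitted token.
    The pass at step t cannot start before token t-1 is chosen."""
    rng = random.Random(seed)
    prefix = []
    for _ in range(max_tokens):
        probs = toy_next_token_probs(prefix)  # the sequential bottleneck
        token = rng.choices(VOCAB, weights=probs)[0]
        if token == "<eos>":
            break
        prefix.append(token)
    return prefix
```

Every iteration waits on the previous one; speculative decoding attacks exactly this serial chain.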

1.2 Sampling, exactness, and distribution preservation

In generation, we do not just want any fast text—we want text sampled from a target distribution (after optional temperature/top-k/top-p transforms). If we modify decoding naively, we may shift distribution and change behavior.

This paper’s core promise is:

  • We can accelerate,
  • while still generating from the same target model distribution.

That “same distribution” claim is extremely important in high-stakes workflows (evaluation reproducibility, safety tuning stability, product consistency).

1.3 Rejection sampling intuition (plain-language)

Rejection sampling is like this:

  1. Use an easy proposal distribution q to draw candidates.
  2. Accept candidate with probability related to target/proposal ratio.
  3. If rejected, redraw from a corrected residual distribution.

If done correctly, accepted outputs follow the target distribution p, even though proposals came from q.

Speculative decoding is essentially a sequential-and-parallel hybrid form of this idea at token level.
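The three steps above can be demonstrated at the single-token level. In this sketch, `p` (target) and `q` (proposal) are hypothetical distributions over a three-token vocabulary; the frequency check at the end shows that outputs follow p even though every proposal came from q.

```python
import random

def speculative_sample(p, q, rng):
    """Draw from proposal q, accept with probability min(1, p/q);
    on rejection, redraw from the normalized residual [p - q]_+.
    The returned token is distributed exactly according to p."""
    tokens = list(p)
    x = rng.choices(tokens, weights=[q[t] for t in tokens])[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                                  # accepted draft sample
    residual = {t: max(p[t] - q[t], 0.0) for t in tokens}
    z = sum(residual.values())                    # residual normalizer
    return rng.choices(tokens, weights=[residual[t] / z for t in tokens])[0]

# Empirical check: frequencies should match p, not q.
p = {"a": 0.6, "b": 0.3, "c": 0.1}
q = {"a": 0.2, "b": 0.5, "c": 0.3}
rng = random.Random(0)
N = 20000
counts = {t: 0 for t in p}
for _ in range(N):
    counts[speculative_sample(p, q, rng)] += 1
```

Note how the acceptance step handles the mass shared between p and q, while the residual step covers what the target wants but the proposal underweights.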

1.4 KV cache and serving bottlenecks

Modern transformer inference heavily uses KV cache: each new token appends key-value states so future attention can reuse old states. This reduces recomputation but still keeps generation token-serial.

Speculative decoding helps because a block verification pass amortizes expensive target-model interactions, improving effective token-per-target-step efficiency.


2. What This Paper Does (The Core Idea)

The method runs two models:

  • Draft model M_d: small and fast; proposes k future tokens.
  • Target model M_t: large and accurate; verifies proposals in one pass.

Pipeline for one iteration:

  1. Conditioned on the current prefix, the draft model generates tokens \hat{x}_{t+1}, \ldots, \hat{x}_{t+k}.
  2. Target model evaluates probabilities for these positions in parallel.
  3. For each proposed token, perform acceptance test using target-vs-draft probability ratio.
  4. Accept a prefix of proposed tokens until first rejection.
  5. At rejection point, sample one corrected token from residual distribution and continue.

I like this design because it directly attacks the expensive component: the number of large-model decode steps. Instead of forcing the target model to decode one token at a time, we let the small model do speculative lookahead.
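The five-step iteration above can be sketched as follows. The per-position distributions are hypothetical stand-ins for real draft/target forward passes, and the bonus step of sampling one extra target token when the whole block is accepted is omitted for brevity.

```python
import random

def one_iteration(draft_dists, target_dists, proposed, rng):
    """Verify one proposed block: accept a prefix via the min(1, p/q) rule,
    and on first rejection emit one corrected token from [p - q]_+."""
    out = []
    for i, x in enumerate(proposed):
        p, q = target_dists[i][x], draft_dists[i][x]
        if rng.random() < min(1.0, p / q):
            out.append(x)                     # position accepted, keep going
            continue
        # First rejection: sample from the normalized residual distribution.
        res = {t: max(target_dists[i][t] - draft_dists[i].get(t, 0.0), 0.0)
               for t in target_dists[i]}
        z = sum(res.values())
        out.append(rng.choices(list(res), weights=[res[t] / z for t in res])[0])
        break
    return out

# Illustrative two-position block over a two-token vocabulary.
draft = [{"a": 0.7, "b": 0.3}, {"a": 0.5, "b": 0.5}]
target = [{"a": 0.8, "b": 0.2}, {"a": 0.1, "b": 0.9}]
tokens = one_iteration(draft, target, ["a", "a"], random.Random(1))
```

Here position 0 is always accepted (the target assigns "a" more mass than the draft did, so the clipped ratio is 1), while the second "a" is accepted only with probability 0.1/0.5 and is otherwise replaced by a residual sample.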

Why this is different from approximate acceleration

Many speedup techniques trade quality or alter output statistics. Speculative decoding is different: it preserves exact target distribution under the algorithm’s correction rule. That means you can drop it into existing generation systems with much lower semantic risk.

Figure-level intuition (as commonly shown in speculative decoding papers)

When you see the core diagram, read it as:

  • bottom lane = fast draft trajectory,
  • top lane = expensive verifier,
  • acceptance boundary = where verifier disagrees.

The system gains speed whenever multiple consecutive draft tokens are accepted before first rejection.


3. Method Details

3.1 Draft generation phase

At prefix x_{1:t}, the draft model samples/proposes k tokens sequentially:

\hat{x}_{t+i} \sim q_i(\cdot) = P_d(\cdot \mid x_{1:t}, \hat{x}_{t+1:t+i-1})

Why this helps:

  • Draft model is smaller, so each step is cheap.
  • We get a block candidate quickly.

Engineering note:

  • In serving stacks, draft model can be quantized aggressively or even distilled specifically for domain prompts.
  • If draft is too weak, acceptance rate drops and speedup shrinks.
  • If draft is too large, drafting cost eats benefits.

So there is a practical optimization problem: choose draft size for best end-to-end latency.

3.2 Parallel verification by target model

Given full proposed block, target model computes position-wise conditional probabilities for that block. This is where parallelism appears: one pass can score multiple speculative positions.

Denote target probability for proposed token at position $i$ as $p_i = P_t(\hat{x}_{t+i}\mid \text{proper context})$, and draft probability as $q_i$.

Acceptance at each position is based on the ratio p_i/q_i (clipped at 1):

a_i = \min\left(1, \frac{p_i}{q_i}\right)

Accept with probability a_i. Continue until the first rejection.

Why this is elegant:

  • If draft overestimates a token compared with target, acceptance is reduced.
  • If draft underestimates, acceptance can be 1.
  • The correction is local yet globally distribution-preserving.

3.3 First-rejection correction and residual sampling

Suppose the first rejection occurs at position j. Then we sample a corrected token from the residual distribution:

r(\cdot) \propto \left[p_j(\cdot) - q_j(\cdot)\right]_+

(where $[z]_+ = \max(z,0)$).

This residual step is the heart of exactness. Intuition:

  • Accepted proposals already account for overlap between pp and qq.
  • The residual covers the part the target wants but the draft did not justify.

After emitting corrected token at rejection point, algorithm restarts from new prefix.

3.4 Exactness argument (practitioner-friendly explanation)

The proof idea is a telescoping composition of accept/reject events such that for each position, total probability mass assigned to each token equals target model mass.

You can think of mass decomposition:

  1. shared mass handled via accepted draft samples,
  2. leftover mass handled via residual correction.

As long as acceptance and residual are implemented exactly, resulting sample sequence is distributed as target model output.

3.5 Throughput model and speed intuition

Let:

  • C_t: cost of one target-model verification pass,
  • C_d: total cost of generating k draft tokens,
  • A: expected number of accepted tokens per iteration.

Very rough tokens-per-cost scales like:

\text{efficiency} \sim \frac{A}{C_t + C_d}

Baseline autoregressive target-only decoding has efficiency roughly 1/C_t. Speedup therefore requires:

  • A > 1, and
  • C_d not too large.

Hence acceptance rate is king. This directly motivates better draft alignment strategies.
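Plugging hypothetical numbers into this rough model makes the trade-off concrete (costs normalized so that one target pass costs 1):

```python
def estimated_speedup(A, C_t, C_d):
    """Speculative tokens-per-cost, A / (C_t + C_d), relative to the
    target-only baseline of 1 / C_t. Rough model only: it ignores batching
    effects, kernel overheads, and correction-step cost."""
    return (A / (C_t + C_d)) * C_t

# Illustrative: 3 accepted tokens per iteration, draft costs 20% of a target pass.
speedup = estimated_speedup(A=3.0, C_t=1.0, C_d=0.2)      # 3 / 1.2 = 2.5x

# Break-even intuition: with A = 1 the draft cost is pure overhead.
break_even = estimated_speedup(A=1.0, C_t=1.0, C_d=0.2)   # 1 / 1.2, a slowdown
```

This is why the paper's regime of interest is "draft cheap enough, acceptance high enough": either lever failing pushes the ratio back toward (or below) 1.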

3.6 Practical knobs teams can tune

  1. Speculation length k

    • Larger k: more potential gain, but higher mismatch chance.
    • Smaller k: safer acceptance, lower upside.
  2. Draft-target pairing

    • Same tokenizer strongly preferred.
    • Domain mismatch hurts acceptance sharply.
  3. Sampling policy compatibility

    • Temperature/top-p settings affect token entropy and acceptance.
    • Higher entropy typically reduces acceptance consistency.
  4. Batching regime

    • Benefit differs between single-stream latency optimization and high-throughput multi-stream serving.

4. Experiment Setup

The paper evaluates speculative decoding across language generation tasks and model pairs to quantify speedups and exactness behavior.

4.1 Models and pairings

Typical setup uses a larger target transformer and a smaller draft transformer. Key variable is scale gap:

  • too small draft may be inaccurate,
  • too large draft may be expensive.

The best regime tends to be a “cheap but not random” draft.

4.2 Baselines

Core baseline is standard autoregressive decoding on target model alone. Some analyses also compare with other acceleration or batching variants, but the central claim is speedup relative to exact target sampling.

4.3 Metrics

Main metrics:

  1. Wall-clock speedup (tokens/s or latency reduction).
  2. Acceptance-related statistics (accepted prefix length, rejection location distribution).
  3. Output distribution fidelity (theoretical exactness + empirical checks).

4.4 Hardware and implementation factors

Important for reproducibility:

  • GPU memory bandwidth and kernel efficiency matter heavily.
  • Framework-level overhead (Python dispatch, synchronization) can distort practical gains.
  • KV cache handling and tensor shape packing affect verification efficiency.

4.5 Figure/table reading guide

When reading experimental tables in this paper family, inspect in this order:

  1. Draft/target size ratio,
  2. Speculation length k,
  3. Acceptance rate,
  4. Final speedup.

If speedup is lower than expected, usually one of these is off:

  • draft mismatch too high,
  • speculation too aggressive,
  • serving stack overhead dominates.

5. Results & Analysis

5.1 Speedup is real, but conditional

The headline result is substantial inference acceleration (often around 2×, with some settings higher). My interpretation:

  • This is not a free lunch.
  • Gains come from statistical alignment between draft and target plus systems implementation quality.

In well-paired regimes, accepted-token chunks are long enough to amortize target passes. In weak-pair regimes, frequent rejections collapse gains.

5.2 Exactness is the strategic advantage

Even if an approximate method might sometimes be slightly faster, speculative decoding’s exactness can be more valuable long-term:

  • fewer surprises in A/B tests,
  • easier safety/compliance argumentation,
  • no silent drift in generation policy.

For product teams, this means lower integration risk.

5.3 Acceptance dynamics explain almost everything

From a control perspective, acceptance rate is the key state variable. It captures:

  • draft quality,
  • entropy of current generation context,
  • compatibility with decoding hyperparameters.

I recommend monitoring:

  • average accepted tokens per step,
  • rejection depth histogram,
  • per-domain acceptance (code vs dialogue vs summarization).

This helps decide whether to tune the draft model, adjust k, or route traffic adaptively.
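Two of these signals fall out of a single per-iteration counter. A minimal sketch, with hypothetical accepted depths per iteration and speculation length k = 4:

```python
from collections import Counter

# Hypothetical accepted-token depth of each iteration; depth == 4 means
# the entire 4-token speculative block was accepted before any rejection.
depths = [0, 2, 2, 4, 4, 4]

hist = Counter(depths)                    # rejection-depth histogram
avg_accepted = sum(depths) / len(depths)  # average accepted tokens per step
```

In production, the same counters would be fed from the serving loop and exported to whatever metrics system the stack already uses.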

5.4 Interaction with temperature and nucleus sampling

In higher-temperature settings, next-token uncertainty increases. Draft guesses become less aligned with target probable mass, so acceptance may fall. This means speculative decoding often shines most in deterministic or moderately stochastic generation regimes.

For beginner readers: randomness and speedup are often in tension here. You can still use both, but need careful calibration.

5.5 Systems-level implications beyond this paper

This paper influenced later verification-style acceleration families:

  • multi-level drafting,
  • tree/speculative branching,
  • dynamic speculation length policies,
  • hardware-aware scheduling for verifier kernels.

So I view it as foundational—not merely one isolated trick.

5.6 Failure-pattern analysis (my practical perspective)

Common degradation patterns:

  1. Domain shift: draft trained on generic corpus, production prompts are code-heavy/legal-medical.
  2. Tokenizer mismatch edge-cases: nontrivial token boundary differences break assumptions.
  3. Long-context instability: acceptance drops in highly constrained long-context generations.
  4. Tool-use constrained outputs: JSON/XML strict format can cause repeated local mismatches.

Mitigation tactics:

  • domain-adapt draft,
  • adaptive k by confidence,
  • fallback thresholds to target-only decoding,
  • route strict-format requests to conservative profiles.
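The "adaptive k" and fallback tactics can be combined in a small controller keyed on a recent acceptance rate. The thresholds below are illustrative choices of mine, not values from the paper:

```python
def next_speculation_length(recent_accept_rate, k_current, k_min=1, k_max=8):
    """Illustrative controller: grow k while drafts are being accepted,
    shrink it as acceptance weakens, and return 0 (disable speculation,
    i.e., fall back to target-only decoding) when acceptance collapses."""
    if recent_accept_rate < 0.2:
        return 0                                # fallback: target-only decoding
    if recent_accept_rate > 0.8:
        return min(k_current + 1, k_max)        # drafts are landing; speculate more
    if recent_accept_rate < 0.5:
        return max(k_current - 1, k_min)        # frequent rejections; be conservative
    return k_current
```

A real deployment would smooth the acceptance rate (e.g., an exponential moving average over recent iterations) and add hysteresis so k does not oscillate.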

6. Limitations & Boundary Conditions

6.1 Draft dependence

If the draft model is poor, speculative decoding may offer little gain, or even a net slowdown once overhead is counted. So this is not universally guaranteed acceleration.

6.2 Additional serving complexity

Compared with plain decoding, you now maintain two models plus acceptance/correction logic. Operational complexity increases:

  • model lifecycle management,
  • monitoring and alerting on acceptance health,
  • debugging two-model interactions.

6.3 Memory pressure and kernel complexity

Running two models can increase memory pressure. On constrained GPUs, this may reduce concurrency or force smaller batch sizes, offsetting benefits.

6.4 Extreme sampling regimes

Very high-entropy decoding can reduce acceptance too much, making speculative mode less effective.

6.5 Proof assumptions vs production details

The theoretical guarantee assumes exact implementation of probability calculations and sampling steps. Numerical shortcuts, precision issues, or heuristic hacks can silently break exactness.

I strongly recommend building explicit invariance tests in CI for inference kernels.

6.6 Fairness and safety side note

Even with exactness relative to target model, if target model itself has harmful biases, speculative decoding does not fix that. It is a speed method, not an alignment method.


7. Reproducibility & Practical Notes

7.1 Can we reproduce from paper alone?

Conceptually yes, but high-performance reproduction requires careful systems engineering. A naive implementation may fail to realize headline speedups.

7.2 Suggested rollout phases

Phase A: Offline validation

  1. Implement exact acceptance + residual sampling with unit tests.
  2. Verify token distribution equivalence on fixed prompts and seeds (statistical tests).
  3. Profile latency decomposition: draft time, verify time, correction overhead.

Phase B: Shadow traffic

  1. Run speculative mode in shadow behind target-only serving.
  2. Compare latency percentiles, acceptance metrics, and output equivalence diagnostics.
  3. Trigger fallback when acceptance drops below threshold.

Phase C: Controlled rollout

  1. Start with low-entropy workloads (summaries, deterministic transforms).
  2. Enable a dynamic k policy.
  3. Expand gradually to broader tasks.

7.3 Suggested observability metrics

At minimum:

  • accepted_tokens_per_iteration,
  • rejection_position_histogram,
  • speculative_speedup_ratio,
  • fallback_rate,
  • verifier_gpu_utilization,
  • end-to-end p50/p95 latency.

Without these, teams are flying blind.

7.4 Capacity planning for beginners

Rule-of-thumb guidance:

  • If your target-model inference is clearly the bottleneck and you can host a small draft model nearby, speculative decoding is worth trialing.
  • If your bottleneck is network, tokenization, or downstream tool latency, gains may be limited.

7.5 Security/compliance angle

Because distribution is preserved, auditing becomes simpler: behavior differences are less likely to come from decoding approximation drift. This can matter for enterprise governance reviews.

7.6 My bottom-line practitioner verdict

I would recommend speculative decoding in production if:

  1. you can afford operational complexity of two-model serving,
  2. you implement exactness-critical components rigorously,
  3. your measured acceptance rates are healthy in real traffic.

For many LLM systems, this paper’s method is one of the highest ROI inference optimizations available before model architecture changes.


Appendix A: Beginner-friendly analogy

Imagine a senior lawyer (target model) and a trained junior assistant (draft model) writing a contract.

  • Junior drafts several clauses quickly.
  • Senior reviews in one pass.
  • Correct clauses are kept.
  • First problematic clause is rewritten by senior.

You get senior-level quality, but faster throughput than having senior write every line from scratch. That is speculative decoding in human terms.


Appendix B: Implementation anti-patterns I have seen

  1. Using different post-processing transforms on draft and target logits.
  2. Ignoring numerical stability in ratio computation.
  3. Mixing sampling seeds in a way that breaks reproducibility checks.
  4. Overly large k with no adaptive control.
  5. Not instrumenting rejection causes.

Each of these can erase expected gains.


References

  1. Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.
  2. Vaswani, A. et al. (2017). Attention Is All You Need.
  3. Kwon, W. et al. (2023). vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention.
  4. Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
  5. Pope, R. et al. (2022). Efficiently Scaling Transformer Inference (related systems context).

Review written on 2026-03-11.