
An Interpretable Latency Model for Speculative Decoding in LLM Serving — Technical Review

Review date: 2026-05-16
Reviewer: Zhongzhu Zhou
Paper: An Interpretable Latency Model for Speculative Decoding in LLM Serving
Authors: Linghao Kong, Megan Flynn, Michael Peng, Nir Shavit, Mark Kurtz, Alexandre Marques (MIT and Red Hat AI)
arXiv: 2605.15051v1, 2026-05-14
Status: Preprint. Experiments use GuideLLM 0.5.2 driving vLLM 0.13.0.


Short answer

This paper is the first one I have read that tries to answer a question I keep hitting in production speculative decoding (SD) post-mortems: why does my measured speedup look great at batch size one and then quietly disappear once the request rate climbs?

The literature so far has been algorithmic. Leviathan/Chen showed that SD works in isolation; EAGLE-1/2/3 and PARD made drafters stronger; Medusa and Lookahead Decoding traded the draft model for extra heads or no draft at all. All of these papers measure speedup at fixed batch size, usually one. As soon as you put a draft-verify cycle inside vLLM, the picture changes because batch size is no longer something you choose. It is something the scheduler produces from the arrival stream, and it interacts with KV-cache pressure, continuous batching, and chunked prefill in non-obvious ways.

Kong et al. propose a tiny analytical model that captures this. They observe that, for a stable pre-saturation regime, average request latency obeys a roofline-style decomposition L = C_1 + B · C_2, where C_1 is load-independent and C_2 is the marginal cost per concurrent in-flight request. Little's Law lets them eliminate the unobservable batch size B by substituting B = RPS · L, giving the closed form

L = \frac{C_1}{1 - \text{RPS} \cdot C_2}.

This is the only equation you actually need to internalize for the paper. The rest is bookkeeping: extend it to SD by writing C_1 and C_2 as sums of prefill, verify, and draft costs weighted by the expected accepted-token count E, then add an MoE-aware correction φ(T) = 1 - (1 - m/M)^T that captures load-dependent expert coverage.

The empirical payoff is large. They sweep RPS, prefill length, decode length, acceptance rate α, draft length k, and verifier/drafter sizes across the Qwen3 family, Llama-3.1-{8B,70B}, and gpt-oss-20b, on A100 and H100, and show:

  1. All non-MoE measurements collapse onto a single universal curve y = 1/(1 - x) when normalized by C_1 and C_2.
  2. SD-induced speedup obeys Speedup = (1/C_{1,R}) · (1 + (1 - C_{2,R}) · r/(1 - r)), where r = RPS · C_{2,D}. The sign of speedup's slope with respect to load is determined entirely by whether C_{2,R} < 1 or C_{2,R} > 1.
  3. In real configurations C_{2,R} usually exceeds 1, which mechanistically explains the well-known "SD speedup erodes under load" phenomenon.
  4. The draft length k^* that minimizes C_{1,R} is consistently larger than the k that minimizes C_{2,R}. A single k tuned at batch size one is therefore generally wrong for high-throughput serving.
  5. MoE serving deviates from dense scaling at low load because few experts are activated, and the deviation closes as load grows. The expert-coverage correction lifts R^2 from 0.83–0.91 to 0.97–0.99 across MoE models.

What I like about this paper is that it is a modeling contribution, not a benchmark contribution. The model fits in one screen, runs cheaply (sweep nine RPS points per config, fit with SciPy), and gives operators a knob: estimate α and RPS at deploy time, then read off the best k from eq. (3). That is much more useful than yet another "we measured X tok/s on Y hardware" table.

What I want to push back on is treated below in §6: the model is a mean-latency steady-state description. It deliberately drops everything that operators usually care about most—tail latency, preemption, bursty arrivals, queueing variance—and the framing risks giving readers a false sense of completeness. I think the right way to read this work is as a first-order serving model in the same sense that the roofline model is a first-order hardware model. It is correct in its regime and wrong outside it, and the boundary should be made louder.


1. Prerequisites

This section is for readers who have shipped LLM inference but have not internalized the queueing/throughput side of speculative decoding. I cover autoregressive decoding, prefill vs decode, continuous batching, the roofline decomposition, Little's Law, and the EAGLE-style draft-verify cycle.

1.1 Autoregressive decoding and prefill/decode asymmetry

A transformer decodes tokens one at a time: at step t the model conditions on tokens 1, …, t-1 and produces a distribution over token t, which is then sampled. Two execution phases differ markedly:

  • Prefill: the full prompt of length p is processed in parallel in a single forward pass. The attention computation is O(p^2), but the workload is compute-bound and GPU-friendly. Prefill cost roughly scales as O(p · params) per request.
  • Decode: each output token is produced sequentially. Each decode step costs O(params) in FLOPs but is memory-bandwidth-bound because the entire weight matrix must be streamed from HBM to compute one token. This underutilization is what motivates speculative decoding in the first place.

The asymmetry matters: prefill has high arithmetic intensity, decode has very low. Batching helps decode (it amortizes weight streaming over multiple sequences) but interacts with KV-cache memory pressure.

1.2 Continuous batching and chunked prefill

Naive batching waits for the slowest request to finish before processing the next batch. Modern servers do continuous batching (introduced by Orca, popularized by vLLM): each iteration the scheduler picks whatever decode steps are ready, mixes them with new arrivals, and runs one forward pass. The "batch size" you see in metrics is the effective batch, a random variable that the scheduler produces from the workload.

Chunked prefill splits a long prefill into chunks so a long prompt does not block the decode queue. Combined with continuous batching, it means batch size is no longer a control variable you tune; it emerges from the request stream. The paper's whole point is to deal with this fact head-on instead of pretending you can set batch size manually.

1.3 Roofline-style decomposition

Williams et al.'s roofline model decomposes kernel time into a fixed cost (kernel launch, weight loading) and a load-dependent cost (FLOPs or memory traffic per element). For inference servers this generalizes: per-request latency contains terms that do not depend on the concurrent batch size B (e.g., loading weights once and reusing them across the batch becomes a load-independent overhead per request in the limit) and terms that scale with B (per-token compute and memory pressure). The paper writes this as

L = C_1 + B \cdot C_2

and asks SciPy to fit (C_1, C_2) from data.
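
That fit is simple enough to sketch. Substituting B = RPS · L back into the decomposition gives L = C_1 + C_2 · (RPS · L), which is linear in RPS · L, so even ordinary least squares recovers both coefficients. The snippet below uses synthetic data with invented (C_1, C_2) values; the paper itself uses SciPy's curve_fit on the closed form.

```python
import numpy as np

# Recover (C1, C2) from a latency-vs-RPS sweep. Data is synthetic and the
# "true" coefficients are invented for illustration, not taken from the paper.
c1_true, c2_true = 1.2, 0.07
rps = np.linspace(0.5, 12.0, 9)            # nine-point sweep, as in the paper
lat = c1_true / (1.0 - rps * c2_true)      # noiseless synthetic latencies

# L = C1 + C2 * (RPS * L): linear in RPS*L, so a degree-1 fit suffices.
c2_hat, c1_hat = np.polyfit(rps * lat, lat, 1)
print(round(c1_hat, 3), round(c2_hat, 3))  # recovers 1.2 and 0.07
```

With real (noisy) measurements the same regression works; curve_fit on eq. (1) just weights the points differently.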

1.4 Little's Law

Little's Law is a queueing identity from Little (1961): in any stable system, the average number of in-flight items B equals the arrival rate λ times the average residence time L:

B = \lambda \cdot L.

It is non-parametric—it does not assume Poisson arrivals, exponential service, or any other distribution. For LLM serving the units are convenient: λ = RPS (requests per second), L = average per-request latency (seconds), B = average number of in-flight requests. Substituting B = RPS · L into L = C_1 + B · C_2 gives eq. (1) above. The unobservable B vanishes.
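
A toy simulation makes the distribution-free claim concrete. Here I assume Poisson arrivals and a fixed residence time purely for convenience (neither is required by the law) and check that the time-averaged occupancy lands on λ · L:

```python
import random

# Toy check of Little's Law (illustrative, not from the paper): requests
# arrive as a Poisson stream at rate lam, each resides exactly L_res seconds.
random.seed(0)
lam, L_res, horizon = 5.0, 0.8, 10_000.0

t, events = 0.0, []
while t < horizon:
    t += random.expovariate(lam)          # next arrival time
    events += [(t, +1), (t + L_res, -1)]  # request enters, then leaves

events.sort()
area, occ, prev = 0.0, 0, 0.0
for time, delta in events:                # integrate occupancy over time
    area += occ * (time - prev)
    occ, prev = occ + delta, time

print(round(area / prev, 2))              # close to lam * L_res = 4.0
```

Swapping in bursty arrivals or variable residence times moves the variance, not the mean relationship.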

1.5 Speculative decoding in one paragraph

Speculative decoding (Leviathan et al. 2023; Chen et al. 2023) runs a small draft model to generate k candidate tokens autoregressively, then runs the large verifier model once on all k candidates in parallel. A rejection sampler accepts the longest prefix consistent with the verifier's distribution and commits those tokens; the rest are discarded and the cycle restarts. EAGLE-1/2/3 (Li et al. 2024–2025) replaces the standalone drafter with a lightweight feature-level adapter trained on the verifier's hidden states, which is what the paper uses for most experiments.

Two parameters matter: acceptance rate α (the probability that the verifier accepts any given drafted token, modeled as i.i.d.) and draft length k (how many tokens the drafter proposes per cycle). The expected number of accepted tokens per cycle is

E[\text{accepted}] = \frac{1 - \alpha^{k+1}}{1 - \alpha}.

Speedup is bounded by 1 + E[accepted] in the ideal case, but the verifier still has to run, and the drafter is not free.
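
The formula is one line of code; a quick sketch (the α = 1 branch is the k + 1 limit, which the closed form cannot evaluate directly):

```python
def expected_accepted(alpha: float, k: int) -> float:
    """E[accepted tokens per cycle] under i.i.d. acceptance, including the
    verifier's own committed token (the j = 0 term of the geometric sum)."""
    if alpha == 1.0:
        return k + 1.0                       # limit when every token is accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(round(expected_accepted(0.75, 6), 3))  # → 3.466
```

This is the same E that reappears in the worked example later in the review.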

1.6 What the paper calls "load"

Throughout the paper, load = request rate = RPS, not GPU utilization or queue length. The model is fit between two endpoints per configuration: synchronous baseline (RPS small enough that every request runs alone) and the throughput ceiling just before vLLM starts preempting. Nine evenly spaced RPS values are swept. Saturation/preemption is explicitly excluded as a regime where the model does not apply.


2. Method

2.1 The base latency model

Start from the additive decomposition L = C_1 + B · C_2. Little's Law gives B = RPS · L. Solving for L:

L(\text{RPS}) = \frac{C_1}{1 - \text{RPS} \cdot C_2}. \quad (1)

Properties worth noting:

  • At RPS → 0, L → C_1. The intercept on the y-axis is the synchronous (batch size 1) latency.
  • As RPS · C_2 → 1, L → ∞. The denominator going to zero defines the saturation boundary; the paper fits below this asymptote.
  • Normalize: define x = RPS · C_2 and y = L / C_1. Every configuration collapses to y = 1/(1 - x) regardless of model size, prefill/decode lengths, or hardware.

Figure 1 of the paper shows that collapse beautifully for dense models. It is the strongest validation in the work.

2.2 Adding speculative decoding to C_1 and C_2

The same closed form fits SD configurations one (α, k) at a time (Figure 2a), which means SD does not change the shape of the latency curve, only its parameters. But a single (C_1, C_2) does not work across all (α, k) (Figure 4a). The natural fix is to make C_1 and C_2 explicit functions of (α, k).

Cycle anatomy: each SD cycle does (i) one verifier forward pass over k+1 tokens in parallel, and (ii) k drafter steps. Prefill happens once per request. Define per-stage fixed and load-dependent costs c_{1,p}, c_{1,v}, c_{1,d}, c_{2,p}, c_{2,v}, c_{2,d}. With g = decode length and E = expected accepted tokens per cycle, the number of SD cycles per request is g/E. Therefore

C_{1,\text{EFF}} = c_{1,p} + \frac{g}{E}(c_{1,v} + k\, c_{1,d}),

C_{2,\text{EFF}} = c_{2,p} + \frac{g}{E}(c_{2,v} + k\, c_{2,d}). \quad (3)

Note the asymmetry: in C_{2,EFF} the verifier term c_{2,v} does not scale with k; only the drafter term does. Empirically the verifier's per-step load-dependent cost stays roughly constant as k grows, presumably because verification is dominated by memory traffic of the weights rather than by token count under continuous batching. The paper highlights this and uses the simpler form.

Plugging in E = (1 - α^{k+1})/(1 - α) closes the loop. Now C_1 and C_2 are explicit functions of (α, k) through six small coefficients c_{*,*} per system configuration.
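
A sketch of eq. (3) in code; all six coefficient values here are invented for illustration, and only the functional form follows the paper:

```python
def effective_coeffs(alpha, k, g, c1p, c1v, c1d, c2p, c2v, c2d):
    E = (1 - alpha ** (k + 1)) / (1 - alpha)  # expected accepted tokens/cycle
    cycles = g / E                            # draft-verify cycles per request
    c1_eff = c1p + cycles * (c1v + k * c1d)   # load-independent cost
    c2_eff = c2p + cycles * (c2v + k * c2d)   # marginal cost per in-flight req
    return c1_eff, c2_eff

# Larger k amortizes the verifier (fewer cycles) but pays more drafter steps,
# and with these invented coefficients it hurts C2_EFF before C1_EFF:
for k in (1, 4, 8):
    c1, c2 = effective_coeffs(0.75, k, 1024, 0.1, 2e-3, 2e-4, 0.01, 3e-4, 8e-5)
    print(k, round(c1, 3), round(c2, 3))
```

Sweeping k in this sketch reproduces the paper's qualitative finding that the k minimizing C_1 exceeds the k minimizing C_2.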

2.3 Speedup form and the C_{2,R} ≷ 1 test

Define the ratios C_{1,R} = C_{1,SD}/C_{1,D}, C_{2,R} = C_{2,SD}/C_{2,D}, and r = RPS · C_{2,D}. Then

\text{Speedup} = \frac{L_D}{L_{\text{SD}}} = \frac{1}{C_{1,R}} \left( 1 + (1 - C_{2,R}) \cdot \frac{r}{1-r} \right). \quad (2)

This is the workhorse formula. Three immediate consequences:

  • Zero-load speedup equals 1/C_{1,R}. SD essentially always achieves C_{1,R} < 1 (it pays for fewer verifier steps), so synchronous speedup is real.
  • The slope of speedup with respect to load is controlled solely by the sign of 1 - C_{2,R}:
    • C_{2,R} < 1: speedup grows with load. Achieved only at very high α.
    • C_{2,R} > 1: speedup shrinks with load. The common case.
  • Hyperbolic saturation: r/(1 - r) explodes as r → 1, so whichever sign the load effect has, it dominates near saturation.

This is the cleanest single explanation I have seen for why SD often "wins on the bench, loses in production." It is not a measurement artifact and it is not a drafter bug. It is a structural property of where speculation puts its cost.
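
Eq. (2) in code makes the dichotomy easy to poke at; the ratio values below are illustrative, not the paper's fits:

```python
def speedup(c1_r, c2_r, r):
    # r = RPS * C_{2,D}: the dense baseline's normalized load, 0 <= r < 1
    return (1.0 / c1_r) * (1.0 + (1.0 - c2_r) * r / (1.0 - r))

for r in (0.0, 0.3, 0.6):
    print(r,
          round(speedup(0.65, 1.21, r), 2),  # C2_R > 1: erodes with load
          round(speedup(0.65, 0.90, r), 2))  # C2_R < 1: grows with load
```

Both columns start at 1/C_{1,R}; only the sign of 1 - C_{2,R} decides which way they move from there.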

2.4 The MoE correction

MoE models systematically beat the dense model's predicted synchronous latency at low load, because not all experts are touched in a single decode step. The paper writes the expected expert coverage under i.i.d. routing as

\varphi(T) = 1 - \left(1 - \frac{m}{M}\right)^T,

where m is the number of experts per token, M is the total number of experts, and T is the effective routed-token count. For non-speculative inference T ≈ B = RPS · L; for SD verification T ≈ RPS · L · k, since each verification step processes k+1 tokens against the expert MoE. Coefficients then split into a "low-coverage" part and a "saturation increment" weighted by φ:

L = \frac{C_{1,u} + \varphi C_{1,s}}{1 - \text{RPS}(C_{2,u} + \varphi C_{2,s})}. \quad (4)

The lift in fit quality (Figure 8, Section 4.6) is non-trivial: the R^2 on the lower-load half of the sweep climbs from 0.902 to 0.997 (gpt-oss-20b), 0.830 to 0.976 (Qwen3-30B-A3B), and 0.906 to 0.989 (Qwen3-235B-A22B). That is a real improvement, especially at low load where the sparse-activation effect is largest.
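
The coverage term is just the complement of "no token routes to this expert"; a sketch with illustrative m and M (not values from the paper):

```python
def coverage(m: int, M: int, T: float) -> float:
    """phi(T) = 1 - (1 - m/M)**T: probability a given expert is touched by
    at least one of T routed tokens under i.i.d. routing."""
    return 1.0 - (1.0 - m / M) ** T

# Coverage saturates toward 1 as load (hence T) grows, which is why the MoE
# deviation from dense scaling is a low-load phenomenon:
for T in (1, 4, 16, 64):
    print(T, round(coverage(8, 128, T), 3))
```

At T = 1 only m/M of the experts are expected active; by a few dozen routed tokens, coverage is near complete and the dense model is recovered.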

2.5 What's actually being fit

For a fixed model and prefill/decode pair, the dense-only fit estimates two parameters (C_1, C_2). The speculation-aware fit estimates six (c_{1,p}, c_{1,v}, c_{1,d}, c_{2,p}, c_{2,v}, c_{2,d}), pooled across all (α, k). The MoE-aware fit adds coverage splits (C_{*,u} vs C_{*,s}) and an effective T. Curve fitting is non-linear but well-conditioned because E(α, k) varies enough across the sweep to constrain the cycle-level coefficients.


3. Experimental setup

  • Driver: GuideLLM 0.5.2 sweeping RPS in nine even steps between synchronous and the throughput ceiling.
  • Server: vLLM 0.13.0 with continuous batching and chunked prefill.
  • Verifiers: Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, gpt-oss-20b, Qwen3-{0.6B, 1.7B, 8B, 14B, 30B-A3B, 32B, 235B-A22B}.
  • Drafters: EAGLE-3-style lightweight drafters for most setups; vanilla SD with Llama-3.1-8B drafting for Llama-3.1-70B and Qwen3-1.7B drafting for Qwen3-14B.
  • Prompt/decode sweep: prefill ∈ {256, 512, 768, 1024}, decode ∈ {256, 512, 768, 1024}; 16 combinations. Inputs are simulated token sequences from Pride and Prejudice.
  • SD sweep: α ∈ {0.50, 0.55, …, 1.00} (rejection sampling overridden to fix α), k ∈ {1, …, 10}.
  • Hardware: A100 SXM (single GPU for ≤32B dense), 4× A100 for Llama-3.1-70B, 8× A100 for Qwen3-235B-A22B; H100 cross-check for dense Qwen3.
  • Fitter: SciPy curve_fit.

A practical note: by overriding rejection sampling and imposing an acceptance rate, the authors decouple measurement from drafter quality. This is the right move methodologically—it lets them isolate the effect of α from the effect of drafter architecture—but it also means that real-world deployments, where α is endogenous to the drafter, need a separate measurement of α at runtime.


4. Key results, figure by figure

4.1 Figure 1 — universal latency collapse

For all dense configurations, normalized latency L / C_1 collapses onto 1/(1 - RPS · C_2) across Qwen3-{0.6B, 1.7B, 8B, 14B, 32B} and Llama-3.1-{8B, 70B}. The collapse is striking because nothing in the underlying physics says it should work this cleanly: KV-cache pressure, the quadratic attention blow-up, and scheduler quirks could all bend the curve. The clean fit is the empirical justification for using Little's Law in the first place.

4.2 Figure 2 — SD preserves the shape, changes the parameters

(a) The Little's-Law curve still fits each (α, k) separately. (b) For Qwen3-8B at 1024/1024, C_{1,R} < 1 uniformly across settings (SD reduces fixed cost), but C_{2,R} usually exceeds 1, with the exception of α ≥ 0.9. The plot also shows that the k that minimizes C_{1,R} is much larger than the k that minimizes C_{2,R}, foreshadowing the load/no-load tradeoff.

4.3 Figure 3 — minimum C_{1,R} and C_{2,R} across prefill/decode

For each (α, prefill, decode), take the best k for C_{1,R} and the best k for C_{2,R}. C_{1,R} stays under 1 everywhere—SD's synchronous benefit is universal. C_{2,R} is still above 1 in most configurations except the very-high-α corner. The takeaway: synchronous speedup does not generalize to high RPS unless α is unusually large or k is reduced.

4.4 Figure 4 — single-vs-speculation-aware fits

(a) A single (C_1, C_2) pair fitted across all (α, k) misses badly; configurations diverge. (b) Fitted per-config (C_1, C_2) values vary smoothly in (α, k). (c) The speculation-aware C_{1,EFF}, C_{2,EFF} collapse all (α, k) onto one curve. (d) C_{*,EFF} values from the unified fit track the per-config (C_1, C_2) closely, validating the parameterization.

4.5 Figure 5 — scaling of cycle coefficients with verifier/drafter size

Across Qwen3-{0.6B, 1.7B, 8B, 14B, 32B} the cycle-level coefficients scale approximately linearly with verifier parameter count for c_{1,p}, c_{1,v}, c_{2,p}, c_{2,v} and with drafter parameter count for c_{1,d}, c_{2,d}. The cleanest scaling is c_{1,v} (verifier fixed cost) — nearly perfectly linear in verifier size, as expected from weight loading dominating per-request overhead.

4.6 Figure 6 — length dependence

c_{2,p} is roughly linear in prefill length with model-specific slopes. c_{2,v} scales with an effective token count prefill + 0.5 · decode, which is exactly the time-averaged KV-cache size during decode. This is a nice mechanistic detail.

4.7 Figure 7 — leave-n-out generalization

Train on 16 - n of the prefill/decode combinations, predict on the remaining n, for n ∈ {1, 2, 4}. Held-out R^2 stays near the full-data fit for the bigger Qwen3 models (≥8B): ≥0.97 for c_{2,p} and c_{2,v} across n. The scaling trends are predictive, not just descriptive.

4.8 Figure 8 — MoE-aware lift

For gpt-oss-20b, Qwen3-30B-A3B, and Qwen3-235B-A22B, the expert-coverage correction reduces residuals at low RPS, where sparse activation has the biggest leverage. The lift is largest when k is small, because small k means few routed tokens, which means few experts covered, which means the most deviation from dense scaling.


5. What I learned that I will actually use

Three things I am taking away into my own work.

Use Little's Law to escape the batch-size trap. The reason every SD paper I have read disagrees on speedup numbers is that they each pick different fixed batch sizes. There is no "right" batch size in continuous-batching servers. Sweeping RPS and fitting L = C_1 / (1 - RPS · C_2) removes the dependence on this ill-defined knob. I will steal this for any future serving evaluation.

Tune k as a function of load. The model gives an explicit prediction: at high load, smaller k wins because the verifier cost dominates and large k inflates C_{2,R}. This is the opposite of the conventional advice from batch-size-one SD papers, which prefer large k to amortize verifier overhead. A serving controller could in principle estimate α and RPS continuously and pick k from a small lookup table. The paper hints at this; I think it is the most actionable deployment finding.

C_{2,R} > 1 is the operational red flag. If you profile a draft-verify configuration and the fitted C_{2,R} exceeds 1, you have a synchronous-only optimization. It will lose to vanilla decoding past some load threshold, and the threshold is computable: setting speedup to 1 in eq. (2) gives r^* = (C_{1,R} - 1)/(C_{1,R} - C_{2,R}), and the threshold request rate is RPS^* = r^*/C_{2,D}. I would compute this number for every SD deployment going forward.
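
A sketch of that red-flag check, using the break-even form from Appendix A.3. The inputs (C_{1,R} = 0.65, C_{2,R} = 1.21, C_{2,D} = 0.07) are illustrative, not measurements:

```python
def break_even(c1_r, c2_r, c2_d):
    """Load where SD speedup crosses 1; meaningful when C1_R < 1 < C2_R."""
    r_star = (c1_r - 1.0) / (c1_r - c2_r)   # from setting eq. (2) to 1
    return r_star, r_star / c2_d            # (normalized load, RPS*)

r_star, rps_star = break_even(0.65, 1.21, 0.07)
print(round(r_star, 3), round(rps_star, 1))  # → 0.625 8.9
```

Any deployment whose sustained RPS exceeds the second number should reduce k or disable speculation.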


6. Limitations and where I'd push back

6.1 Mean latency only

The model targets average LL. It does not say anything direct about p95 or p99, both of which matter more in production. Section A.5 (referenced in the body but not in the part I quoted) says the model "remains a good approximation for p95 latency and for p99 latency, especially on larger models," which I read as: it sort of works but it is not the right tool. Tail latency is dominated by scheduling variance, which Little's Law explicitly averages away. I would not deploy an SD configuration based on this model alone if my SLO is on p99.

6.2 Pre-saturation regime only

The fit explicitly excludes the throughput ceiling. In production the most interesting regime is exactly the saturation regime—that is where SLOs break. The paper is honest about this ("we exclude the saturation boundary, where preemption makes latency unstable") but the limitation should be in the abstract, not the appendix. A reader could plausibly believe the model covers their target operating point when it does not.

6.3 i.i.d. acceptance assumption

Eq. (3) uses E = (1 - α^{k+1})/(1 - α), which assumes per-token acceptance is i.i.d. with constant α. Real drafters have position-dependent acceptance (acceptance drops as tokens get further from the verifier's last hidden state), correlated rejections, and sequence-conditioned variability. The paper enforces a constant α by overriding rejection sampling, so the model is internally consistent, but extrapolating to "what will my real drafter do in production" requires an independent measurement of the effective α.

6.4 Single-GPU bias

Most experiments run on a single A100 or H100. Llama-3.1-70B uses 4× A100 and Qwen3-235B-A22B uses 8× A100, but the multi-GPU regime is otherwise unexplored. Tensor parallelism, pipeline parallelism, and especially expert parallelism in MoE change the load-dependent cost structure substantively. I would want to see the MoE-aware extension tested under expert-parallel routing (Tutti, DisagMoE) before trusting eq. (5) on a frontier MoE deployment.

6.5 Adaptive drafting and tree verification not covered

EAGLE-2 and dynamic lookahead change k at runtime; tree verification (Medusa, SpecInfer) replaces the flat k with a tree of candidate continuations. The paper's framework absorbs these "through additional cost terms" but does not test them. The most-deployed SD variants in 2026 use adaptive drafting, so this is a real gap.

6.6 No closed-loop control

The model gives a static k^*(α, RPS) mapping but does not propose a controller. In a real server, α and RPS are estimated with noise and lag, and switching k has a cost (it changes kernel selection and KV-cache budget). I would want to see a stability analysis of a simple controller that picks k from a sliding-window estimate of α and RPS. This is the natural follow-on paper.
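
To make the follow-on concrete, here is a hypothetical controller core, not something the paper proposes: given current estimates of α and RPS plus fitted cycle coefficients, pick the k that minimizes predicted latency under eqs. (1) and (3). All coefficient values in the demo are invented.

```python
def best_k(alpha, rps, g, coeffs, k_max=10):
    """Hypothetical k-selection: minimize C1_eff / (1 - RPS * C2_eff)."""
    c1p, c1v, c1d, c2p, c2v, c2d = coeffs
    best_lat, best = float("inf"), None
    for k in range(1, k_max + 1):
        E = (1 - alpha ** (k + 1)) / (1 - alpha)
        c1 = c1p + g / E * (c1v + k * c1d)
        c2 = c2p + g / E * (c2v + k * c2d)
        if rps * c2 < 1.0:                 # skip k values that saturate the server
            lat = c1 / (1.0 - rps * c2)
            if lat < best_lat:
                best_lat, best = lat, k
    return best                            # None: no k stays below saturation

coeffs = (0.1, 2e-3, 2e-4, 0.01, 3e-4, 8e-5)   # invented for the demo
print(best_k(0.75, 0.1, 1024, coeffs), best_k(0.75, 4.0, 1024, coeffs))
```

With these invented coefficients the controller picks a larger k at low load than at high load, matching the paper's qualitative guidance; the open question the review raises is the stability of this loop under noisy estimates.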

6.7 What I'm convinced of anyway

Despite the above, I think the core contribution is robust. The universal latency collapse in Figure 1 is the kind of empirical regularity that is hard to fake. The speedup formula in eq. (2) gives a mechanistic explanation of a phenomenon (SD speedup erosion under load) that the field had been hand-waving. And the MoE coverage correction is a nice piece of small-physics modeling. The paper is short and useful in a way that benchmark-heavy SD papers are not.


7. Reproducibility

  • Code: The paper does not link a public repository, which is a minor disappointment given that the analytical model and SciPy fitting code would fit in one notebook. GuideLLM (Neural Magic) and vLLM are public.
  • Hardware: A100 SXM and H100 are commercially accessible; the experiment is GPU-hour-bounded rather than algorithmically complex.
  • Method: All equations, sweep ranges, and fit procedures are explicit in the body. The SciPy curve_fit recipe is standard. A diligent reader could reproduce the main fits in a weekend on a single A100.
  • Caveats: The exact vLLM version (0.13.0) matters because scheduler heuristics affect C_2; fitting on a newer vLLM will give different numerical coefficients but the same model structure.

8. Verdict

This is a good, sober paper that does one thing well: it gives a tiny analytical model for SD latency under realistic serving load. The math is light, the empirics are broad, and the operational implications are clear. The main weakness is the scope—mean latency, pre-saturation, i.i.d. acceptance—which the paper acknowledges but, in my view, undersells. Read it if you operate SD in production, if you evaluate SD methods, or if you write serving systems. Skip the cycle-level coefficient details if you only want the deployment guidance: it amounts to "small k at high load, large k at low load, check C_{2,R} in your config." I expect this paper to influence how the next round of EAGLE/PARD/Medusa benchmarks are reported. It deserves to.


Appendix A. Derivation notes

A.1 From L = C_1 + B · C_2 to eq. (1)

L = C_1 + (\text{RPS} \cdot L) C_2 \;\Rightarrow\; L(1 - \text{RPS} \cdot C_2) = C_1 \;\Rightarrow\; L = C_1/(1 - \text{RPS} \cdot C_2). Valid only when RPS · C_2 < 1, i.e., below saturation.

A.2 Speedup limit at zero load

\lim_{\text{RPS} \to 0} \text{Speedup} = \lim_{r \to 0} \frac{1}{C_{1,R}}\left(1 + (1-C_{2,R}) \cdot \frac{r}{1-r}\right) = 1/C_{1,R}. SD's "best case" is 1/C_{1,R}; achieving it requires synchronous execution.

A.3 Break-even load

Setting Speedup = 1 in eq. (2): \frac{1}{C_{1,R}}\left(1 + (1-C_{2,R})\frac{r}{1-r}\right) = 1 \;\Leftrightarrow\; r^* = \frac{C_{1,R}-1}{C_{1,R}-C_{2,R}}. If C_{1,R} < 1 < C_{2,R}, then r^* ∈ (0, 1) and the SD speedup crosses unity at RPS^* = r^*/C_{2,D}. Above RPS^*, vanilla decoding wins.

A.4 Expected accepted tokens with i.i.d. α

E = \sum_{j=0}^{k} \alpha^j = (1 - \alpha^{k+1})/(1 - \alpha). The j = 0 term reflects the verifier's own committed token, which arrives regardless of drafter acceptance. Per-cycle decoded tokens are E, and the number of cycles per request is g/E for decode length g.

A.5 Why the verifier cost is treated as k-independent in C_2

If verification were strictly O(k) in load-dependent cost, c_{2,v} would scale with k. Empirically the verifier's load-dependent term is roughly flat in k because the dominant cost is streaming weights and the KV cache, not per-token compute. The paper absorbs the small residual k-dependence into k · c_{2,d} instead, which fits better.


9. Worked example: a concrete deployment walkthrough

Equations are easier to trust once you push numbers through them. Here is a plausible production scenario consistent with the paper's measurements on Qwen3-8B at 1024 prefill / 1024 decode tokens on a single A100. I am inventing the exact numerics for illustration; the qualitative shape matches Figures 2–4.

9.1 Setup

Suppose profiling on the deployment gives, for the dense baseline (no SD):

  • C_{1,D} = 1.20 s, C_{2,D} = 0.07 s/req.

Sanity check: at RPS = 10, L_D = 1.20 / (1 - 10 × 0.07) = 1.20 / 0.30 = 4.0 s. At RPS = 12, L_D = 1.20 / (1 - 0.84) = 7.5 s. At RPS = 13, L_D = 1.20 / (1 - 0.91) = 13.3 s. The latency wall is steep as RPS → 1/C_{2,D} = 14.3, exactly as Little's Law predicts.
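
Replaying that arithmetic (the coefficients are the invented ones above, not measurements):

```python
def dense_latency(rps, c1=1.2, c2=0.07):
    # Eq. (1) with the invented Sec. 9.1 values: C1 = 1.2 s, C2 = 0.07 s/req
    return c1 / (1.0 - rps * c2)

for rps in (10, 12, 13):
    print(rps, round(dense_latency(rps), 1))  # → 4.0, 7.5, 13.3 seconds
```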

9.2 SD configuration A: aggressive draft

An EAGLE-3 drafter at k = 6 with measured α = 0.75 gives E = (1 - 0.75^7)/(1 - 0.75) = (1 - 0.1335)/0.25 = 3.466 tokens per cycle. The verifier therefore runs g/E = 1024/3.466 ≈ 295 cycles per request instead of 1024. Suppose the speculation-aware fit returns:

  • C_{1,SD,A} = 0.78 s, C_{2,SD,A} = 0.085 s/req.

Then C_{1,R} = 0.65 and C_{2,R} = 1.21. Eq. (2) becomes:

  • Speedup(0) = 1/0.65 = 1.54. Synchronous: SD beats dense by 54%.
  • Speedup(RPS = 8) = (1/0.65)(1 + (1 - 1.21) × (8 × 0.07)/(1 - 8 × 0.07)) = 1.54 × (1 - 0.21 × 1.273) = 1.54 × 0.733 = 1.13. At RPS 8 the speedup has eroded from 1.54 to 1.13.
  • Break-even: r^* = (C_{1,R} - 1)/(C_{1,R} - C_{2,R}) = (-0.35)/(-0.56) = 0.625, i.e., RPS^* = 0.625/0.07 = 8.9.

Above RPS 8.9, configuration A is slower than dense decoding. If the deployment routinely sees RPS in the 9–12 range, this configuration is actively harmful at peak.

9.3 SD configuration B: conservative draft

Reduce to k = 2 at the same α = 0.75: E = (1 - 0.75^3)/0.25 = 0.578/0.25 = 2.31. The verifier now does 1024/2.31 ≈ 443 cycles. The smaller draft length raises g/E but reduces the drafter contribution per cycle. Suppose:

  • C_{1,SD,B} = 0.92 s, C_{2,SD,B} = 0.072 s/req.

Then C_{1,R} = 0.77, C_{2,R} = 1.03.

  • Speedup(0) = 1.30. Lower than A.
  • Speedup(RPS = 8) = (1/0.77)(1 + (1 - 1.03) × 1.273) = 1.30 × 0.962 = 1.25.
  • Break-even: r^* = (0.77 - 1)/(0.77 - 1.03) = 0.23/0.26 = 0.88, so RPS^* = 12.6.

Configuration B has a lower synchronous ceiling but a much higher break-even. The crossover between A and B happens where their speedups are equal. Setting the two expressions equal and solving for x = r/(1 - r) gives x_cross ≈ (1.54 - 1.30)/((1.54 × 0.21) - (1.30 × 0.03)) = 0.24/0.284 = 0.84; converting back, r_cross = 0.84/1.84 ≈ 0.46, i.e., RPS_cross ≈ 0.46/0.07 ≈ 6.5. So the policy that falls out is: use A below roughly RPS 6.5, switch to B from there up to its own break-even at 12.6, and fall back to dense (or reduce k further) at peak. A more refined search would tune k continuously along the load axis.
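
Replaying both configurations (all values invented in this section) shows the crossover directly:

```python
def speedup(c1_r, c2_r, rps, c2_d=0.07):
    r = rps * c2_d                 # normalized load of the dense baseline
    return (1.0 / c1_r) * (1.0 + (1.0 - c2_r) * r / (1.0 - r))

for rps in (0, 4, 8, 10):
    a = speedup(0.65, 1.21, rps)   # configuration A: aggressive draft, k = 6
    b = speedup(0.77, 1.03, rps)   # configuration B: conservative draft, k = 2
    print(rps, round(a, 2), round(b, 2))
```

A leads at low RPS, B takes over in the mid-single digits, and A drops below 1 well before B does, which is exactly the fixed-k penalty this section describes.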

9.4 The deployment lesson

The same drafter, same verifier, same α, two values of k, two completely different speedup curves. A controller that picks k dynamically from (α, RPS) can keep speedup above 1 across the whole operating range, while a single fixed-k deployment would either underperform at low load (B) or actively lose at high load (A). This is the actionable consequence of the paper.


10. Comparison with prior LLM serving and SD analyses

It is worth situating this work against three nearby threads.

TurboSpec / SD goodput optimization (Liu et al., 2024): Goodput-style work treats SD configuration as an empirical search problem. Given a workload trace, find $k$ and the drafter that maximize tokens/sec under an SLO. This is operationally useful but provides no insight into why one configuration beats another. The Kong et al. model is complementary: it can be used as a fast inner loop for goodput search because it eliminates the need to actually measure every $(k, \alpha)$ combination on the deployment cluster.
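
To make the inner-loop idea concrete, a hedged sketch: rank candidate configurations by the RPS they can sustain under a latency SLO by inverting the closed form $L = C_1/(1 - \text{RPS} \cdot C_2)$, instead of benchmarking each one. All cost constants are invented for illustration:

```python
# Analytic inner loop for a goodput-style search over SD configurations.
def max_rps_under_slo(c1, c2, slo_s):
    # L <= SLO  <=>  RPS <= (1 - C1/SLO) / C2   (pre-saturation regime only)
    return max(0.0, (1 - c1 / slo_s) / c2)

# Candidate configs: k -> fitted (C1, C2); key 0 is the dense baseline.
configs = {0: (1.20, 0.070), 2: (0.92, 0.072), 5: (0.78, 0.085)}
slo = 2.0   # seconds
ranked = sorted(configs, key=lambda k: -max_rps_under_slo(*configs[k], slo))
print(ranked)   # -> [2, 5, 0]: k=2 sustains the most load under this SLO
```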

Sarathi-Serve / chunked prefill scheduling (Agrawal et al., 2024): Sarathi-Serve is a scheduling contribution: it changes how prefill and decode are interleaved. The Kong et al. model is descriptive and treats scheduling as a black box that produces an effective $C_2$. The two are orthogonal but interact: a better scheduler changes $C_2$, which changes the break-even RPS for SD. A natural follow-on would be to fit the model on both vanilla vLLM and Sarathi-Serve and report the delta in $C_2$.

DistServe / Splitwise (disaggregated serving): Disaggregated prefill/decode changes the cost structure substantially: prefill on one pool of GPUs, decode on another. Eq. (1) still applies per pool, but $C_1$ and $C_2$ would be measured at the decode-pool level. SD lives in the decode pool, so the speedup analysis transfers directly, but with new numerical values. This is one of the cleanest extensions left on the table.

Roofline / arithmetic intensity literature: The Kong et al. decomposition is a natural lift of the roofline model from kernel-level to request-level granularity. Roofline says "fixed cost + per-byte cost"; Kong et al. says "fixed cost + per-concurrent-request cost." The aesthetic is consistent: keep the model linear in the cost decomposition, then let Little's Law translate concurrency to load.


11. What I would write as the next paper

If I were following this up, here are three directions ranked by what I think is most valuable.

  1. Closed-loop $k$ controller with stability guarantee. Estimate $(\alpha, \text{RPS})$ from a sliding window, pick $k^*$ from a lookup table generated by the speculation-aware fit, and prove (or empirically demonstrate) that the controller does not chatter. Switching costs should be modeled, since changing $k$ changes kernel launches and KV-cache allocation. This is the single most useful follow-up for operators.

  2. Tail-latency extension. Apply the model under M/G/1-with-batch queueing assumptions and predict p95/p99 from $(C_1, C_2, \alpha, k)$. Even a coarse upper bound would be more useful than the current "p99 sort of works" appendix note.

  3. MoE expert-parallel correction. The current $\varphi(T)$ treats expert routing as i.i.d. uniform. Expert-parallel deployments add cross-node communication whose cost depends on which experts get hit by which requests. A version of $\varphi$ that accounts for non-uniform expert hot-spotting would generalize the MoE-aware fit to multi-node serving.
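
Direction 1 is concrete enough to sketch. A toy hysteresis controller, where switching only on a modeled speedup margin stands in for a real switching-cost model; the cost table and $C_{2,D} = 0.07$ are illustrative, not fitted:

```python
# Toy k-controller with hysteresis to avoid chatter.
class KController:
    def __init__(self, table, c2_d=0.07, margin=0.05):
        # k=0 is dense: (C1_R, C2_R) = (1, 1) gives speedup 1 at all loads.
        self.table = {0: (1.0, 1.0), **table}
        self.c2_d, self.margin, self.k = c2_d, margin, 0

    def speedup(self, k, rps):
        c1r, c2r = self.table[k]
        r = rps * self.c2_d
        return (1 / c1r) * (1 + (1 - c2r) * r / (1 - r))

    def step(self, rps):
        # Called once per sliding window; switch only on a clear win.
        cand = max(self.table, key=lambda k: self.speedup(k, rps))
        if self.speedup(cand, rps) > self.speedup(self.k, rps) + self.margin:
            self.k = cand
        return self.k

ctl = KController({2: (0.77, 1.03), 5: (0.65, 1.21)})
print([ctl.step(rps) for rps in (2.0, 2.1, 10.0, 13.0)])   # -> [5, 5, 2, 0]
```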


11.5 Misconceptions the paper corrected for me

A handful of intuitions I held going in turned out to be wrong, listed here for engineers reading fast.

Misconception 1: "Batch-size-one SD speedups extrapolate to production." This is the paper's most direct target. Zero-load speedup $1/C_{1,R}$ is just the $r = 0$ intercept of a hyperbola; what determines the production experience is whether $C_{2,R}$ falls below 1, and by how much.

Misconception 2: "Larger draft length $k$ is always better." At batch size one, larger $k$ amortizes verifier overhead; this is one of the central claims of the original Leviathan paper. But Figure 2(b) shows that the $k$ minimizing $C_{2,R}$ is far smaller than the $k$ minimizing $C_{1,R}$. In production, where the batch is set by load, the choice of $k$ has to balance the two.
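
A toy model makes the split visible. The per-drafted-token cost coefficients below (0.05 for the fixed cost, 0.25 for the marginal cost) are invented; only the qualitative shape mirrors the Figure 2(b) claim:

```python
# Toy illustration: argmin_k of the fixed-cost ratio vs the marginal ratio.
alpha = 0.75
E = lambda k: (1 - alpha ** (k + 1)) / (1 - alpha)   # accepted tokens per cycle
c1_r = lambda k: (1 + 0.05 * k) / E(k)   # fixed-cost ratio: amortizes well
c2_r = lambda k: (1 + 0.25 * k) / E(k)   # marginal ratio: grows with k

ks = range(1, 9)
k_c1 = min(ks, key=c1_r)
k_c2 = min(ks, key=c2_r)
print(k_c1, k_c2)   # the load-optimal k is much smaller than the C1-optimal k
```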

Misconception 3: "EAGLE-3 solved the load problem for SD." EAGLE-3 gives a stronger drafter with higher $\alpha$ and smaller $c_{1,d}, c_{2,d}$, but it does not change the basic fact that $C_{2,R}$ must drop below 1 for the speedup to grow with load. In the paper's experiments $C_{2,R} > 1$ is still the common case, EAGLE-3 or not.

Misconception 4: "MoE models don't benefit from SD." At low load MoE benefits from sparse activation, and SD layers on top: the two speedups compose. The caveat is that $\varphi(T)$ rebounds as $k$ grows, since longer verification rounds cover more experts and erode the sparse-activation savings. The takeaway again is "pick small $k$ at high load."
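
The rebound is easy to see numerically. The routing numbers here ($m = 8$ active experts per token out of $M = 256$, a batch of 16 requests) are my illustrative assumptions, not the paper's setup:

```python
# Expert coverage phi(T) = 1 - (1 - m/M)**T for growing draft length k.
m, M, batch = 8, 256, 16
phi = lambda T: 1 - (1 - m / M) ** T
for k in (1, 2, 4, 8):
    T = batch * (k + 1)    # tokens entering one verification pass
    print(k, round(phi(T), 2))   # coverage climbs toward 1 as k grows
```

By $k = 8$ nearly every expert is touched each pass, so the sparse-activation saving is gone.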

Misconception 5: "Saturation latency can be papered over by adding GPUs." Adding GPUs typically reduces both $C_{1,D}$ and $C_{2,D}$, with the fixed cost dropping faster. The curve shifts down and the saturation threshold moves right, but a configuration with $C_{2,R} > 1$ does not become $C_{2,R} < 1$ just because you doubled the hardware. Buying GPUs cannot rescue a bad SD configuration.
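
A two-line sanity check, under the simplifying assumption that hardware scales both marginal costs by the same factor (illustrative numbers):

```python
# Scaling hardware moves the saturation point but cannot flip C2_R across 1.
c2_sd, c2_d = 0.085, 0.070            # an SD config that loses with load
for hw in (1, 2):                     # 1x and 2x GPUs (assumed linear scaling)
    sd, d = c2_sd / hw, c2_d / hw
    print(hw, round(sd / d, 3), round(1 / d, 1))   # C2_R, dense saturation RPS
```

The saturation RPS doubles; $C_{2,R}$ does not move.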


12. Personal verdict

I will keep the speedup formula on a sticky note. It is the one piece of analysis I was missing every time I tuned an SD configuration in vLLM, and it explains things that operators had been complaining about for a year without a satisfying answer. The paper is short, narrowly scoped, and useful. I would rather have ten of these than one more 80-page systems benchmark.


End of review.