1. Why This Paper Matters
If I had to explain this paper in one sentence:
DistServe improves LLM serving under latency SLOs by separating prefill and decoding onto different GPU pools, then jointly optimizing their resources and placement to maximize goodput per GPU.
This sounds simple, but it addresses one of the deepest frustrations in real-world LLM systems engineering:
- product wants both fast first response and smooth generation,
- infrastructure wants high utilization and low cost,
- and existing colocated serving designs often force a painful compromise.
The paper makes this problem concrete, quantifies why the compromise happens, proposes a practical architecture, and validates it with strong end-to-end gains:
- up to 7.4× higher request rate at target SLO attainment,
- or 12.6× tighter SLOs at fixed rate,
- while keeping latency constraints satisfied for >90% of requests.
That combination (clear diagnosis + design + measurable gains + deployment details) is why this is a serious ML systems paper.
2. Prerequisites: What you need to know first
I will treat this section as if the reader is very new to LLM serving.
2.1 What is LLM serving?
An LLM service is a system where users send prompts and receive generated text. Under the hood, a serving system must:
- receive requests,
- schedule and batch them,
- run model inference on GPUs,
- stream outputs back,
- do all this cheaply and reliably.
So this is not just "run model.forward()." It is queueing + scheduling + GPU memory management + networking + latency control.
2.2 Prefill vs decoding, in plain language
Autoregressive generation has two phases:
- Prefill: process the full prompt and produce the first output token.
- Decoding: generate the remaining output tokens one by one.
A kitchen analogy:
- prefill = all prep work before first dish appears,
- decoding = serving dishes one by one after that.
They are related but have very different computational characteristics.
2.3 TTFT, TPOT, SLO, and SLO attainment
DistServe centers on two latency metrics:
- TTFT (Time To First Token): user waits this long before seeing the first generated token.
- TPOT (Time Per Output Token): average time per subsequent token.
Different apps care differently:
- chatbot: TTFT is critical for responsiveness,
- long summarization: TPOT matters more once generation starts.
An SLO is a target limit, e.g. TTFT ≤ 0.25s and TPOT ≤ 0.1s.
SLO attainment is the percentage of requests that satisfy these limits.
The paper often uses a 90% attainment target: "at least 90% of requests meet both TTFT and TPOT requirements."
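These definitions are straightforward to operationalize. A minimal sketch (metric names and sample values are my own, not from the paper) that computes SLO attainment from per-request TTFT/TPOT measurements:

```python
def slo_attainment(requests, ttft_slo=0.25, tpot_slo=0.1):
    """Fraction of requests meeting BOTH the TTFT and TPOT limits."""
    met = sum(1 for ttft, tpot in requests if ttft <= ttft_slo and tpot <= tpot_slo)
    return met / len(requests)

# Three requests: (TTFT seconds, average TPOT seconds).
reqs = [(0.20, 0.08), (0.30, 0.09), (0.24, 0.12)]
print(slo_attainment(reqs))  # only the first request meets both limits
```

Note that a request must satisfy both limits at once; a fast first token does not compensate for slow subsequent tokens.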
2.4 Throughput vs goodput (and why they are not the same)
- Throughput: total tokens or requests processed per second.
- Goodput (in this paper): the maximum request rate per provisioned GPU at which the SLO attainment target is still met.
You can have high throughput but bad user experience if many requests violate latency SLOs.
DistServe optimizes goodput, not raw throughput.
This distinction is essential. In production, users care about latency compliance, not just aggregate token count.
2.5 Compute-bound vs memory-bound on GPUs
A GPU kernel can be bottlenecked by:
- compute (ALU/Tensor Core arithmetic), or
- memory bandwidth (moving data).
In LLM inference:
- prefill on long inputs often becomes compute-heavy,
- decoding one token at a time often becomes memory-bandwidth-heavy because weights/KV accesses dominate.
This mismatch is a central reason prefill and decoding interfere when colocated.
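A back-of-the-envelope way to see the asymmetry: for a weight-matrix multiply, arithmetic intensity (FLOPs per byte of weights read) grows with the number of tokens processed per step. The sketch below is illustrative only; it ignores activations and KV traffic and assumes each FP16 weight is read once:

```python
def arithmetic_intensity(tokens_per_step, d_model):
    """FLOPs per weight byte for a d×d GEMM over `tokens_per_step` rows (FP16 weights)."""
    flops = 2 * tokens_per_step * d_model * d_model   # one multiply-add per weight per token
    weight_bytes = 2 * d_model * d_model              # FP16: 2 bytes per weight
    return flops / weight_bytes

d = 9216  # OPT-66B hidden size, used here only as a concrete shape
print(arithmetic_intensity(512, d))  # prefill of a 512-token prompt: 512 FLOPs/byte
print(arithmetic_intensity(1, d))    # decode step for one request: 1 FLOP/byte
```

With only ~1 FLOP per byte, a lone decode step cannot keep Tensor Cores busy, which is exactly why decode leans on batching while prefill saturates compute on its own.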
2.6 Batching, queues, and why waiting time explodes
Serving systems batch requests to raise utilization.
But when short and long jobs share resources, queueing delay can explode for latency-sensitive jobs.
DistServe uses queueing-theoretic framing (M/D/1-style analysis for simplified cases) to reason about latency behavior and parallelism tradeoffs.
Even if average execution is okay, tail latency (P90/P99) can fail SLOs quickly when interference and queueing interact.
2.7 Intra-op vs inter-op parallelism
The paper uses two classic model-parallel dimensions:
- Intra-operator parallelism (tensor parallel style): split heavy operations across GPUs. Good for reducing single-step execution time, but adds communication overhead.
- Inter-operator parallelism (pipeline style): split layers into stages across GPUs. Good for throughput scaling, but introduces pipeline coordination and potential bubbles.
DistServe’s key point: prefill and decoding should be allowed to choose different parallelism strategies.
2.8 Why network topology changes system design
Disaggregation requires transmitting intermediate state (primarily KV cache) from prefill to decoding workers.
So network bandwidth matters:
- inside node (NVLINK): very high,
- across nodes (e.g., 25 Gbps in their cluster): much lower.
DistServe includes placement logic that adapts to this reality, especially in the low node-affinity case.
3. The Real Systems Problem DistServe Solves
Existing systems like vLLM and DeepSpeed-MII largely colocate prefill and decoding on the same GPU pool.
The paper identifies two root problems:
- Prefill-decoding interference
- Long prefill work delays decoding, increasing TPOT.
- Decoding work also adds overhead for prefill, increasing TTFT.
- Resource/parallelism coupling
- If colocated, both phases are forced into shared resource and parallelism plans.
- But prefill and decoding have different needs.
Figure 1 gives an intuitive comparison on a 13B model with a synthetic workload:
- existing colocated systems at 90% SLO attainment sustain roughly 1.6 rps per GPU,
- a prefill-only instance could sustain about 5.6 rps,
- a decode-only instance about 10 rps,
- so an ideal disaggregated provisioning (2 prefill GPUs + 1 decode GPU) implies much higher per-GPU efficiency than colocation.
The paper uses this not as final proof, but as a motivating systems observation.
4. Core Insight: Prefill and Decoding Should Not Be Forced to Live Together
DistServe’s core design choice is straightforward:
Disaggregate prefill and decoding onto different GPU instances.
Then optimize each phase independently for its own latency objective:
- prefill side focuses on TTFT,
- decoding side focuses on TPOT.
This yields two immediate wins:
- eliminate direct prefill-decoding interference,
- uncouple resource allocation and parallelism choices.
The concern is communication overhead from KV transfer.
The paper argues (and later validates) that with appropriate placement (especially stage-aware and node-aware), overhead is small relative to latency gains.
5. DistServe Architecture and End-to-End Workflow
From the paper’s runtime architecture (Figure 6):
- requests enter a centralized controller,
- controller dispatches to prefill instances (shortest queue),
- prefill computes first token + KV cache,
- decoding instances fetch KV cache and continue autoregressive generation,
- results stream back.
Notable design details:
- DistServe is an orchestration layer over an inference engine.
- It supports model parallel execution engines and distributed KV management.
- It uses a pull-based KV transfer strategy to handle burstiness and avoid decode-side memory overload.
This is important: DistServe is not just an offline planner; it includes runtime control policies.
6. Tradeoff Analysis in the Paper
6.1 Prefill instance analysis
6.1.1 Batching behavior
Figure 3(a) shows that for sufficiently long prompts, prefill can already saturate GPU compute with very small batch sizes.
So blindly increasing batch size no longer improves efficiency, and may only increase per-request latency.
The paper defines a practical threshold length L_m:
- above L_m, prefill is compute-bound enough that additional batching can hurt TTFT.
This gives a concrete scheduling guideline for prefill.
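One way to turn this guideline into code is a token-budget admission rule for prefill batches. The threshold value and the queue interface below are my assumptions for illustration, not the paper's exact implementation:

```python
def form_prefill_batch(queue, l_m=2048):
    """Greedily admit waiting prompts until total tokens would exceed the threshold L_m."""
    batch, total = [], 0
    while queue and total + queue[0] <= l_m:
        total += queue[0]          # track cumulative prompt tokens in this batch
        batch.append(queue.pop(0))
    return batch

# Prompts of 900, 800, 700 tokens with L_m = 2048: the third prompt waits.
print(form_prefill_batch([900, 800, 700]))  # [900, 800]
```

The point of the rule: once the batch is compute-bound, an extra prompt only stretches everyone's TTFT without improving GPU efficiency.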
6.1.2 Parallelism and queueing equations
For a simplified uniform case, they model prefill with queueing equations.
Baseline single-device average TTFT (M/D/1 style):

Avg_TTFT = D + R·D² / (2(1 − R·D))

where:
- D: per-request deterministic execution time,
- R: request arrival rate (stability requires R·D < 1).
Then they compare inter-op and intra-op two-way settings (Eq. 2 / Eq. 3 in paper), showing a practical pattern:
- at low arrival rate, intra-op can win (better execution-time reduction),
- at higher rate, inter-op can win (better queueing behavior and capacity scaling).
Figure 4 visualizes this crossover on 66B experiments.
The speedup coefficient K, which discounts intra-op gains for communication overhead, is explicitly modeled; a lower K weakens the intra-op benefit.
This analysis is valuable because it explains why there is no single globally best parallelism setting.
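The qualitative crossover can be reproduced directly from the M/D/1 formula. Here K is the intra-op speedup on 2 GPUs (execution time D/K), and 2-way inter-op is approximated as a pipeline whose bottleneck stage serves at D/2 while end-to-end execution stays ~D; this is a simplification in the spirit of the paper's Eq. 2/3, not the exact expressions:

```python
def mdl_avg_ttft(d, r):
    """M/D/1 average latency: execution D plus queueing delay R·D²/(2(1−R·D))."""
    assert r * d < 1, "unstable queue"
    return d + r * d * d / (2 * (1 - r * d))

def intra_op_ttft(d, r, k):
    """2-way intra-op: execution shrinks to D/K (K < 2 due to communication)."""
    return mdl_avg_ttft(d / k, r)

def inter_op_ttft(d, r):
    """2-stage pipeline: execution still ~D, but each stage serves at D/2."""
    d_stage = d / 2
    return d + r * d_stage * d_stage / (2 * (1 - r * d_stage))

d, k = 1.0, 1.6
print(intra_op_ttft(d, 0.2, k) < inter_op_ttft(d, 0.2))  # True: at low rate, intra-op wins
print(intra_op_ttft(d, 1.4, k) > inter_op_ttft(d, 1.4))  # True: near saturation, inter-op wins
```

Intra-op cuts the execution term but its capacity tops out at K/D; inter-op keeps a longer execution term but doubles service capacity, so its queueing term explodes later.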
6.2 Decoding instance analysis
Decoding is different:
- single-step decode is bandwidth-bound,
- batching is critical for utilization.
Figure 3(b) and Figure 5 show:
- larger decode batches improve throughput,
- intra-op reduces latency but with diminishing returns,
- inter-op can scale throughput close to linearly in suitable regimes.
So for strict TPOT constraints, intra-op may be needed first; beyond that, inter-op/replication improve capacity.
The paper also notes memory pressure from KV cache residency for many active decode requests.
6.3 Practical deployment concerns
Variable input lengths and pipeline bubbles
Real workloads are not uniform; mixed prompt lengths create stage imbalance and bubbles under pipeline parallelism.
DistServe addresses this with workload-aware scheduling heuristics in runtime.
Communication overhead estimate
Paper gives a concrete example:
- OPT-66B, 512-token request KV cache ≈ 1.13 GB,
- at 10 rps, transfer demand ≈ 11.3 GB/s (about 90 Gbps).
This motivates placement awareness; blindly crossing low-bandwidth links could hurt.
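The paper's figures can be reproduced from OPT-66B's shape (64 layers, hidden size 9216, FP16 keys and values per token per layer); the numbers below match the ≈1.13 GB and ≈90 Gbps estimates up to rounding:

```python
def kv_cache_bytes(tokens, layers=64, hidden=9216, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) × layers × hidden × tokens, FP16."""
    return 2 * layers * hidden * tokens * bytes_per_elem

gib = kv_cache_bytes(512) / 2**30
print(f"{gib:.3f} GiB per 512-token request")   # 1.125 GiB ≈ the paper's 1.13 GB

gbps = kv_cache_bytes(512) * 10 * 8 / 1e9       # 10 rps, bits per second
print(f"{gbps:.0f} Gbps sustained at 10 rps")   # ≈ 97 Gbps, near the paper's ~90 Gbps
```

A sustained ~90 Gbps demand comfortably fits within NVLINK but is roughly 4× a 25 Gbps cross-node link, which is the quantitative core of the placement argument.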
Cluster topology constraint
Their testbed has only 25 Gbps cross-node bandwidth, so low node-affinity placement strategy becomes critical.
7. Optimization Method: Placement Algorithms
DistServe’s planning objective:
maximize per-GPU goodput while meeting TTFT/TPOT SLO constraints and SLO attainment target.
7.1 High node-affinity algorithm (Algorithm 1)
Assumption: cross-node transfer is fast enough (e.g., strong InfiniBand), so placement is less constrained.
Method outline:
- enumerate feasible prefill parallel configs,
- simulate and binary-search maximum rate meeting SLO target,
- do same for decoding configs,
- replicate prefill/decoding instances to satisfy target traffic.
Complexity is discussed as manageable: O(NM²), with typically M = 8 GPUs per node.
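A skeletal version of this search, under loud assumptions: `sim_attainment` stands in for the paper's simulator, the config dictionaries are invented, and the toy simulator's linear degradation is purely for demonstration:

```python
def max_goodput(sim_attainment, config, slo_target=0.9, hi=64.0, iters=30):
    """Binary-search the highest request rate at which `config` meets the attainment target."""
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sim_attainment(config, mid) >= slo_target:
            lo = mid      # rate is sustainable; try higher
        else:
            hi = mid      # SLO target violated; back off
    return lo

def best_config(configs, sim_attainment):
    """Pick the parallel config with the highest goodput per GPU."""
    return max(configs, key=lambda c: max_goodput(sim_attainment, c) / c["gpus"])

# Toy stand-in simulator: attainment degrades linearly with load.
toy_sim = lambda c, rate: max(0.0, 1.0 - rate / (c["capacity"] * c["gpus"]))
configs = [{"gpus": 2, "capacity": 4.0}, {"gpus": 4, "capacity": 3.0}]
print(best_config(configs, toy_sim))  # the 2-GPU config wins on per-GPU goodput
```

The same routine runs once for prefill configs and once for decode configs; replication counts then follow from dividing target traffic by each side's goodput.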
7.2 Low node-affinity algorithm (Algorithm 2)
Assumption: cross-node link is relatively weak.
Key idea:
- use inter-op stage segmentation,
- colocate corresponding prefill/decode stages within node,
- force most KV transfer through NVLINK.
This adds constraints but preserves communication efficiency.
The algorithm enumerates intra-node configurations per inter-op degree, simulates goodput under constraints, and chooses best plan.
7.3 Simulator design and why it is needed
Brute-force profiling all placements on real hardware is too expensive.
So DistServe builds an analytical simulator using latency modeling:
- prefill and decode latency terms separated,
- FLOPs and memory-access-inspired modeling,
- coefficients fitted by profiling.
Appendix A provides formulas; for example prefill and decode latency expressions decompose major GEMM and attention costs.
Simulator accuracy is later validated (Table 2): error is within ~2% versus real system in tested settings.
7.4 Online scheduling and runtime controls
DistServe runtime includes practical policies:
- FCFS baseline dispatch,
- batching by token-budget heuristics to reduce bubbles,
- pull-based KV transmission to absorb bursts,
- periodic replanning when workload distribution shifts,
- discussion of future preemption/fault-tolerance integration.
I appreciate this section because it acknowledges real operations rather than only static offline optimization.
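The pull-based transfer policy can be sketched as follows; the class names and memory accounting are illustrative inventions, not DistServe's API, but they show why pulling protects the decode side during bursts:

```python
from collections import deque

class PrefillInstance:
    """Buffers finished KV caches until the decode side is ready to fetch them."""
    def __init__(self):
        self.ready = deque()  # (request_id, kv_bytes) awaiting pull

    def finish(self, req_id, kv_bytes):
        self.ready.append((req_id, kv_bytes))

class DecodeInstance:
    def __init__(self, mem_budget):
        self.free = mem_budget
        self.active = []

    def pull(self, prefill):
        """Pull-based transfer: fetch KV caches only while memory headroom remains."""
        while prefill.ready and prefill.ready[0][1] <= self.free:
            req_id, kv = prefill.ready.popleft()
            self.free -= kv
            self.active.append(req_id)

p = PrefillInstance()
for i, kv in enumerate([4, 3, 5]):  # KV sizes (GB) of three finished prefills
    p.finish(i, kv)
d = DecodeInstance(mem_budget=8)
d.pull(p)
print(d.active, len(p.ready))  # requests 0 and 1 admitted; request 2 waits at prefill
```

Because the decode instance decides when to fetch, a burst of completed prefills backs up at the prefill side (which is about to free that memory anyway) instead of overrunning decode-side KV capacity.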
8. Implementation Details
Implementation scale from the paper:
- placement + API + orchestration: about 6.5K lines Python,
- parallel execution engine: about 8.1K lines C++/CUDA.
The system integrates practical components:
- OpenAI-compatible REST API,
- NCCL for communication,
- async intra-node cudaMemcpy,
- Ray actors as distributed workers,
- optimizations like continuous batching, FlashAttention, PagedAttention.
This is not toy pseudo-code; it is a system prototype with significant engineering depth.
9. Evaluation Setup
9.1 Hardware and cluster
- 4 nodes, 32 GPUs total.
- Each node: 8× NVIDIA A100 80GB (NVLINK intra-node).
- Cross-node bandwidth: 25 Gbps.
9.2 Models
OPT family (FP16):
- OPT-13B,
- OPT-66B,
- OPT-175B.
The authors intentionally pick classic MHA-based OPT to stress KV-transfer overhead more than newer GQA/MQA models.
9.3 Workloads and applications (Table 1)
- Chatbot (ShareGPT):
- 13B SLO: TTFT 0.25s, TPOT 0.1s.
- 66B SLO: TTFT 2.5s, TPOT 0.15s.
- 175B SLO: TTFT 4.0s, TPOT 0.2s.
- Code completion (HumanEval):
- TTFT 0.125s, TPOT 0.2s.
- Summarization (LongBench):
- TTFT 15s, TPOT 0.15s.
9.4 Baselines
- vLLM: strong practical baseline with continuous batching + paged attention.
- DeepSpeed-MII: includes chunked-prefill variant.
Primary metric: SLO attainment curves vs per-GPU rate / SLO scaling, usually focusing on 90% attainment target.
10. Results and What They Mean
10.1 Chatbot workloads
From Figure 8 and discussion:
- DistServe sustains 2.0×–4.6× higher per-GPU request rate than vLLM.
- DistServe sustains 1.6×–7.4× higher rate than DeepSpeed-MII.
- DistServe tolerates 1.8×–3.2× tighter SLO than vLLM.
Why?
- TPOT in colocated systems suffers from long-prefill interference.
- DistServe isolates phases and picks phase-specific parallelism.
- For OPT-175B chatbot case, selected prefill and decode parallelisms differ non-trivially; this is hard to tune manually and validates the search algorithm.
10.2 Code completion
Figure 9(a):
- DistServe: 5.7× higher request rate than vLLM.
- DistServe: 1.6× higher rate than DeepSpeed-MII.
- DistServe: about 1.4× tighter SLO than both baselines.
Interpretation:
- code completion has strict TTFT,
- DistServe’s prefill optimization (including better parallelism and no decode interference) gives large gains.
10.3 Summarization
Figure 9(b):
- DistServe: 4.3× higher rate than vLLM.
- DistServe: 1.8× higher rate than DeepSpeed-MII.
- DistServe: up to 12.6× tighter SLO than vLLM, 2.6× tighter than DeepSpeed-MII.
This is notable because summarization has long inputs and strong TPOT importance.
Chunked-prefill may reduce some interference but cannot eliminate coupling and can incur additional overhead.
10.4 Latency breakdown and KV transfer overhead
Figure 10 shows for OPT-175B ShareGPT:
- transmission accounts for <0.1% of total latency,
- 95% of requests have transmission delay under ~30 ms.
Given their 25 Gbps cross-node link, this is strong evidence that placement-aware design made disaggregation practical in their environment.
10.5 Ablation and simulator accuracy
Table 2:
- simulator vs real SLO attainment error is within about 2%.
Figure 11:
- vLLM++ (better parallelism search but still colocated) shows limited gain,
- DistServe variants significantly outperform, proving disaggregation is the main lever,
- high-node-affinity case can further improve over low-node-affinity constrained case.
10.6 Runtime cost of planning algorithm
Figure 12:
- planning runtime scales with GPU search space,
- still in practical "seconds to low minutes" range,
- acceptable for periodic replanning/redeployment cycles.
The paper reports that even the largest setting finishes the key search routine in about 1.3 minutes.
11. Why DistServe Is Strong (My Technical Judgment)
I think DistServe is strong for five reasons.
11.1 It optimizes the right objective
Many systems papers optimize throughput and then hand-wave latency. DistServe directly optimizes goodput under SLO attainment constraints, which matches real service cost/quality goals.
11.2 It identifies and isolates a real architectural conflict
Prefill and decoding are computationally asymmetric. Forcing them into one colocated scheduling space creates avoidable contention and policy coupling.
DistServe removes the coupling instead of trying to endlessly patch scheduling symptoms.
11.3 It combines analytical reasoning and empirical validation
The queueing analysis is not just decorative; it informs algorithmic design choices and aligns with measured crossovers.
11.4 It is deployment-aware
Network topology, stage co-placement, pull-based KV transfer, replanning, and algorithm runtime are all considered.
11.5 It reports gains across multiple applications and model sizes
Not just one cherry-picked setting.
12. Limitations and Boundary Conditions
No systems architecture is universal. DistServe itself discusses where it may be less ideal.
12.1 Throughput-first, latency-insensitive scenarios
If users do not care much about TTFT/TPOT tails, then maximizing raw throughput in colocated systems may be simpler and sufficient.
12.2 Small resource-constrained deployments
With only 1–2 GPUs, design space for disaggregation is tiny; orchestration complexity may dominate gains.
12.3 Dependence on workload predictability
Placement decisions use workload distributions estimated from traces. Sudden distribution shifts can reduce optimality until replanning catches up.
12.4 Runtime policy simplicity
Current FCFS can still show convoy effects in some regimes. Preemption/fault-tolerance integration is future work.
12.5 Communication assumptions are environment-dependent
Their low-node-affinity algorithm mitigates weak cross-node links, but extremely poor or unstable network fabrics can still constrain disaggregation benefits.
13. Reproducibility and Practical Deployment Checklist
If I had to implement DistServe-style serving in production, I would follow this checklist.
13.1 Measurement and profiling
- Profile TTFT/TPOT under existing colocated stack.
- Measure phase-level utilization and queueing delay.
- Measure KV transfer size distribution and network bandwidth headroom.
13.2 Modeling and planning
- Fit a latency simulator for your model family/hardware.
- Validate simulator error against real runs.
- Search parallelism + placement for prefill/decode independently.
- Solve for required replication to hit traffic target.
13.3 Runtime policies
- Implement robust request routing.
- Use pull-based KV transfer in bursty traffic.
- Add periodic replanning triggers on workload shift.
- Monitor SLO attainment continuously, not just average latency.
13.4 Safety guardrails
- Capacity buffer for sudden bursts.
- Backpressure and queue protection.
- Fallback path for prefill/decode instance faults.
- Canary rollout for new placement plans.
13.5 Reporting metrics that actually matter
I recommend tracking all of these per model and application:
- P50/P90/P99 TTFT,
- P50/P90/P99 TPOT,
- SLO attainment (%),
- per-GPU goodput,
- queue lengths and wait times by phase,
- KV transfer latency and drop/retry events.
14. If You Are Building an LLM Serving Stack Tomorrow
Here is my practical guidance for engineers.
14.1 When to strongly consider DistServe-like disaggregation
- your product has explicit TTFT and TPOT SLOs,
- you observe prefill/decode contention in traces,
- your GPU spend is high and you need better per-GPU efficiency,
- your network topology supports at least good intra-node bandwidth.
14.2 When not to start with DistServe
- tiny deployment with very few GPUs,
- purely offline or latency-insensitive workloads,
- rapidly changing workload where planning can’t stabilize.
14.3 A phased adoption path
- Measure interference in current colocated stack.
- Prototype partial disaggregation for one high-value model.
- Add placement search and simulator incrementally.
- Expand to full multi-application policy once gains are proven.
15. Conclusion
DistServe is a convincing systems contribution because it connects first principles to deployment reality.
Its central claim is simple but powerful:
For latency-SLO-constrained LLM services, prefill-decoding disaggregation plus phase-specific optimization can materially improve goodput and cost-efficiency.
The paper supports this with:
- clear diagnosis of prefill-decoding interference,
- principled analysis for phase-specific parallelism,
- concrete placement algorithms for different cluster topologies,
- implementation details beyond toy architecture,
- strong end-to-end results across applications and model sizes.
My final take:
DistServe should be considered a baseline architecture whenever a production LLM service is constrained by both TTFT and TPOT SLOs and cares about per-GPU economics.
16. References
- Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. OSDI 2024. arXiv:2401.09670.
- Woosuk Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). SOSP 2023.
- Amey Agrawal et al. Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv 2023.
- Bingyang Wu et al. Fast Distributed Inference Serving for Large Language Models (FastServe). arXiv 2023.
- Zhuohan Li et al. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. OSDI 2023.
- Pratyush Patel et al. Splitwise: Efficient Generative LLM Inference using Phase Splitting. arXiv 2023.
Review written on 2026-04-09.