
Queueing Stability for LLM Inference with KV Cache Memory Constraints

Review date: 2026-05-09
Review author: Zhongzhu Zhou
Paper reviewed: A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
Paper authors: Chengyi Nie, Nian Si, Zijie Zhou
arXiv: 2605.04595v1, 2026-05-06
Venue/status: Accepted at ICML 2026
Source used for this review: src/related-documents/papers/2605.04595-LLMInferenceQueueingStability.pdf


Short answer

This paper studies a question that is quietly becoming one of the most important questions in LLM deployment:

Given a stream of user requests, a GPU memory budget, and a distribution of prompt/output lengths, can we predict whether an LLM inference service will remain stable, or whether its queue will grow without bound?

The paper's answer is a queueing model that treats KV cache memory as a first-class resource. Instead of modeling LLM inference as if compute alone determined throughput, the authors model each request as a growing memory object: prompt chunks gradually fill the KV cache during prefill, generated tokens keep extending the cache during decode, and memory is released only after the request finishes. From this model they derive a compact stability threshold:

\mu = \frac{M}{\bar b\,\mathbb{E}[g(s,o)]},

where M is the effective KV cache capacity measured in tokens, b̄ is a representative batch processing time, and g(s,o) is the lifetime cumulative KV memory demand of a request with prompt length s and output length o. The system is overloaded if the arrival rate λ > μ, and it is stable under a work-conserving batching policy if λ < μ(1 - δ), where δ is a small memory-slack term tied to the largest single request.

The empirical story is useful. On A100 GPUs serving Meta-Llama-3-8B with vLLM v1 and chunked prefill, the predicted service rates are usually within about 10% of measured GPU throughput. That includes synthetic prefill/decode ratios, an 8-GPU data-parallel deployment, LongBench v2 request-length traces, and an appendix stress case with an 8:1 prefill/decode ratio. The paper is therefore less about making one scheduler faster and more about giving operators a principled way to estimate GPU capacity before a service falls over.

My main takeaway: this is a clean bridge between queueing theory and the engineering reality of LLM serving. It will not replace detailed load testing, because the model still depends on measured batch time and workload distributions. But it gives a simple mental model and a practical sizing formula: estimate the incoming request rate, estimate the request-length distribution, measure a representative batch time, and compute how many GPU replicas are needed for a target utilization.


1. Prerequisites

1.1 LLM inference has two phases: prefill and decode

A user request to an autoregressive LLM is not processed as one atomic job. It has two major phases.

The first phase is prefill. The model reads the input prompt and computes attention states for the prompt tokens. If the prompt has s tokens, the system needs to process those prompt tokens before it can emit the first generated token. In real serving engines, long prompts may be split into smaller chunks so that the system can interleave many requests rather than letting one long prompt monopolize a GPU batch.

The second phase is decode. The model generates output tokens one at a time. If the output length is o, the request takes o decode steps after prefill. Decode is sequential: token j+1 depends on token j.

A serving system therefore sees many requests in different states at the same time:

request A: prompt chunk 3/8
request B: decode token 41/200
request C: waiting to start
request D: prompt chunk 1/2
request E: decode token 3/20

The paper models this mixed state explicitly. That matters because the memory used by a request is not constant over its lifetime.

1.2 What the KV cache stores and why it becomes a bottleneck

In a transformer decoder, every generated token attends to previous tokens. Without a cache, the model would repeatedly recompute key and value vectors for all previous tokens at every decode step. The KV cache avoids this waste by storing key/value vectors for tokens already processed.

The cache is great for compute efficiency, but it turns sequence length into memory pressure. A request with a long prompt or a long output keeps more key/value vectors alive. In a serving engine, many requests can be active at once, so the total KV cache footprint can dominate capacity.

For this paper, it is helpful to measure memory in normalized token units. Instead of saying “this request uses N bytes of KV cache,” the model says “this request currently occupies the equivalent of this many tokens in the KV cache.” The GPU has an effective memory budget M in the same units.

A simple picture:

GPU KV cache budget M
|------------------------------------------------------------|
| req 1 prompt+decode | req 2 prompt | req 3 decode | slack |
|------------------------------------------------------------|

If the active requests collectively require more than M, the system has to delay admission, swap, reject, or fail. Stability analysis asks whether, under a stream of arrivals, the backlog stays bounded.

1.3 Continuous batching and chunked prefill

Modern LLM serving systems such as vLLM do not process one request at a time. They use continuous batching: at each step, the engine builds a batch from whatever work is ready. That batch may contain prompt chunks from some requests and one decode token from many other requests.

The paper assumes chunked prefill. A prompt of length s is split into chunks of size ŝ. If s = 2048 and ŝ = 512, the prompt contributes four chunks. Chunking has two benefits:

  • it avoids a single long prompt creating a huge one-step compute spike;
  • it makes prefill work easier to interleave with decode work.

This assumption is not just a modeling convenience. It matches a common production strategy: serve long prompts in bounded pieces so the scheduler can keep the GPU busy while limiting latency damage to other users.
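To make the chunk accounting concrete, here is a minimal Python sketch. The helper name and the short-final-chunk behavior are my assumptions; the paper's model simply splits the prompt into chunks of size ŝ, and real engines such as vLLM add scheduling constraints on top of this token accounting.

import math

def prefill_chunks(s: int, chunk_size: int = 512) -> list[int]:
    # Split an s-token prompt into prefill chunk sizes; the final chunk may
    # be shorter if s is not a multiple of chunk_size.
    n = math.ceil(s / chunk_size)
    return [min(chunk_size, s - i * chunk_size) for i in range(n)]

print(prefill_chunks(2048))  # [512, 512, 512, 512]: four chunks, as in the example above
print(prefill_chunks(1300))  # [512, 512, 276]: last chunk is partial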

1.4 Queue stability: what “stable” means here

A queue is stable when arrivals do not outrun service capacity in the long run. If the service can complete requests at a higher effective rate than users send them, the number of waiting requests fluctuates but stays bounded. If arrivals exceed service capacity, the queue grows roughly linearly and latency eventually becomes unacceptable.

The classic single-server intuition is:

\text{stable if arrival rate} < \text{service rate}.

For LLM inference, the hard part is defining the service rate. A request is not just “one job.” Its cost depends on prompt length, output length, batch timing, and memory occupancy over time. This paper’s contribution is to define a service-rate-like quantity μ that accounts for KV cache memory trajectories.

1.5 Why request length distributions matter

The paper does not only need the average prompt length or average output length. It needs the joint distribution of input and output lengths:

(s,o) \sim F_{\text{in-out}}.

The joint distribution matters because prompt length and output length can be correlated. A long-context question may also require a long answer, or it may require a short answer after a long prompt. These cases stress the KV cache differently.

A capacity planner should therefore avoid thinking only in terms of “average tokens per request.” Two workloads with the same average token count can have different tails, correlations, and memory trajectories.


2. What this paper does

The paper introduces a queueing-theoretic framework for LLM inference under KV cache memory constraints. At a high level, it does four things.

First, it models an LLM serving worker as a queueing system with a GPU memory budget M. Requests arrive randomly. Each request has a prompt length s and an output length o. The request first consumes memory through chunked prefill and then continues consuming memory during decode.

Second, it compresses the lifetime memory trajectory of a request into a scalar function g(s,o). This is not the peak memory. It is the cumulative memory demand over the request’s lifetime, measured as the area under its KV-cache-occupancy curve.

Third, it derives stability and instability conditions. The central threshold is:

\mu = \frac{M}{\bar b\,\mathbb{E}_{s,o}[g(s,o)]}.

The interpretation is straightforward:

  • the GPU can erase at most about M units of memory-work per batch step;
  • each batch step takes about b̄ seconds;
  • each arriving request adds expected memory-work E[g(s,o)];
  • therefore the maximum sustainable request completion rate is roughly M / (b̄ E[g]).

Fourth, it validates the formula on real GPU experiments. The experiments use Meta-Llama-3-8B, vLLM v1, A100 GPUs, chunked prefill, several synthetic workload distributions, LongBench v2, and both single-GPU and 8-GPU deployments.

The part I find most useful is that the paper turns a messy serving question into a capacity-planning workflow:

1. Measure the incoming request rate λ.
2. Estimate the joint prompt/output length distribution p(s,o).
3. Measure or estimate effective KV capacity M.
4. Measure representative batch processing time b̄.
5. Compute μ = M / (b̄ E[g(s,o)]).
6. Provision enough replicas so λ is comfortably below the aggregate service rate.

That is much more operational than a purely qualitative statement like “KV cache is important.”


3. Method details

3.1 The state of a request

Each request i has:

  • prompt length s_i;
  • output length o_i;
  • prefill chunk size ŝ;
  • a prefill progress index c_i^(t);
  • a decode progress index d_i^(t).

The paper defines S^(t) as the set of requests that have started processing but are not complete, and W^(t) as the set of arrived requests still waiting to start.

During prefill, the request has processed some number of chunks. Its KV memory grows as more prompt chunks are processed:

\hat{s}, 2\hat{s}, 3\hat{s}, \ldots, s.

During decode, memory grows by one token per generated token:

s+1, s+2, \ldots, s+o.

So the same request can look cheap at the beginning and expensive near the end. That is the key difference from a standard queue where a job has a fixed service requirement.

3.2 The memory constraint

At any time, the total KV memory of active requests must fit within the effective memory budget:

\sum_{i\in S^{(t)}} \left(c_i^{(t)}\hat{s} + d_i^{(t)}\right) \le M.

This formula is a simplified but useful abstraction. It says: count the currently retained prompt-cache tokens plus decode-cache tokens across active requests. The server can admit new prompt chunks only if doing so does not violate the memory budget.
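A minimal sketch of that admission rule, assuming the review's token-unit accounting; the helper names, the tuple encoding of request state, and the example numbers are mine, not the paper's:

def kv_tokens_in_use(active, chunk_size: int = 512) -> int:
    # active: list of (chunks_prefilled, tokens_decoded) per in-flight request.
    # Mirrors the constraint sum_i (c_i * s_hat + d_i) <= M from Section 3.2.
    return sum(c * chunk_size + d for c, d in active)

def can_admit_chunk(active, new_chunk_tokens: int, M: int = 131_000) -> bool:
    # Admit the next prompt chunk only if the KV budget M still has room.
    return kv_tokens_in_use(active) + new_chunk_tokens <= M

active = [(3, 0), (4, 41), (1, 0)]      # hypothetical in-flight requests
print(kv_tokens_in_use(active))          # 4137 tokens currently held
print(can_admit_chunk(active, 512))      # True: plenty of slack left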

The paper also discusses swapping to CPU as a practical escape hatch. In the theoretical model, the authors set M to an effective memory limit, such as the configured vLLM utilization cap, so small fluctuations do not immediately trigger swap behavior. That is important because frequent CPU-GPU KV swapping would change the service-time model.

3.3 Lifetime cumulative memory: the function g(s,o)

The central object is the lifetime cumulative memory usage of one request:

g(s,o) = (\hat{s}+2\hat{s}+\cdots+s) + ((s+1)+(s+2)+\cdots+(s+o)).

The first part is the prefill trajectory. If the prompt has s/ŝ chunks, memory grows chunk by chunk until it reaches s.

The second part is the decode trajectory. Once prefill is done, the request has s prompt tokens cached, then the cache grows by one for each output token.

The closed form is:

g(s,o) = \frac{(1+s/\hat{s})s + 2os + (1+o)o}{2}.

This is the “area” view of KV cache usage. A request that briefly reaches a large memory footprint may be less costly than a request that stays moderately large for many decode steps. The function g captures both height and duration.

A helpful mental picture:

KV memory
^
| decode: grows one token at a time
| /-------------------------------
| /
| / prefill: grows by chunks
| /
| /
+------------------------------------------------------> service steps

Area under this curve = g(s,o)

This is the paper’s most important modeling move. It replaces a complicated evolving memory trajectory with a scalar workload measure that still preserves the stability-relevant quantity.
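A quick way to sanity-check the closed form is to compare it against the step-by-step trajectory, under the model's assumption that the prompt splits into whole chunks of size ŝ. This is a minimal sketch, not the paper's code:

def g_trajectory(s: int, o: int, chunk: int = 512) -> int:
    # Area under the KV-occupancy curve, summed step by step.
    assert s % chunk == 0, "the model assumes whole prefill chunks"
    prefill = sum(k * chunk for k in range(1, s // chunk + 1))   # s_hat, 2*s_hat, ..., s
    decode = sum(s + j for j in range(1, o + 1))                 # s+1, s+2, ..., s+o
    return prefill + decode

def g_closed_form(s: int, o: int, chunk: int = 512) -> float:
    return ((1 + s / chunk) * s + 2 * o * s + (1 + o) * o) / 2

print(g_trajectory(2048, 200), g_closed_form(2048, 200))  # both 434820: the forms agree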

3.4 The service-rate formula

The paper defines:

\mu = \frac{M}{\bar b\,\mathbb{E}_{s,o\sim p(s,o)}[g(s,o)]}.

Here:

  • M is the effective GPU KV cache capacity;
  • b̄ is a representative batch processing time in seconds;
  • E[g(s,o)] is the expected lifetime cumulative memory demand per request.

The formula has a simple unit check.

  • M is memory-tokens per batch step.
  • b̄ is seconds per batch step.
  • M/b̄ is memory-tokens processed per second.
  • Dividing by memory-tokens per request gives requests per second.

So μ is a predicted sustainable request throughput.

This is why the paper can be used for provisioning. If one GPU can sustainably process μ requests per second, then G independent replicas can process roughly Gμ requests per second, assuming the load balancer distributes requests well and the replicas have similar hardware and workloads.
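As a sketch of how the formula is used, the following estimates μ by Monte Carlo. It plugs in the experimental constants reported later in this review (M = 131,000 tokens, ŝ = 512, b̄ = 0.0372 s) and assumes the 1:1 workload draws prompt and output lengths independently and uniformly from (10, 1600); that independence is my simplification of the paper's setup.

import random

def g(s: float, o: float, chunk: int = 512) -> float:
    # Lifetime cumulative KV memory demand, closed form from Section 3.3.
    return ((1 + s / chunk) * s + 2 * o * s + (1 + o) * o) / 2

def service_rate(samples, M: float, b_bar: float) -> float:
    # mu = M / (b_bar * E[g(s, o)]), with E[g] estimated from length samples.
    e_g = sum(g(s, o) for s, o in samples) / len(samples)
    return M / (b_bar * e_g)

random.seed(0)
samples = [(random.uniform(10, 1600), random.uniform(10, 1600)) for _ in range(200_000)]
mu = service_rate(samples, M=131_000, b_bar=0.0372)
print(f"predicted mu ~ {mu:.2f} requests/sec")   # lands near 3.26 under these assumptions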

3.5 Instability when λ > μ

Theorem 4.1 says that if the arrival rate λ is greater than μ, then the total number of requests in the system grows to infinity under any scheduling policy.

The proof intuition is a conservation argument.

Incoming requests add memory-work at rate:

\lambda\,\mathbb{E}[g(s,o)].

The GPU can remove memory-work at rate at most:

\frac{M}{\bar b}.

If incoming memory-work is larger than maximum service memory-work, the backlog must grow. No scheduler can beat the resource constraint.

I like this theorem because it separates scheduling from capacity. Better scheduling can improve latency and utilization inside the feasible region, but it cannot make an overloaded system stable if the total memory-work inflow exceeds capacity.

3.6 Stability when λ < μ(1 - δ)

Theorem 4.2 says that under the work-conserving scheduling and batching policy described in the paper, the system is stable if:

\lambda < \mu(1-\delta),

where:

\delta = \frac{\operatorname{ess\,sup}_{s,o}(s+o)}{M}.

The term δ is a slack factor. If the largest single request is tiny compared with GPU memory, then δ is small. In that common large-memory regime, the stable condition is close to λ < μ.

The proof uses a Lyapunov function V(t), the total outstanding future memory demand of all active and waiting requests. The authors show that when the backlog is large, a work-conserving scheduler can keep GPU memory utilization at least M(1 - δ). Each step therefore removes at least that much old memory-work, while new arrivals add expected memory-work λ b̄ E[g(s,o)].

If the removal term is larger than the arrival term, the expected drift of V(t) is negative outside a bounded region. By the Foster-Lyapunov criterion, the Markov chain is positive recurrent, which is the formal queueing-theory notion of stability used in the paper.

In plainer terms:

When the queue is large:
the GPU should almost never be idle;
each batch erases about M units of outstanding memory-work;
arrivals add new memory-work;
if erasing is faster than adding, the queue eventually comes back down.
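A deterministic fluid cartoon of that drift argument, purely illustrative: it ignores stochastic arrivals, the δ slack, and scheduling details, and it reuses the single-GPU constants from this review rather than anything taken directly from the paper's proof.

def backlog_trace(lam: float, mu: float, M: float = 131_000.0,
                  b_bar: float = 0.0372, steps: int = 20_000) -> list[float]:
    # V is outstanding memory-work (token-steps). Each batch step adds
    # lam * b_bar * E[g] of new work; a work-conserving batch erases up to M.
    mean_g = M / (b_bar * mu)        # consistent with mu = M / (b_bar * E[g])
    V, trace = 0.0, []
    for _ in range(steps):
        V += lam * b_bar * mean_g    # work brought by arrivals during one step
        V = max(0.0, V - M)          # erase up to M units of old work
        trace.append(V)
    return trace

mu = 3.26                            # single-GPU prediction from Section 3.4
for lam in (3.0, 5.0):               # one rate below and one above the threshold
    print(f"lambda={lam}: final backlog ~ {backlog_trace(lam, mu)[-1]:,.0f} token-steps")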

3.7 Why b̄ is the delicate parameter

The formula looks clean, but b̄ is not a universal constant. It depends on model, GPU, kernels, batch composition, prompt/decode ratio, and scheduling implementation.

The paper argues that near capacity, many batches have similar execution time because total KV cache usage is close to M. Figure 1 supports this for the 1:1 prefill/decode workload: the cumulative distribution of batch execution time is tightly concentrated. Appendix Figure 5 extends the picture to 2:1 and 4:1 ratios, showing that the arrival rate has little effect on the batch time distribution, while the prefill/decode ratio affects variability.

This is a reasonable approximation, but it is also the main engineering caveat. If a deployment has highly variable kernels, disaggregated prefill/decode, tensor parallel communication, CPU swapping, or extreme prompt-heavy workloads, b̄ must be estimated carefully. Appendix C addresses this by using trimmed means for an 8:1 prefill/decode ratio.


4. Experiment setup

4.1 Hardware and serving stack

The experiments use:

  • eight NVIDIA A100 GPUs;
  • one GPU per replica;
  • no tensor parallelism and no pipeline parallelism;
  • Meta-Llama-3-8B as the served model;
  • vLLM v1 as the inference engine;
  • chunked prefill enabled;
  • prefill chunk size ŝ = 512;
  • effective memory parameter M = 131,000 tokens.

The single-GPU experiments use one replica. The multi-GPU experiments use eight independent replicas with stateless round-robin load balancing. This is important: the 8-GPU experiment is data parallel serving, not model parallel serving.

4.2 Synthetic prefill/decode distributions

The paper evaluates three main synthetic request distributions:

Regime    Prefill length s     Decode length o      Intuition
P/D 1:1   Uniform(10, 1600)    Uniform(10, 1600)    balanced prompt and output
P/D 2:1   Uniform(10, 2133)    Uniform(10, 1066)    prompt-heavy
P/D 1:2   Uniform(10, 1066)    Uniform(10, 2133)    decode-heavy

The representative batch times are estimated from simulated jobs and real traces:

Regime    b̄ reported by the paper
P/D 1:1   0.0372 s
P/D 2:1   0.0430 s
P/D 1:2   0.0337 s

This setup lets the paper test whether the formula responds correctly when the same total token scale is distributed differently between prompt and decode.

4.3 Real request-length dataset

The paper also evaluates LongBench v2. This dataset contains 503 complex long-context questions and produces a more realistic, variable, and dependent joint distribution of prompt and output lengths.

Figure 3 shows:

  • the marginal distribution of prefill length s;
  • the marginal distribution of decode length o;
  • the joint cumulative density of (s,o).

The key point is that real workloads are not clean uniform distributions. They have variance and dependency structure. The paper estimates p(s,o) from 80% of the dataset and tests on the remaining 20%.

4.4 Metrics

The main metric is the gap between predicted and measured service rate:

\text{GAP} = \frac{|\mu_{\text{theory}} - \mu_{\text{gpu}}|}{\mu_{\text{gpu}}}.

The empirical service rate μ_gpu is measured after excluding warm-up and termination portions of the trace. This avoids overestimating throughput due to startup or shutdown artifacts.

The paper also plots queue length over time and waiting-time CDFs. Those plots are useful because stability is not just a number: if λ is above the service rate, the queue should visibly grow.


5. Results and analysis

5.1 Single-GPU predicted service rates are close to measured rates

Table 1 reports the main single-GPU result:

P/D ratio          μ_gpu    μ_theory    GAP
1:1                3.387    3.263       3.66%
2:1                3.650    3.956       8.38%
1:2                2.969    2.902       2.25%
Mixed 2:1 → 1:2    3.137    3.385       7.90%

These numbers support the main claim: once M, b̄, and p(s,o) are estimated, the formula predicts real GPU throughput within about 10% across several request distributions.

The mixed trace is especially interesting. The first half follows a 2:1 distribution and the second half follows a 1:2 distribution. The paper predicts long-run service rate by combining distribution-specific rates through a harmonic-like mixture:

\mu_{\text{mix}} = \left(\sum_{\ell}\frac{q_\ell}{\mu_\ell}\right)^{-1}.

The measured gap remains 7.90%. That suggests the framework can handle piecewise-stationary workloads, at least when the distribution shifts are not too wild.

My interpretation: the formula is not simply fitting one workload. It captures a stable structural signal: cumulative KV memory demand explains a large fraction of serving capacity.
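For reference, the mixture formula is easy to evaluate. This sketch uses equal shares of the two single-distribution predictions from Table 1; the paper's exact weights for the mixed trace may be defined differently, so the output is illustrative rather than a reproduction of the Table 1 number.

def mixed_service_rate(rates_and_shares) -> float:
    # mu_mix = ( sum_l q_l / mu_l )^(-1): a harmonic-style mixture.
    return 1.0 / sum(q / mu for mu, q in rates_and_shares)

# Illustrative only: 50/50 shares of the 2:1 and 1:2 predictions from Table 1.
print(mixed_service_rate([(3.956, 0.5), (2.902, 0.5)]))  # ~3.35 requests/sec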

5.2 Queue-length plots match the stability threshold

Figure 2 plots waiting queue length under different arrival rates for the 1:1 P/D setting.

For λ = 1 and λ = 3, the queue remains small and bounded. This matches the measured service rate of roughly 3.387 requests/sec.

For λ = 5, 20, and 50, the queue grows approximately linearly. This is exactly what one expects when the arrival rate exceeds service capacity. The value of the plot is that it validates not only the predicted throughput number but also the qualitative stability/instability behavior.

This distinction matters in production. A system can look fine for a short smoke test even when it is unstable; the queue may not yet have had time to explode. A stability plot over time reveals whether the backlog has a bounded shape or a persistent positive slope.

5.3 LongBench v2 shows why the joint distribution matters

Table 2 reports the LongBench v2 result:

Dataset         μ_gpu    μ_theory    GAP
LongBench v2    0.610    0.561       8.03%

The service rate is much lower than in the synthetic uniform experiments because LongBench v2 contains long-context requests. More importantly, the gap is still only 8.03%.

Figure 3 explains why a distribution-aware formula is needed. The prefill and decode lengths have their own marginals, and their joint distribution is not a simple independent toy model. If a planner only used average prompt length and average output length, it could miss tail behavior and correlation. The paper’s E[g(s,o)] calculation encourages the right kind of workload measurement: collect pairs (s,o), not isolated means.
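A small illustration of that point: two hypothetical workloads with the same mean lengths but different spread produce noticeably different E[g], and therefore different predicted capacity. The distributions below are mine, not from the paper or LongBench v2.

import random

def g(s: float, o: float, chunk: int = 512) -> float:
    return ((1 + s / chunk) * s + 2 * o * s + (1 + o) * o) / 2

random.seed(0)
# Both workloads have mean prompt and output length of 805 tokens.
tight = [(random.uniform(700, 910), random.uniform(700, 910)) for _ in range(100_000)]
spread = [(random.uniform(10, 1600), random.uniform(10, 1600)) for _ in range(100_000)]

for name, wl in (("tight", tight), ("spread", spread)):
    e_g = sum(g(s, o) for s, o in wl) / len(wl)
    print(f"{name}: E[g] ~ {e_g:,.0f} token-steps")   # the spread workload is ~10% larger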

5.4 8-GPU data-parallel deployment scales the prediction

Table 3 reports the 8-GPU result:

P/D ratio               μ_gpu     μ_theory    GAP
1:1, 8 parallel GPUs    26.710    25.808      3.38%

The theoretical value is essentially eight times the single-replica prediction, and the measured result is close. This is encouraging for a common production architecture: many identical replicas behind a load balancer.

Figure 4 repeats the queue-length story at cluster scale. For λ = 8 and λ = 24, the total waiting queue remains bounded. For λ = 40, 160, and 400, it grows linearly. The transition is consistent with an aggregate capacity of about 26-27 requests/sec.

The practical lesson is simple: if replicas are independent and requests are balanced well, the single-GPU formula can be lifted into cluster provisioning by multiplying by the number of replicas. But that conclusion should not be overextended to tensor parallel or pipeline parallel systems; the paper explicitly treats those as architectural extensions requiring different effective parameters or new queueing topologies.

5.5 Appendix figures explain when the constant-batch-time approximation is safe

Figure 5 shows CDFs of batch execution time for P/D ratios 2:1 and 4:1 at different arrival rates. The paper observes that changing λ does not strongly change the batch execution-time distribution, but increasing the prompt-heavy ratio increases variability.

Figure 6 shows waiting-time CDFs. The stable case λ = 2 has almost no waiting. Near the measured critical rate λ = 3.387, the waiting-time behavior becomes unstable in the queueing-theoretic sense. At λ = 4 and λ = 5, overload is clearer.

Appendix C pushes the model to an 8:1 prefill/decode ratio. Figure 7 shows a heavy tail in batch execution time: the first 70% of batches are similar, but the remaining 30% are much slower. This is where using only the median batch time would be misleading. The authors therefore test trimmed means. Table 4 reports:

Estimator for b̄     μ_gpu    μ_theory    GAP
5% trimmed mean     5.470    4.977       9.0%
10% trimmed mean    5.470    5.862       7.2%

Figure 8 visualizes the trimmed batch-time distributions. I read this appendix as a useful warning: the model can still work in heavy-tailed regimes, but the measurement procedure for b̄ becomes part of the method. Bad estimation of b̄ will produce bad capacity predictions.
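A sketch of the trimmed-mean estimation step. The synthetic batch-time trace and the symmetric two-sided trimming are my assumptions; the paper's exact trimming convention is not spelled out in the material I reviewed.

import numpy as np

def b_bar_trimmed(batch_times_s, trim: float = 0.05) -> float:
    # Drop the fastest and slowest `trim` fraction of batch steps, then average,
    # so a heavy tail of slow batches does not dominate the estimate of b_bar.
    x = np.sort(np.asarray(batch_times_s, dtype=float))
    k = int(len(x) * trim)
    return float(x[k:len(x) - k].mean()) if k > 0 else float(x.mean())

rng = np.random.default_rng(0)
# Hypothetical trace: 70% of batches near 37 ms, 30% in a slower tail near 80 ms.
times = np.concatenate([rng.normal(0.037, 0.002, 7_000), rng.normal(0.080, 0.010, 3_000)])
print(b_bar_trimmed(times, 0.05), b_bar_trimmed(times, 0.10))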


6. What I think is strong about the paper

6.1 It models the right bottleneck

Many serving papers mention KV cache memory, but then evaluate throughput mostly as a scheduler or kernel problem. This paper makes KV cache memory the core of the stability model. That is the right abstraction for long-context and high-concurrency serving.

The function g(s,o) is particularly elegant. It captures the fact that memory pressure is both about how large a request gets and how long it occupies memory.

6.2 The formula is simple enough to operationalize

A formula is useful only if operators can measure its inputs. Here the inputs are measurable:

  • M: effective KV cache token capacity under the model and serving configuration;
  • b̄: representative batch processing time from traces;
  • p(s,o): prompt/output length distribution from request logs;
  • λ: observed or forecast arrival rate.

That makes the paper practical. A team can use the theory as a spreadsheet-level provisioning tool before running expensive full-scale experiments.

6.3 The experiments validate both numbers and dynamics

The paper does not only compare predicted μ with measured μ. It also shows queue-length dynamics on both sides of the threshold. This is important because the definition of stability is behavioral. The queue should stay bounded below the threshold and grow above it. Figures 2 and 4 make that visible.

6.4 The paper is honest about architecture scope

The conclusion explicitly notes that tensor parallelism can be folded in by measuring effective M and b̄ for a TP group, while pipeline parallelism and prefill/decode disaggregation create different queueing networks. That is a good boundary. It prevents the formula from being oversold as a universal serving model.


7. Limitations and boundary conditions

7.1 The model is for stability, not tail latency SLOs

Stability is necessary but not sufficient for good service. A stable queue can still have unacceptable p95 or p99 latency if utilization is too high. The paper gives a provisioning example with target utilization ρ = 90%, estimating the required number of GPUs as approximately:

\frac{\lambda}{\mu\rho}.

In practice, latency-sensitive systems may need a lower utilization target. Stability prevents unbounded queue growth; it does not guarantee a specific response-time distribution.
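The provisioning arithmetic is one line. A minimal sketch with hypothetical numbers: a 40 req/s forecast against the roughly 3.26 req/s single-GPU rate discussed earlier in this review.

import math

def replicas_needed(lam: float, mu: float, rho: float = 0.9) -> int:
    # G >= lambda / (mu * rho), rounded up to whole replicas.
    return math.ceil(lam / (mu * rho))

print(replicas_needed(lam=40.0, mu=3.26, rho=0.90))   # 14 replicas at a 90% utilization target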

7.2 The batch-time estimator is a critical dependency

The formula scales inversely with b̄, so a 20% error in b̄ shifts μ by roughly 20%. The paper's experiments show that the median or a trimmed mean can work, but the right estimator depends on the workload shape.

For prompt-heavy, multimodal, tool-augmented, or kernel-diverse workloads, batch time may be less concentrated. A production deployment should periodically re-estimate b̄ from fresh traces rather than treating it as a fixed constant.

7.3 Output length may not be known at admission time

The theory uses the joint distribution of prompt length and output length. Prompt length is known when the request arrives; output length is not fully known until generation finishes. The model can use historical distributions or max-token settings, but online scheduling decisions may still face uncertainty.

This does not break the capacity-planning use case, but it matters for admission control. A robust production system may need conservative estimates, prediction intervals, or per-tenant workload classes.

7.4 CPU swapping and memory hierarchy are simplified

The paper discusses swap behavior and sets M as an effective memory cap. That is reasonable for the experiments. But if a system frequently swaps KV cache between CPU and GPU, the bottleneck changes. PCIe/NVLink transfer time, CPU memory pressure, and scheduler preemption overhead would need to enter the model.

In other words, this is a model of the regime where the service tries to avoid swap as a steady-state mechanism.

7.5 The experiments focus on one model family and one serving engine

The implementation uses Meta-Llama-3-8B and vLLM v1 on A100 GPUs. That is a credible setup, but it is not exhaustive. Larger models, different attention kernels, quantized KV cache, paged cache implementations, speculative decoding, and prefill/decode disaggregation could all change the effective M and b̄.

The theory may still apply after re-measuring parameters, but the empirical <10% gap should not be assumed automatically for every serving stack.

7.6 Scheduling policy assumptions matter

The stability theorem assumes work-conserving scheduling and maximal batching subject to memory. A scheduler that leaves GPU memory underutilized, reserves capacity inefficiently, or prioritizes requests in a way that fragments memory could perform worse. The theorem is best read as a guarantee for sensible continuous batching policies, not for arbitrary production behavior.

7.7 Large-memory regime assumption

The slack term δ is small only when the largest single request is much smaller than the effective memory capacity:

\operatorname{ess\,sup}(s+o) \ll M.

If individual requests can consume a large fraction of GPU memory, the gap between μ(1 - δ) and μ becomes large. In that regime, capacity planning needs more conservative treatment, and admission control becomes more important.


8. Reproducibility and practical notes

8.1 What is needed to reproduce the capacity formula

To reproduce the theoretical prediction for a new deployment, I would collect:

  1. Request traces with prompt length s and output length o.
  2. Effective KV memory capacity M in token units for the deployed model, precision, cache layout, and utilization cap.
  3. Chunk size ŝ used by the serving engine.
  4. Batch processing time traces under near-saturation load.
  5. Measured arrival rate λ by tenant, endpoint, or time window.

Then compute:

g(s,o) = \frac{(1+s/\hat{s})s + 2os + (1+o)o}{2}

for each request sample, estimate E[g(s,o)], choose b̄, and calculate:

\mu = \frac{M}{\bar b\,\mathbb{E}[g(s,o)]}.

A minimal implementation could be less than a page of Python once traces are available.
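A minimal version of that implementation, assuming request logs in a CSV with prompt_tokens and output_tokens columns; the file name, schema, and constants are placeholders to adapt per deployment, not anything specified by the paper.

import csv

CHUNK = 512          # prefill chunk size s_hat used by the serving engine
M = 131_000          # effective KV capacity in token units for the deployment

def g(s: float, o: float, chunk: int = CHUNK) -> float:
    # Lifetime cumulative KV memory demand, closed form from Section 3.3.
    return ((1 + s / chunk) * s + 2 * o * s + (1 + o) * o) / 2

def predicted_service_rate(trace_csv: str, b_bar: float) -> float:
    # mu = M / (b_bar * E[g]), with E[g] averaged over the logged requests.
    total, n = 0.0, 0
    with open(trace_csv, newline="") as f:
        for row in csv.DictReader(f):
            total += g(float(row["prompt_tokens"]), float(row["output_tokens"]))
            n += 1
    return M / (b_bar * (total / n))

# Example usage: mu = predicted_service_rate("requests.csv", b_bar=0.0372)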

8.2 A practical provisioning recipe

For an operations team, I would use the paper like this:

Step 1: Segment workloads.
Separate chat, coding, long-context RAG, batch summarization, etc.

Step 2: Estimate p(s,o) for each segment.
Use real request logs, not just averages.

Step 3: Run a saturation benchmark per segment.
Estimate b̄ with median or trimmed mean depending on the tail.

Step 4: Compute μ per GPU or per replica group.

Step 5: Pick a utilization target ρ below 1.
Lower ρ if latency SLOs are strict.

Step 6: Provision G >= λ / (μρ).

Step 7: Validate with queue-length traces.
Stable systems should not show persistent positive backlog slope.

The queue-length validation is important. If the formula says stable but the queue still grows, then either p(s,o), M, b̄, or the scheduling assumption is wrong.
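One way to automate that check is to fit a line to the waiting-queue samples and alarm on a persistently positive slope. A minimal sketch with synthetic monitoring data; the sampling interval, synthetic drift, and any alerting threshold are arbitrary choices of mine.

import numpy as np

def backlog_slope(timestamps_s, queue_lengths) -> float:
    # Least-squares slope of queue length over time, in requests per second.
    slope, _intercept = np.polyfit(np.asarray(timestamps_s), np.asarray(queue_lengths), deg=1)
    return float(slope)

rng = np.random.default_rng(0)
t = np.arange(0.0, 3600.0, 10.0)                # one sample every 10 s for an hour
q = 5 + 0.02 * t + rng.normal(0, 2, t.size)     # a queue drifting upward
print(backlog_slope(t, q))                       # ~0.02 req/s of growth: investigate capacity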

8.3 What I would add in a follow-up experiment

I would like to see four additional experiments.

First, a tensor-parallel setup where a TP group is treated as one logical worker. The paper says this can be handled by replacing M and b̄ with effective values. It would be useful to quantify how communication overhead changes the predicted service rate.

Second, a prefill/decode-disaggregated setup. This architecture is increasingly important for LLM serving, and it creates a queueing network rather than one queue. The paper already names this as future work.

Third, a latency-focused study at different utilization targets. Stability is a binary long-run property, but operators care about p50/p95/p99 latency and time-to-first-token. Mapping λ/μ to latency distributions would make the framework more directly actionable.

Fourth, a workload-shift experiment with real diurnal traces. The mixed 2:1 to 1:2 trace is a good start, but production workloads shift by user population, time zone, prompt template, and product feature. Dynamic re-estimation of λ and p(s,o) would be valuable.

8.4 Code and artifact status

The arXiv page and paper text do not provide a public code repository in the extracted materials I reviewed. The experiments are described with enough detail to understand the setup: model, GPU type, vLLM version, chunk size, memory parameter, workload distributions, and measurement definition. Full reproduction would still require the authors' benchmarking harness or a careful reimplementation.

For a reader implementing the analysis independently, the formula is easy. The hard part is matching the serving engine behavior closely enough that M and b̄ mean the same thing in theory and practice.


9. Relationship to prior systems work

This paper sits at the intersection of three lines of work.

The first is LLM serving systems such as vLLM/PagedAttention and Sarathi/Sarathi-Serve. These systems make continuous batching and memory-aware serving practical. The present paper does not replace them; it models the stability behavior of systems in that family.

The second is queueing-oriented LLM inference analysis. Prior work studies stochastic output lengths, batching policies, token clipping, elastic batching, throughput-optimal scheduling, and multi-bin batching. The paper argues that those works generally do not explicitly model the KV cache memory constraint as the key limiting resource.

The third is stochastic bin packing and queueing systems with packing constraints. The connection is natural: each active request consumes part of a memory bin. But LLM inference is not ordinary bin packing because a request’s memory footprint evolves over time, and because prefill and decode phases have different dynamics.

The paper’s contribution is to take ideas from these areas and shape them into a model that looks like LLM serving rather than a generic queue.


10. My final assessment

I think this is a strong systems theory paper because it has a useful balance:

  • the abstraction is simple enough to explain;
  • the theorem is meaningful enough to guide capacity planning;
  • the experiments are close enough to real serving stacks to be credible;
  • the limitations are visible rather than hidden.

The main idea I would carry forward is the memory-work view of LLM inference. Each request contributes an area under a KV-cache curve, not just a token count and not just a peak memory footprint. Once you think in that unit, the service-rate formula becomes intuitive.

For Charles's research interests, the paper is especially relevant because it connects ML systems, queueing theory, and inference resource planning. It gives a concrete way to talk about a problem that is often handled by ad hoc benchmarking: how many GPUs do we need before the queue becomes unstable?

My bottom-line recommendation: use this paper as a reference model for capacity planning in LLM inference, but treat it as the first layer of analysis. After computing μ, still run load tests, inspect queue slopes, measure latency percentiles, and re-estimate parameters whenever workload mix or serving architecture changes.


References and follow-up reading

  1. Nie, Si, and Zhou. A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints. ICML 2026.
  2. Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
  3. Agrawal et al. Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. 2023.
  4. Agrawal et al. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. 2024.
  5. Li, Dai, and Peng. Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents. 2025.
  6. Mitzenmacher and Shahout. Queueing, Predictions, and LLMs: Challenges and Open Problems. 2025.
  7. Bai et al. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks. 2025.

Review written on 2026-05-09.