
vLLM and PagedAttention: Efficient Memory Management for Large Language Model Serving — Technical Review

Paper: Efficient Memory Management for Large Language Model Serving with PagedAttention
Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
Affiliations: UC Berkeley, Stanford University, UC San Diego
Published: SOSP 2023 (arXiv: 2309.06180)
Reviewer: Zhongzhu Zhou
Review Date: February 19, 2026


I. Prerequisites: What You Need to Know

This section builds up all the foundational concepts needed to understand why vLLM matters and how PagedAttention works. Even if you are new to LLM serving systems, this section will give you the complete background.

1.1 Autoregressive Language Model Inference

Large Language Models (LLMs) like GPT, OPT, and LLaMA generate text one token at a time. Given an input prompt (e.g., "The capital of France is"), the model predicts the next token ("Paris"), appends it to the input, and repeats. This process has two phases:

Prefill phase: The model processes the entire input prompt in parallel, computing attention over all prompt tokens. This produces the initial key-value (KV) representations for the prompt.

Decode phase: The model generates one new token per forward pass. At each step, only the new token's query needs to attend to all previous tokens' keys and values. This is the bottleneck because it is sequential—you cannot generate token 5 until token 4 exists.

The decode phase is memory-bound rather than compute-bound: the GPU spends most of its time loading KV cache from memory rather than performing actual computation.

1.2 The Attention Mechanism and KV Cache

The self-attention mechanism computes:

\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

During generation, at each new time step t, we only need the new query vector q_t but must attend to all past keys K_{1:t} and values V_{1:t}. Recomputing all past key-value pairs at every step would cost O(t^2) per token—wasteful because the past keys and values never change.

Solution: KV cache. We store all previously computed key-value pairs in GPU memory. When generating token t, we compute only the new k_t and v_t, append them to the cache, and use the full cache for the attention computation. This reduces generation to O(t) per token.
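The caching scheme above can be sketched in a few lines of NumPy. This is a toy model: random vectors stand in for the model's K/V/Q projection outputs, and d_k = 4 is an arbitrary toy dimension.

```python
import numpy as np

d_k = 4                               # toy head dimension
rng = np.random.default_rng(0)

K_cache, V_cache = [], []             # the KV cache: one entry per past token
for t in range(5):
    # Hypothetical stand-ins for the model's K/V/Q projection outputs
    k_t, v_t = rng.normal(size=d_k), rng.normal(size=d_k)
    K_cache.append(k_t)               # append instead of recomputing K_{1:t}
    V_cache.append(v_t)
    q_t = rng.normal(size=d_k)        # query for the newest token

    K = np.stack(K_cache)             # (t+1, d_k): the full cache so far
    scores = K @ q_t / np.sqrt(d_k)   # attend over all cached keys: O(t) work
    w = np.exp(scores - scores.max())
    o_t = (w / w.sum()) @ np.stack(V_cache)   # attention output for step t
```

Only the newest key and value are computed each step; everything else is read back from the cache, which is exactly the memory the rest of this review is about managing.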

1.3 KV Cache Memory: Why It Dominates

The KV cache is enormous. For a single request, the KV cache size is:

\text{KV cache size} = 2 \times L \times H \times d_h \times s \times \text{sizeof(dtype)}

where:

  • Factor 2: one for keys, one for values
  • L: number of Transformer layers (e.g., 96 for OPT-175B)
  • H: number of attention heads (e.g., 96)
  • d_h: dimension per head (e.g., 128)
  • s: sequence length (e.g., 2048 tokens)
  • sizeof(dtype): bytes per element (2 for FP16)

Concrete example: For OPT-13B with 40 layers, 40 heads, d_h = 128, sequence length 2048 in FP16:

2 \times 40 \times 40 \times 128 \times 2048 \times 2 \text{ bytes} \approx 1.6 \text{ GB per request}

An A100 GPU with 80GB memory can hold the model weights (~26GB for OPT-13B in FP16) plus KV cache for only ~33 concurrent requests. The KV cache quickly becomes the memory bottleneck.
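The formula and the OPT-13B example can be checked mechanically. The ~26 GB weight figure is taken from the text above; the exact concurrency figure depends on GB-vs-GiB rounding.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, dtype_bytes=2):
    """2x for keys and values; FP16 is 2 bytes per element."""
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes

# OPT-13B: 40 layers, 40 heads, d_h = 128, sequence length 2048, FP16
size = kv_cache_bytes(layers=40, heads=40, head_dim=128, seq_len=2048)
print(size)                         # 1677721600 bytes, i.e. ~1.6 GB per request

# Rough concurrency on an 80 GB A100 after ~26 GB of weights
print(int((80e9 - 26e9) // size))   # 32, in line with the ~33 quoted above
```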

1.4 The Memory Waste Problem in Existing Systems

Before vLLM, LLM serving systems pre-allocated contiguous memory blocks for each request's KV cache. Since we don't know how many tokens a request will generate in advance, systems must reserve memory for the maximum possible sequence length (e.g., 2048 tokens).

This leads to three types of memory waste:

Internal fragmentation: A request that only generates 100 tokens still has 2048 tokens' worth of memory reserved. The unused 1948 slots are wasted.

External fragmentation: As requests arrive and complete, the free memory becomes scattered into non-contiguous chunks. A new request needing 500 contiguous slots might fail even though 500 slots exist in total—just not consecutively.

Reservation waste: Systems that reserve for the maximum length waste memory proportional to the difference between the maximum and actual output length.

The paper reports that existing systems waste 60–80% of KV cache memory. This directly limits throughput because fewer concurrent requests can be served.

1.5 Virtual Memory and Paging in Operating Systems

The key inspiration for PagedAttention comes from how operating systems manage memory. In early computers, each program needed a contiguous block of physical memory. This led to the same fragmentation problems described above. The solution was virtual memory with paging:

  1. Virtual addresses: Each program sees a contiguous virtual address space, regardless of physical memory layout.
  2. Pages: Memory is divided into fixed-size pages (e.g., 4KB).
  3. Page table: A mapping from virtual page numbers to physical page locations.
  4. Non-contiguous allocation: Physical pages can be scattered anywhere—the page table handles the translation.

This eliminates both internal fragmentation (pages are small, so waste per allocation is bounded) and external fragmentation (any free page anywhere can be used).
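The four steps above amount to a tiny address-translation routine. This sketch uses made-up page-table contents to show how scattered physical frames appear contiguous to the program:

```python
PAGE_SIZE = 4096

# Page table: virtual page number -> physical frame number.
# The frame numbers (7, 3, 11) are arbitrary: frames can sit anywhere.
page_table = {0: 7, 1: 3, 2: 11}

def translate(virtual_addr):
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)   # split into page + offset
    return page_table[vpn] * PAGE_SIZE + offset     # look up the frame

# Virtual addresses 0 .. 3*4096-1 look contiguous to the program,
# even though the backing frames are scattered in physical memory.
print(translate(0))      # 28672 (frame 7, offset 0)
print(translate(4100))   # 12292 (frame 3, offset 4)
```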

PagedAttention applies this same idea to KV cache memory: instead of storing each request's KV cache in a contiguous block, it divides the cache into fixed-size "blocks" and maps them via a "block table."

1.6 Serving System Architecture and Batching

LLM serving systems handle multiple concurrent requests. Key concepts:

Dynamic batching: Instead of processing requests one at a time, the system batches multiple requests together to better utilize GPU parallelism.

Iteration-level scheduling (introduced by Orca): Rather than waiting for all requests in a batch to finish, the scheduler can evict completed requests and add new ones at each generation step. This prevents short requests from being blocked by long ones.

Throughput vs. latency: Higher throughput means serving more requests per second. The key to throughput is maximizing the number of requests that can be processed simultaneously, which requires efficient memory management.

1.7 Copy-on-Write (CoW)

Copy-on-write is an OS optimization technique. When a process forks, the child initially shares the parent's memory pages. Only when either process tries to write to a shared page is a copy made. This saves memory when data is mostly read-only.

In LLM serving, this is relevant for parallel sampling and beam search, where multiple output sequences share the same prompt. Their KV cache for the prompt tokens is identical and can be shared until they diverge.


II. What This Paper Does: Core Contributions

2.1 The Core Problem

Existing LLM serving systems waste 60–80% of KV cache memory due to fragmentation and over-reservation. Since the KV cache is the primary memory bottleneck, this waste directly translates to lower throughput—fewer requests can be batched together, and the GPU is underutilized.

2.2 Three Key Contributions

  1. PagedAttention: A new attention algorithm that stores KV cache in non-contiguous, fixed-size blocks and accesses them through a block table—directly inspired by OS virtual memory paging.

  2. Block-level memory management: A memory manager that allocates, frees, and shares KV cache blocks on demand, eliminating both internal and external fragmentation.

  3. vLLM serving system: An end-to-end LLM serving engine built around PagedAttention, achieving 2–4× throughput improvements over state-of-the-art systems.


III. Methodology in Depth

3.1 PagedAttention: The Core Algorithm

PagedAttention partitions each sequence's KV cache into blocks, where each block holds the key-value pairs for a fixed number of tokens B (the block size, e.g., B = 16).

For a sequence with s tokens, the KV cache requires ⌈s/B⌉ blocks. These blocks need not be contiguous in physical GPU memory.

How attention is computed: Given the query vector q_t for the new token, and the KV cache stored in blocks {(K_j, V_j)} for j = 1, …, ⌈t/B⌉, the attention output is:

A_j = \frac{\exp(q_t K_j^\top / \sqrt{d_h})}{\sum_{j'} \exp(q_t K_{j'}^\top / \sqrt{d_h})}

o_t = \sum_j A_j V_j

The computation proceeds block-by-block:

  1. For each block j, compute the attention scores q_t K_j^\top / \sqrt{d_h}
  2. Track the running maximum for numerical stability (online softmax)
  3. Compute weighted sum of values across all blocks

This is essentially standard attention but with key-value pairs gathered from potentially non-contiguous memory locations via the block table.
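Under these definitions, the block-by-block computation gives exactly the same result as monolithic attention over contiguous memory. A two-pass NumPy sketch of the idea (production kernels fuse this into a single online-softmax pass; sizes here are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, B, t = 8, 4, 10                   # head dim, block size, tokens so far

q = rng.normal(size=d_h)               # query for the new token
K = rng.normal(size=(t, d_h))          # logically contiguous keys/values...
V = rng.normal(size=(t, d_h))

# ...but physically stored in fixed-size blocks (here: ceil(10/4) = 3 blocks)
blocks = [(K[i:i + B], V[i:i + B]) for i in range(0, t, B)]

# Pass 1: per-block scores, tracking the global max for numerical stability
scores = [Kj @ q / np.sqrt(d_h) for Kj, _ in blocks]
m = max(s.max() for s in scores)

# Pass 2: softmax-weighted sum of values, gathered block by block
Z = sum(np.exp(s - m).sum() for s in scores)
o = sum(np.exp(s - m) @ Vj for s, (_, Vj) in zip(scores, blocks)) / Z

# Matches the monolithic computation on contiguous memory
w = np.exp(K @ q / np.sqrt(d_h) - m)
assert np.allclose(o, (w / w.sum()) @ V)
```

The real kernel gathers each K_j, V_j through the block table instead of slicing a contiguous array; the arithmetic is unchanged.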

3.2 Block Table: The Translation Layer

Each sequence maintains a block table that maps logical block indices to physical block locations:

\text{BlockTable}[\text{sequence\_id}][\text{logical\_block\_idx}] = \text{physical\_block\_ptr}

Example: Suppose we have a prompt "Four score and seven years ago our" tokenized into 7 tokens, with block size B = 4:

  • Logical block 0 (tokens 0–3): might be stored at physical block 7
  • Logical block 1 (tokens 4–6): might be stored at physical block 1

When new tokens are generated, they fill the current last block. When the last block is full, a new physical block is allocated and appended to the block table.

On-demand allocation: Unlike pre-allocation schemes, vLLM allocates new blocks only when needed. This means a request only uses memory proportional to its actual current sequence length, not its maximum possible length.
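A toy version of this bookkeeping (all names hypothetical) makes the on-demand growth concrete: blocks come from a shared free pool in whatever order they happen to be available, and a new one is taken only when the last block fills up.

```python
BLOCK = 4                     # block size B from the example above

class SeqBlocks:
    """Block table for one sequence: logical index -> physical block id."""
    def __init__(self, free_pool):
        self.free = free_pool
        self.table = []       # the block table
        self.n_tokens = 0

    def append_token(self):
        if self.n_tokens % BLOCK == 0:          # last block full, or none yet
            self.table.append(self.free.pop())  # allocate on demand
        self.n_tokens += 1

free = list(range(10))        # physical blocks; any free block will do
seq = SeqBlocks(free)
for _ in range(7):            # the 7-token prompt from the example
    seq.append_token()
print(seq.table)              # two physical blocks, not necessarily adjacent
```

Memory held is always proportional to ⌈n_tokens/B⌉ blocks, never to a pre-reserved maximum length.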

3.3 Memory Management

The block manager maintains a free block pool—a list of all unallocated physical blocks. Operations:

  • Allocate: Pop a block from the free pool and assign it to a sequence.
  • Free: When a sequence completes, return all its blocks to the free pool.
  • Share: For parallel sampling/beam search, multiple sequences can point to the same physical block via their block tables.
  • Copy-on-Write: When a shared block needs to be modified (i.e., a new token is written to it), a copy is made first. Reference counting tracks how many sequences share each block.

Near-zero waste: Internal fragmentation is bounded by B − 1 tokens per sequence (the last block may not be full). With B = 16 and sequences of hundreds of tokens, this waste is negligible (~1%). External fragmentation is eliminated entirely because any free block can be used for any request.
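The four operations fit in a small reference-counted pool. A sketch under the same toy assumptions (no actual KV data is stored or copied here):

```python
class BlockManager:
    """Free pool plus reference counting, as described above (toy model)."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):                 # pop a block from the free pool
        blk = self.free.pop()
        self.refcount[blk] = 1
        return blk

    def share(self, blk):               # another sequence points at blk
        self.refcount[blk] += 1

    def release(self, blk):             # a sequence drops blk
        self.refcount[blk] -= 1
        if self.refcount[blk] == 0:     # last owner gone: back to the pool
            del self.refcount[blk]
            self.free.append(blk)

    def write(self, blk):
        """Copy-on-write: private blocks are written in place;
        shared blocks get a private copy first (data copy omitted)."""
        if self.refcount[blk] == 1:
            return blk
        self.refcount[blk] -= 1
        return self.allocate()

mgr = BlockManager(4)
a = mgr.allocate()
mgr.share(a)                  # two sequences now share block `a`
b = mgr.write(a)              # writing triggers a copy into a fresh block
print(a, b)                   # different physical blocks
```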

3.4 Supporting Decoding Algorithms

vLLM abstracts all decoding algorithms through three primitive operations:

fork(seq): Create a new sequence that shares the existing sequence's KV cache blocks. The reference counts of shared blocks are incremented. Used at the start of parallel sampling or beam search.

append(seq, token): Append a new token to the sequence. If the last block is full, allocate a new block. If the last block is shared (reference count > 1), copy-on-write: allocate a new block, copy the old block's data, then append.

free(seq): Delete a sequence and decrement reference counts of its blocks. Blocks with reference count 0 are returned to the free pool.

Parallel sampling: Given a prompt, we fork it n times to create n parallel sequences. They all share the prompt's KV cache blocks. As each sequence generates different tokens, copy-on-write creates separate blocks only for the divergent tokens.

Beam search: More complex sharing patterns where beams may share variable-length prefixes. As beams are pruned and expanded, fork and free operations dynamically manage the sharing.
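The fork/append life cycle for parallel sampling can be traced with a toy block store (hypothetical names; the actual data movement is omitted). The full prompt blocks stay shared; copy-on-write duplicates only the partially filled last block once the forks diverge.

```python
BLOCK = 4

class Store:
    def __init__(self, n):
        self.free = list(range(n))
        self.ref = {}                 # physical block -> reference count

class Seq:
    def __init__(self, store):
        self.s, self.table, self.n = store, [], 0

    def fork(self):
        """Child shares every block: just bump the reference counts."""
        child = Seq(self.s)
        child.table, child.n = list(self.table), self.n
        for b in self.table:
            self.s.ref[b] += 1
        return child

    def append(self):
        if self.n % BLOCK == 0:                 # need a fresh block
            b = self.s.free.pop()
            self.s.ref[b] = 1
            self.table.append(b)
        elif self.s.ref[self.table[-1]] > 1:    # shared last block: copy-on-write
            self.s.ref[self.table[-1]] -= 1
            b = self.s.free.pop()
            self.s.ref[b] = 1                   # (data copy omitted in this toy)
            self.table[-1] = b
        self.n += 1

store = Store(16)
prompt = Seq(store)
for _ in range(7):                    # 7-token prompt -> 2 blocks
    prompt.append()
s1, s2 = prompt.fork(), prompt.fork()
s1.append()                           # each fork generates its own next token
s2.append()
print(s1.table[0] == s2.table[0])     # True: full prompt block still shared
print(s1.table[-1] != s2.table[-1])   # True: divergent blocks are private
```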

3.5 Scheduling Policy

vLLM uses First-Come-First-Served (FCFS) scheduling with the following mechanism:

  1. When GPU memory is insufficient for a new request, vLLM does not preempt running requests. Instead, new requests wait in a queue.
  2. When a running request needs a new block and none is free, vLLM preempts: the most recently arrived request is paused and its blocks are either (a) swapped to CPU memory or (b) freed and recomputed when the request resumes.
  3. Preempted sequences are rescheduled when memory becomes available.

This approach ensures fairness (earlier requests get priority) and prevents starvation.
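A minimal trace of this policy (toy model: one block per request, hypothetical request names) shows admission in arrival order and preemption of the latest arrival under memory pressure:

```python
from collections import deque

waiting = deque(["r1", "r2", "r3"])   # FCFS arrival order
running, preempted = [], []
free_blocks = 2

# Admit waiting requests while memory allows (one block each in this toy)
while waiting and free_blocks > 0:
    running.append(waiting.popleft())
    free_blocks -= 1

# Memory pressure: a running request needs another block but none is free.
# Pause the most recently arrived running request and reclaim its memory
# (in vLLM: swap its blocks to CPU, or drop them and recompute later).
if free_blocks == 0 and running:
    preempted.append(running.pop())   # latest arrival is the victim
    free_blocks += 1

print(running, preempted)             # the earliest request keeps running
```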

3.6 GPU Kernel Optimizations

The paper describes three critical kernel-level optimizations:

1. Fused reshape and block write: When a new KV pair is computed, it must be reshaped and written to the correct block via the block table. These operations are fused into a single GPU kernel to minimize launch overhead.

2. Fused block read and attention: The attention computation must read KV data from non-contiguous blocks. The paper adapts FasterTransformer's attention kernel to read KV cache according to the block table and perform attention on the fly. Each GPU warp (32 threads) is assigned to read one block, ensuring coalesced memory access.

3. Fused block copy: Copy-on-write operations may involve copying multiple non-contiguous blocks. Instead of issuing separate cudaMemcpyAsync calls for each block, a custom kernel batches all copy operations into a single kernel launch.


IV. Experimental Results

4.1 Setup

Models: OPT-13B (1 A100 GPU), OPT-66B (4 A100 GPUs), OPT-175B (8 A100 GPUs), LLaMA-13B (1 A100 GPU).

Datasets:

  • ShareGPT: real ChatGPT conversations (long, high variance; avg input 161 tokens, avg output 338 tokens)
  • Alpaca: instruction-following dataset (short, low variance; avg input 19 tokens, avg output 58 tokens)

Baselines:

  • FasterTransformer: Optimized inference engine with static batching (max batch size pre-set).
  • Orca (Max): Reserves maximum sequence length (2048) for every request.
  • Orca (Pow2): Over-reserves by at most 2× (e.g., 32 slots for 25 actual tokens).
  • Orca (Oracle): Knows the exact output length in advance—an infeasible upper bound.

4.2 Basic Sampling (Single Output per Request)

On the ShareGPT dataset, vLLM sustains:

  • 1.7×–2.7× higher request rates than Orca (Oracle)—the infeasible upper bound
  • 2.7×–8× higher request rates than Orca (Max)
  • Up to 22× higher request rates than FasterTransformer

On OPT-13B with ShareGPT (2 req/s): vLLM batches an average of 30.42 requests simultaneously, compared to 13.62 for Orca (Oracle), 9.81 for Orca (Pow2), and 7.00 for Orca (Max). This 2.2–4.3× improvement in batch size directly translates to proportional throughput gains.

On the Alpaca dataset (shorter sequences), vLLM still demonstrates significant improvements. The one exception is OPT-175B on Alpaca, where the large GPU memory (8 A100s) and short sequences mean even Orca can batch many requests—the bottleneck becomes compute rather than memory.

4.3 Parallel Sampling

With parallel sampling (multiple output sequences per prompt), vLLM's memory sharing via copy-on-write becomes increasingly beneficial:

Memory savings from KV block sharing:

  • Parallel size 2: 6.1% memory saving
  • Parallel size 4: 8.5% memory saving
  • Parallel size 6: 9.8% memory saving

vLLM's throughput advantage over Orca grows with the parallel size because the shared prompt KV cache avoids redundant storage.

Beam search offers even greater sharing potential because beams frequently share long prefixes:

Memory savings from KV block sharing:

  • Beam width 2: 37.6% memory saving
  • Beam width 4: 53.1% memory saving
  • Beam width 6: 55.2% memory saving

With beam width 6, vLLM achieves 2.3× higher throughput than Orca (Oracle) on OPT-13B/Alpaca, compared to only 1.3× for basic sampling. The improvement scales with the amount of potential sharing.

4.4 Chatbot Serving (LLaMA-13B)

Using LLaMA-13B on ShareGPT traces, vLLM achieves 2× higher throughput than Orca systems while maintaining the same latency. This demonstrates practical real-world applicability.


V. Analysis: Why PagedAttention Works So Well

5.1 Eliminating Memory Waste

The fundamental reason for vLLM's superiority is waste elimination:

System          Internal fragmentation   External fragmentation   Reservation waste
Orca (Max)      None                     Severe                   Extreme (reserves max length)
Orca (Pow2)     None                     Moderate                 Moderate (up to 2× over-reserve)
Orca (Oracle)   None                     Moderate                 None (infeasible)
vLLM            ≤ B − 1 tokens           None                     None

vLLM is the only system that achieves near-zero waste across all three categories, and it does so without requiring knowledge of future output lengths.

5.2 Memory Utilization Translates to Throughput

Higher memory efficiency → larger batch sizes → better GPU utilization → higher throughput. The relationship is approximately linear: a 2× reduction in per-request memory waste allows ~2× more concurrent requests, which delivers ~2× throughput improvement (until compute becomes the bottleneck).

5.3 The "Oracle" Comparison

The most striking result is that vLLM outperforms Orca (Oracle)—a system with perfect knowledge of output lengths. This happens because:

  1. Orca (Oracle) still suffers from external fragmentation (contiguous allocation leaves gaps).
  2. vLLM's block-level allocation eliminates external fragmentation entirely.
  3. vLLM's internal fragmentation (< B tokens) is much smaller than Orca's external fragmentation.

VI. Limitations and Boundary Conditions

6.1 Block Size Sensitivity

The choice of block size B involves a tradeoff:

  • Small B: Less internal fragmentation, but more block table entries and higher bookkeeping overhead.
  • Large B: Lower overhead, but more wasted memory in the last block.

The paper uses B = 16 as a reasonable default but does not extensively study the sensitivity.

6.2 Attention Kernel Overhead

Non-contiguous memory access patterns in PagedAttention are inherently less cache-friendly than contiguous access. The paper reports that PagedAttention's kernel is 20–26% slower than FasterTransformer's optimized kernel for individual attention computations. However, this overhead is amply compensated by the throughput gains from larger batch sizes.

6.3 CPU Overhead of Block Management

The block manager, scheduler, and block table operations run on the CPU. At extremely high request rates, CPU scheduling could become a bottleneck. The paper does not explore this regime in detail.

6.4 Limited to KV Cache Optimization

vLLM optimizes only the KV cache memory. It does not address memory optimization for model weights, activations, or optimizer states during training. It is a serving-time optimization.

6.5 Single-Node Scope

The paper evaluates up to 8 GPUs on a single node. Distributed serving across multiple nodes with PagedAttention is left to future work. (Note: subsequent vLLM releases have added distributed support.)

6.6 No Quantization Integration

The paper does not combine PagedAttention with KV cache quantization (e.g., FP8 or INT4 KV cache). Combining these techniques could yield multiplicative memory savings.


VII. Impact and Significance

7.1 Paradigm Shift in LLM Serving

vLLM and PagedAttention fundamentally changed how the community thinks about LLM serving. Before this work, memory management was considered a secondary concern compared to compute optimization. This paper demonstrated that memory management is the primary bottleneck for LLM serving throughput and that solutions can be borrowed from decades-old OS concepts.

7.2 Industry Adoption

vLLM has become one of the most widely adopted open-source LLM serving frameworks, with adoption by major companies and cloud providers. Its API compatibility with OpenAI's interface made it a drop-in replacement for many deployments.

7.3 Research Influence

PagedAttention inspired a wave of follow-up work on KV cache optimization, including:

  • KV cache compression and quantization
  • Prefix caching across requests
  • Speculative decoding integration
  • Disaggregated prefill and decode
  • Token-level (rather than block-level) memory management

7.4 Connection to OS Principles

The paper beautifully demonstrates that classical systems concepts (virtual memory, paging, copy-on-write) can be adapted to modern ML systems. It serves as an exemplar for cross-pollination between systems and ML research.


VIII. Reproducibility

8.1 Code Availability

vLLM is open-source: https://github.com/vllm-project/vllm

8.2 System Specifications

Component           Details
Framework           8.5K lines of Python + 2K lines of C++/CUDA
Frontend            FastAPI (OpenAI-compatible API)
Hardware            NVIDIA A100 GPUs (Google Cloud A2 instances)
Models              OPT-13B/66B/175B, LLaMA-13B
Block size B        16 tokens
Scheduler           FCFS with swapping-based preemption
Evaluation traces   1-hour traces (15 min for OPT-175B)
Request arrival     Poisson distribution with varying rates

8.3 Baseline Implementations

Since Orca is not publicly available, the authors implemented their own version with three memory allocation strategies (Max, Pow2, Oracle). FasterTransformer was used with a custom dynamic batching scheduler.


IX. Conclusion

vLLM and PagedAttention represent a landmark contribution to LLM serving systems. By recognizing that KV cache memory management is the critical throughput bottleneck and applying the elegant abstraction of OS-style paging, the paper achieves 2–4× throughput improvements over state-of-the-art systems—even outperforming systems with perfect output length prediction. The core insight that non-contiguous, block-level memory management can eliminate fragmentation without significant computational overhead has become the foundation for modern LLM serving infrastructure. The work stands as a powerful example of how classical systems principles can solve contemporary AI challenges.


References

  1. Kwon, W., Li, Z., Zhuang, S., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. arXiv:2309.06180.
  2. Yu, G.-I., et al. (2022). "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022.
  3. NVIDIA. FasterTransformer. https://github.com/NVIDIA/FasterTransformer.
  4. Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS 2017.
  5. Zhang, S., et al. (2022). "OPT: Open Pre-trained Transformer Language Models." arXiv:2205.01068.
  6. Touvron, H., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971.
  7. Tanenbaum, A. S. (2007). Modern Operating Systems. Pearson Education.