Review date: 2026-05-10
Review author: Zhongzhu Zhou
Paper reviewed: Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
Paper authors: Shi Qiu, Yifan Hu, Xintao Wang, Wenhao Zhu, Jianqin Yan, Hao Chen, Kaiqiang Xu, Kai Chen, Yiming Zhang
arXiv: 2605.03375v1, 2026-05-05
Venue/status: arXiv preprint, subject: Operating Systems
Source used for this review: src/related-documents/papers/2605.03375-Tutti.pdf
Short answer
This paper studies a very practical bottleneck in long-context LLM serving: how to make SSD-backed KV cache reuse fast enough that it is actually better than recomputation.
Modern serving engines such as vLLM and SGLang already use paged KV cache layouts, prefix caching, continuous batching, and tiered memory. These techniques let a system reuse previously computed key/value states instead of recomputing a long prompt every time. The difficulty is that long-context workloads can produce far more KV cache than GPU HBM or even CPU DRAM can hold. NVMe SSDs look attractive because they offer much larger and cheaper capacity, but prior SSD-backed KV cache paths often perform poorly. The paper argues that the core reason is not SSD bandwidth itself; it is the mismatch between fine-grained paged KV cache objects and CPU-centric I/O control paths.
Tutti is the authors' answer. It is a GPU-centric SSD-backed KV cache store integrated with vLLM. Its main idea is to remove the CPU from the critical data and I/O-control path between GPU HBM and NVMe SSDs. It does that through three design pieces:
- a GPU-native KV cache object abstraction that lets the system move layer-wise KV objects instead of thousands of tiny disconnected blocks;
- GPU io_uring, a GPU-side asynchronous I/O mechanism with submission/completion queues, I/O control blocks, and GPU-issued NVMe operations;
- a slack-aware I/O scheduler that places reads and writes into profiled GPU slack windows rather than letting storage kernels interfere with inference kernels.
The headline results are substantial. Compared with a state-of-the-art GDS-enabled SSD-backed LMCache path, Tutti reduces TTFT by 78.3% under strict SLO constraints, doubles the achievable request rate, and lowers serving cost by about 27%. In the bandwidth microbenchmarks, Tutti reaches up to 25.9 GB/s retrieval bandwidth and up to 2.08× the retrieval bandwidth of LMCache-GDS. In the pipeline analysis, Tutti pushes the compute-to-I/O crossover point to a 98.3% cache hit rate, meaning that for most tested hit rates the system remains compute-bound rather than storage-bound.
My main takeaway is that Tutti is not just another cache tier. It is a storage-stack redesign around the actual shape of LLM inference. The paper is most interesting when read as a systems lesson: once inference kernels become efficient, the old assumption that the CPU can orchestrate every I/O request stops working. For long-context serving, KV cache storage must become part of the GPU execution plan.
1. Prerequisites
1.1 Why KV cache exists
Autoregressive Transformer decoders generate text one token at a time. At every decode step, the new token needs to attend to all previous tokens. If the model recomputed the key and value vectors for the entire prefix at every step, decoding would be far too expensive.
The KV cache avoids this repeated work. During prefill, the model computes key and value tensors for the input prompt and stores them. During decode, each new generated token appends another key/value entry. Future tokens can then attend to the cached entries rather than recomputing them.
A simplified view is:
```
prefill: prompt tokens -> K/V tensors computed and stored in the cache
decode:  each new token -> attends to the cached K/V, appends its own entry
```
The tradeoff is simple but important:
- KV cache saves compute.
- KV cache consumes memory.
- Longer prompts and longer conversations make the cache grow.
- More concurrent users multiply the total cache footprint.
This is why long-context serving is both a compute problem and a memory/storage problem.
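A back-of-envelope sketch makes the footprint concrete. The formula (two tensors per layer per token, times KV heads, head dimension, and dtype width) is standard; the model dimensions below are illustrative, roughly Llama3-8B-shaped, and are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, dtype_bytes=2):
    """Estimate KV cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * dtype_bytes

# Illustrative numbers: 32 layers, 8 KV heads, head_dim 128, fp16.
per_user = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=128_000)
print(f"128K-token context: {per_user / 2**30:.1f} GiB per sequence")   # ~15.6 GiB
print(f"100 concurrent users: {100 * per_user / 2**40:.2f} TiB total")  # ~1.5 TiB
```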
1.2 Prefix caching and why long-context services want it
Prefix caching reuses KV cache entries across requests that share a prefix. This matters for multi-turn chat, retrieval-augmented generation, agent workflows, code assistants, document QA, and applications where a long system prompt or retrieved context is reused repeatedly.
For example, suppose many requests share the same 80K-token repository context:
```
request 1: [80K shared repository context] + question A
request 2: [80K shared repository context] + question B
request 3: [80K shared repository context] + question C
```
If the service can reuse the 80K-token prefix KV cache, it avoids redoing a large amount of prefill work. The paper cites the general promise of prefix caching as improving SLOs and reducing per-token cost by up to an order of magnitude in modern inference services.
However, high prefix reuse makes cache capacity more valuable. A small HBM-only cache may evict useful prefixes before they can be reused. DRAM extends capacity, but the paper notes that even about 2 TB of DRAM may retain only around five minutes of KV cache at scale. SSDs can provide tens or hundreds of terabytes, so they are the natural next tier.
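A rough sanity check of that DRAM figure; the aggregate production rate is my inference from the two cited numbers, not something the paper states:

```python
dram_capacity_gb = 2_000          # ~2 TB of host DRAM
retention_minutes = 5             # the paper's cited retention at scale
production_rate = dram_capacity_gb / (retention_minutes * 60)
print(f"Implied aggregate KV production: ~{production_rate:.1f} GB/s")  # ~6.7 GB/s
```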
1.3 HBM, DRAM, SSD: same cache, very different behavior
It is tempting to view storage tiers as a simple capacity ladder:
```
HBM  -> fastest, smallest, most expensive
DRAM -> slower, larger, cheaper
SSD  -> slowest, largest, cheapest
```
For ordinary files, this mental model is often enough. For LLM KV cache, it is not.
HBM is directly accessible by GPU kernels. DRAM is slower, but still has relatively good random-access behavior and can often be overlapped with computation through carefully designed copy kernels. SSDs provide huge capacity and strong sequential bandwidth, but they dislike large numbers of tiny random operations. Unfortunately, modern KV cache layouts often create exactly that pattern.
The paper emphasizes that the SSD tier is not slow because enterprise SSDs lack raw bandwidth. The specific Solidigm D7-PS1010 devices used in the evaluation are capable of high throughput. The problem is that paged KV cache objects are fragmented, and CPU-managed I/O turns that fragmentation into a control-path bottleneck.
1.4 Paged KV cache layout
Modern LLM serving engines use paged KV memory management. Instead of requiring every sequence's KV cache to be stored as one contiguous region, the engine divides KV cache into blocks. vLLM's PagedAttention is the canonical example.
Paged layouts are good for GPU memory utilization because user requests have different lengths and grow dynamically. The engine can allocate blocks as needed:
```
logical KV for one sequence:  [block 0][block 1][block 2]...
physical placement:           blocks allocated on demand, scattered rather than contiguous
```
The benefit is flexible allocation. The cost is fragmentation. When a long prefix is evicted to SSD and later restored, the system may have to retrieve many small pieces. The paper gives a concrete example: for a 64-layer Qwen3-32B model with block size 64, reloading a 128K-token KV cache can require about 256K scattered 80 KB objects. That is exactly the kind of access pattern that can make CPU-side I/O submission and synchronization dominate the raw transfer time.
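The arithmetic behind that example is easy to reproduce; the ~20 GB total is my multiplication of the paper's own object count and object size:

```python
layers = 64            # Qwen3-32B layer count used in the example
block_size = 64        # tokens per KV block
tokens = 128_000       # cached prefix length
kv_tensors = 2         # one K object and one V object per layer per block

objects = kv_tensors * layers * (tokens // block_size)
print(f"objects to reload: {objects:,}")            # ~256K scattered objects
print(f"data moved: {objects * 80 / 1e6:.1f} GB")   # at ~80 KB per object, ~20.5 GB
```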
1.5 TTFT, ITL, and why storage stalls hurt user experience
The paper evaluates serving performance mainly through two latency metrics.
Time to First Token (TTFT) measures how long it takes after a request arrives before the service emits the first token. TTFT is heavily affected by prefill and prefix-cache retrieval.
Inter-Token Latency (ITL) measures the time between generated tokens during decode. ITL affects streaming smoothness and perceived generation speed.
Storage stalls hurt both. If the GPU waits for KV cache objects to arrive from SSD, the model kernel cannot do useful work. The paper often describes this as a GPU bubble: expensive GPU resources are allocated, but they are waiting instead of computing.
A useful mental picture is:
```
ideal:   compute compute compute compute compute
stalled: compute [wait for SSD] compute [wait for SSD] compute
```
The central question is therefore not merely "can we store KV cache on SSD?" The question is whether SSD-backed retrieval can be orchestrated so well that the GPU rarely sees the storage latency.
1.6 GPU Direct Storage is not the same as GPU-centric I/O
NVIDIA GPUDirect Storage (GDS) allows data to move between storage and GPU memory without a CPU bounce buffer. That removes one data-copy cost, but it does not automatically remove the CPU from the I/O control path. The CPU may still initiate and manage each I/O operation.
This distinction is crucial. The paper's critique is that GDS-enabled LMCache is still CPU-centric because each I/O must be initiated by the CPU. If the workload generates many small random operations, the CPU becomes a submission bottleneck even though the data path avoids an extra copy.
Tutti's goal is stronger: it wants the GPU to issue massive parallel I/O requests directly and asynchronously, with the CPU preparing coarse metadata rather than controlling every operation on the critical path.
2. What this paper does
Tutti targets long-context LLM serving systems where prefix caching is valuable but KV cache capacity exceeds HBM and DRAM. The paper argues that SSD-backed KV cache is necessary for capacity, but existing approaches fail because their I/O path is not designed for paged KV cache.
The design can be summarized as a shift from this:
```
CPU-centric path:
  GPU compute <-> HBM, while the CPU initiates and tracks every SSD I/O
  (GDS can remove the bounce buffer, but the CPU still drives each request)
```

to this:

```
GPU-centric path:
  CPU prepares coarse metadata and I/O contexts up front
  GPU issues and completes parallel NVMe I/O directly against the SSD
```
Figure 1 in the paper captures this distinction. LMCache with or without GDS still depends on CPU control for the I/O path. Tutti moves the critical control path onto the GPU. In the authors' framing, the CPU should be responsible for coarse orchestration and metadata setup, while the GPU should drive high-parallelism I/O once inference begins.
The paper's core claim is that SSD capacity can be made practical if three conditions hold:
- The abstraction matches KV cache objects. Storage should expose objects that align with layer-wise KV cache movement, not just raw blocks or generic files.
- The control path is massively parallel. I/O submission/completion should not require CPU intervention for each small object.
- I/O must be scheduled with compute awareness. Even GPU-issued I/O can hurt inference if storage kernels compete with model kernels for SMs or if reads and writes collapse SSD bandwidth.
The design is therefore a co-design across memory management, storage I/O, GPU scheduling, and the serving engine. This is why the paper is more than a storage optimization: it changes how the inference runtime thinks about cache retrieval and persistence.
3. Method details
3.1 Motivation from Figures 2 and 3: SSD bottlenecks are structural
Figure 2 compares vLLM with LMCache across HBM, DRAM, and SSD tiers on Llama3-8B, with 64K sequence length and 75% hit rate. The figure shows a pattern that is easy to miss if we only think about raw capacity:
- DRAM-backed cache remains relatively close to HBM.
- SSD-backed cache creates large GPU bubbles.
- GDS helps with data movement but still leaves large stalls.
- As vLLM's compute side improves from v0.12.0 to v0.17.0, the storage bottleneck becomes more visible.
The last point is important. A faster model runtime can make a storage path look worse, because the time hidden behind computation shrinks. If the storage path does not improve, its fraction of end-to-end latency grows.
Figure 3 studies CPU vs GPU hash performance for dynamic cache management. The paper reports that GPU hash table insert and lookup can be much slower than CPU-side hash tables for the tested sequence lengths: insert costs are 9.0× to 24.2× higher and lookup costs are 25.6× to 50.0× higher. This motivates a hybrid design: keep complex metadata management on the CPU, but avoid using the CPU for every I/O operation during the critical path.
That design stance is sensible. Moving all cache metadata logic onto the GPU would sound clean architecturally, but hash-table control flow is not naturally GPU-friendly. Tutti instead uses a "CPU-prepared, GPU-executed" model.
3.2 GPU-native KV cache object store
Tutti introduces a GPU-centric KV cache object store. The key idea is that the object abstraction should match how the inference engine uses KV cache.
The paper describes three main pieces:
- a GPU file pool, visible to the inference engine;
- an NVMe file pool, managed through GeminiFS-like physical storage extents;
- a P2P memory mapping table, which maps GPU-visible objects to physical storage locations.
Figure 4 illustrates this layout. Tutti maps GPU files to NVMe files through a tensor-stripe layout that follows the original KV tensor granularity instead of arbitrary fine-grained storage striping. The GPU file shape aligns with KV cache objects, roughly following the tensor dimensions needed for layer-wise movement.
The practical effect is that the inference engine can issue layer-wise retrieve_layer and store_layer operations, rather than constructing a huge number of independent tiny requests. The CPU still manages engine-visible mappings, allocation, and indexing. But once a batch of lightweight GPU I/O contexts is created, the GPU can execute the I/O operations concurrently.
The paper claims this reduces CPU overhead from O(layer × blocks) to O(layer) for the critical I/O preparation pattern. That is exactly the kind of asymptotic improvement one wants when sequence length and block count become huge.
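A minimal sketch of what that difference means for the CPU, using hypothetical helper names (submit_block_io and submit_layer_io are placeholders for illustration, not Tutti's API):

```python
def cpu_centric_submit(layers, blocks, submit_block_io):
    # CPU touches every (layer, block) pair: O(layer * blocks) submissions.
    for layer in range(layers):
        for block in range(blocks):
            submit_block_io(layer, block)

def gpu_centric_submit(layers, blocks, submit_layer_io):
    # CPU prepares one coarse I/O context per layer: O(layer) submissions;
    # the GPU expands each context into per-block NVMe commands itself.
    for layer in range(layers):
        submit_layer_io(layer, num_blocks=blocks)

calls = {"cpu": 0, "gpu": 0}
cpu_centric_submit(64, 2000, lambda *a, **k: calls.__setitem__("cpu", calls["cpu"] + 1))
gpu_centric_submit(64, 2000, lambda *a, **k: calls.__setitem__("gpu", calls["gpu"] + 1))
print(calls)  # {'cpu': 128000, 'gpu': 64}
```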
PRP vs SGL: why descriptor format matters
A very concrete detail in this section is the use of Scatter Gather Lists (SGL) instead of Physical Region Pages (PRP) for NVMe address description.
PRP uses fixed 4 KB pages. For large variable KV cache transfers, this can require many pointer pages and lots of address translation. The paper gives an example: describing a 60 GB KV cache on 80 GB HBM with PRP can require 15,728,640 pages, and if PRP list pages are allocated at 64 KB granularity, actual HBM usage can reach about 3.75 GB.
SGL is more compact for this use case. It can describe a large contiguous chunk with a 16-byte entry containing physical address, length, and identifier. In the same example, memory consumption drops to about 15 MB.
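The descriptor arithmetic behind those two numbers checks out; the 64 KB contiguous-chunk assumption on the SGL side is my inference from the reported ~15 MB, not a value stated explicitly:

```python
kv_bytes = 60 * 2**30          # 60 GB of KV cache resident in HBM

# PRP: one pointer per fixed 4 KB page.
prp_entries = kv_bytes // (4 * 2**10)
print(f"PRP entries: {prp_entries:,}")                      # 15,728,640 pages

# SGL: one 16-byte descriptor per contiguous chunk (assume 64 KB chunks).
sgl_entries = kv_bytes // (64 * 2**10)
print(f"SGL metadata: {sgl_entries * 16 / 2**20:.1f} MiB")   # ~15 MiB
```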
This is a nice systems detail because it shows that "GPU direct I/O" is not a single switch. The command descriptor format can be the difference between an elegant design and one that burns gigabytes of HBM for metadata.
3.3 GPU io_uring
The second component is GPU io_uring, written in the paper as gio_uring. It mirrors the CPU-side io_uring idea but moves asynchronous submission and completion into GPU-resident structures.
Figure 5 shows the architecture. The design has:
- submission queues (SQ) and completion queues (CQ) in GPU HBM;
- I/O control blocks (IOCBs), each containing many I/O contexts (IOCTXs);
- zero-copy ring buffers mapped to the CPU through non-cached mmap;
- CUDA events to preserve ordering under out-of-order stream execution;
- GPU kernels that issue NVMe commands and write completions without CPU participation per I/O.
The runtime flow starts with init_queue(depth), which sets up the GPU-resident submission/completion queues and the IOCB pool. GPU kernels then fill I/O contexts, ring the submission queue, and consume completions, with the CPU involved only in the initial coarse setup.
The central point is not just that I/O becomes asynchronous. Existing CPU-side asynchronous I/O can also be asynchronous. The more important point is that the GPU can issue and complete many I/O operations without the CPU repeatedly stepping into the path.
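To make the queue structure concrete, here is a toy host-side Python sketch of the submission/completion-ring idea only. The real gio_uring lives in GPU HBM and issues NVMe commands from CUDA kernels; nothing below attempts that, and all names are mine.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class IOCtx:
    op: str          # "read" or "write"
    gpu_file: int    # GPU-visible object id
    offset: int
    length: int

class ToyIoUring:
    """Toy SQ/CQ pair: submit many I/O contexts up front, reap completions later."""
    def __init__(self, depth: int):
        self.sq: deque = deque(maxlen=depth)
        self.cq: deque = deque(maxlen=depth)

    def submit(self, ctx: IOCtx):
        self.sq.append(ctx)          # producer does not wait for completion

    def process(self):
        while self.sq:               # stands in for the device draining the SQ
            self.cq.append(self.sq.popleft())

    def reap(self):
        done = list(self.cq)
        self.cq.clear()
        return done

ring = ToyIoUring(depth=256)
for blk in range(8):
    ring.submit(IOCtx(op="read", gpu_file=42, offset=blk * 80_000, length=80_000))
ring.process()
print(len(ring.reap()), "completions reaped without per-I/O synchronization")
```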
SM partitioning
A subtle challenge is that GPU I/O kernels can interfere with inference kernels. GPUs are not preemptive in the way CPUs are. A long-running I/O kernel can monopolize resources and delay latency-sensitive compute kernels.
Tutti uses NVIDIA green contexts to partition GPU resources into a compute domain and an I/O control domain. The I/O kernel runs on dedicated SM resources, reducing long-tail interference. This is not free: dedicating SMs to I/O means those SMs are not always available to model compute. But the paper's claim is that the determinism and overlap benefits outweigh this cost for the target workload.
This is one of the places where Tutti feels like a true inference-runtime design rather than a generic storage library. It is not enough to issue I/O quickly; the I/O must coexist with model execution.
3.4 Slack-aware I/O scheduling
The third component is the slack-aware I/O scheduler. This is where the paper addresses the question: if I/O happens on the GPU, when should it happen?
Figure 6 shows that naive concurrent reads and writes can collapse PCIe bandwidth. The paper reports a 60.1% bandwidth drop under concurrent read/write execution in the reproduced FIO-style experiment. The cause is not merely that bandwidth is split between reads and writes; the authors attribute it to contention for NVMe-internal resources, such as the SSD's internal cache.
Figure 7 then shows the scheduler design. Tutti profiles slack windows offline. A slack window is a region in execution where there are spare SM resources and where issuing I/O will not create harmful read/write contention. The lookup table is indexed by input length and prefix length, because attention cost changes with context length.
The scheduler follows several rules:
- Reads are prioritized during prefill because KV retrieval is on the critical path for prefix reuse.
- Writes are deferred when they would interfere with critical reads or TTFT.
- If a profiled slack window exists, the scheduler launches as many IOCBs as fit inside that window.
- If no suitable slack window exists, retrieval has become the bottleneck, so the scheduler launches required reads immediately to avoid blocking later compute.
- Remaining writes can be flushed during decode with a best-effort policy.
This scheduling policy is pragmatic. It does not try to derive a perfect online optimizer. Instead, it uses offline profiling to make runtime decisions cheap and predictable.
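A sketch of the decision logic as I read the rules above; the lookup-table contents and the per-IOCB cost model are assumptions for illustration, not values from the paper.

```python
def schedule_io(input_len, prefix_len, pending_reads, pending_writes, slack_table,
                iocb_cost_ms=1.0):
    """Pick which IOCBs to launch now, given a profiled slack window (in ms)."""
    slack_ms = slack_table.get((input_len, prefix_len), 0.0)
    launch = []

    # Reads first: prefix retrieval sits on the TTFT-critical prefill path.
    if slack_ms > 0:
        budget = int(slack_ms // iocb_cost_ms)
        launch += pending_reads[:budget]
        budget -= len(launch)
        launch += pending_writes[:max(budget, 0)]   # writes only if room remains
    else:
        # No usable slack: retrieval itself is the bottleneck, so issue reads
        # immediately and defer all writes to decode-time best effort.
        launch += pending_reads

    return launch

slack_table = {(128_000, 96_000): 8.0}              # assumed profiled window
print(schedule_io(128_000, 96_000, ["r0", "r1", "r2"], ["w0", "w1"], slack_table))
```

The point is that the runtime decision reduces to a table lookup plus a budget check, which is why offline profiling keeps it cheap and predictable.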
A useful way to view this design is:
```
bad pipeline:  model kernels | reads and writes issued whenever ready -> contention, GPU bubbles
good pipeline: model kernels | reads in profiled slack windows, writes deferred -> I/O hidden
```
3.5 Integration with vLLM and multi-GPU deployment
The paper reports an implementation of about 8,000 lines of C++ plus about 1,500 lines of Python integrated with vLLM's KVConnector. Tutti exposes retrieve_layer and store_layer operations, registers the pre-allocated KV memory block pool, identifies reusable prefixes, and maps logical KV blocks to GPU files.
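As a mental model of the integration surface, here is a hypothetical interface sketch. It borrows the paper's retrieve_layer/store_layer vocabulary, but the signatures and types are my invention, not vLLM's KVConnector API or Tutti's actual code.

```python
from typing import Protocol, Sequence

class KVCacheStore(Protocol):
    """Shape of the layer-wise store the serving engine talks to (hypothetical)."""

    def register_kv_pool(self, gpu_block_ptrs: Sequence[int], block_bytes: int) -> None:
        """Register the engine's pre-allocated KV block pool for P2P mapping."""

    def retrieve_layer(self, prefix_hash: str, layer: int,
                       dst_blocks: Sequence[int]) -> None:
        """Asynchronously load one layer's cached KV objects into GPU blocks."""

    def store_layer(self, prefix_hash: str, layer: int,
                    src_blocks: Sequence[int]) -> None:
        """Asynchronously persist one layer's KV objects to the NVMe file pool."""
```

Layer-wise calls like these are what allow retrieval for layer N+1 to overlap with compute on layer N.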
For multi-GPU deployments, the paper follows vLLM's one-process-per-GPU model. Each GPU process has its own Tutti instance and manages the KV cache for its GPU-resident layers. A local daemon allocates GPU memory and initializes dedicated NVMe submission/completion queues for each GPU. Because each GPU has an independent queue pair, GPUs can access local NVMe in parallel without inter-GPU queue contention.
The authors also discuss scale beyond a single node. Tutti remains the local high-performance path, while Mooncake provides distributed coordination for space allocation, replica metadata, and location lookup. The current remote path is not fully optimized: if the desired KV cache is remote, the prototype can read the GPU file through a CPU-side interface and transfer it via RDMA, but the authors leave a more direct GPU-driven remote path to future work.
That limitation matters for large production clusters. The local path is the strongest part of the paper. The distributed path is plausible but less mature.
4. Experiment setup
4.1 Hardware and storage configuration
The evaluation uses a server with:
- 64-core Intel Xeon 6530 CPU;
- 512 GB system memory;
- two NVIDIA H100 GPUs, each with 80 GB HBM;
- four Solidigm D7-PS1010 7.68 TB enterprise NVMe SSDs.
For tiered-storage configurations, the authors allocate 256 GB of host DRAM as pinned memory and provision a 14 TB SSD volume per GPU. The SSD experiments use a two-disk RAID-0 array for the raw bandwidth measurements.
This is high-end hardware, and that matters when interpreting the results. Tutti is designed for serious serving infrastructure, not commodity single-GPU desktops.
4.2 Models
The main single-GPU experiments use Llama3-8B. The multi-GPU scalability test uses GLM-4-9B-Chat-1M, which supports a 1M-token context window and is deployed across two GPUs through tensor parallelism.
The model choices are sensible. Llama3-8B is small enough to isolate serving-system behavior on one H100, while GLM-4-9B-Chat-1M stresses the long-context multi-GPU case that motivates SSD-backed KV capacity.
4.3 Workloads
The evaluation uses two long-context benchmarks:
- LEval, with 20 sub-tasks across law, finance, technology, academic papers, code, and other domains. Input lengths range from about 3K to 200K tokens.
- LooGLE, with four ultra-long-context tasks, many samples above 100K tokens, focused on long-dependency QA and single-turn summarization.
Because these datasets do not provide native request timestamps, the authors simulate arrivals using a Poisson process and use round-robin extraction across sub-datasets. This is a standard experimental compromise for serving evaluations, though it is still a synthetic traffic model.
4.4 Baselines
The baselines include:
- HBM: standard vLLM serving with HBM only.
- LMCache-DRAM-LW: host-memory extension with layer-wise compute/I/O pipelining.
- LMCache-SSD: NVMe SSD offload using memory copy and standard asynchronous I/O.
- LMCache-GDS: SSD access with GPU Direct Storage to avoid CPU bounce buffers.
- In ablations, LMCache-DRAM without layer-wise transfer is also shown.
The paper compares across two generations of vLLM: v0.12.0 and v0.17.0. This is useful because serving engines are improving quickly. A storage design that only works with a slower runtime may become obsolete as model execution becomes more efficient.
4.5 Metrics
The paper reports:
- TTFT for first-token responsiveness;
- ITL for decode smoothness;
- cache hit rate by storage tier;
- raw retrieve/store bandwidth;
- GPU bubble time;
- normalized serving cost per 1M tokens.
This metric set is well chosen. It connects the storage microbenchmarks to user-visible serving outcomes.
5. Results and analysis
5.1 Cache hit rates: Table 1 explains why SSD capacity matters
Table 1 reports cache hit rates across storage tiers:
| Storage medium | LEval hit rate | LooGLE hit rate |
|---|---|---|
| HBM | 8% | 4% |
| DRAM | 53% | 24% |
| SSD | 84% | 86% |
This table is one of the most important pieces of evidence in the paper. HBM alone has extremely low hit rates for long-context reuse. DRAM helps, but it is still insufficient for LooGLE. SSD capacity captures most reusable KV states.
The implication is that production systems face a real tradeoff:
```
HBM-only:   fast hits, but very few hits
HBM + DRAM: more hits, still capacity-limited for long contexts
HBM + SSD:  most reusable KV captured, but only if the retrieval path is fast
```
Tutti tries to make the third option viable. Without a fast SSD path, the high hit rate in Table 1 would not translate into good latency.
5.2 End-to-end TTFT and ITL: Figure 8
Figure 8 compares end-to-end TTFT and ITL on Llama3-8B across LEval and LooGLE, under both vLLM v0.12.0 and v0.17.0.
The paper reports several key findings:
- On LEval with the older vLLM version, Tutti improves TTFT over GDS by 71.8% at the highest load point.
- With the newer vLLM version, Tutti reduces TTFT by 69.1% versus DRAM and 78.3% versus GDS at high load.
- Under a 1-second TTFT SLO, Tutti increases effective request rate by 50% over DRAM and 100% over GDS.
- On LooGLE at 0.6 RPS in the newer version, GDS's TTFT is still about 2.63× Tutti's TTFT.
- At the same LooGLE load point, Tutti reduces TTFT by 93.2% versus DRAM and 62.0% versus GDS.
The ITL story is also positive:
- On LEval with the older runtime at 1.5 RPS, Tutti reduces ITL by 60.4% versus DRAM and 24.9% versus GDS.
- With the newer runtime at 1.5 RPS, Tutti still reduces ITL by 22.0% versus DRAM and 24.4% versus GDS.
- On LooGLE, the ITL gain narrows because much longer inputs make each token more compute-heavy, but Tutti remains consistently better.
My interpretation: Tutti helps most when storage latency is visible enough to hurt the pipeline but there is still enough compute to overlap against. When the task becomes extremely compute-heavy, the relative decode gain narrows. When the task becomes extremely retrieval-heavy, DRAM can sometimes regain an advantage. The practical sweet spot is broad, but it is not universal.
5.3 Retrieve and store bandwidth: Figure 9
Figure 9 isolates raw retrieve/store bandwidth across context lengths from 1K to 128K tokens.
For retrieval:
- Tutti scales smoothly and reaches up to 25.9 GB/s for longer contexts.
- LMCache-GDS saturates around 11.9 GB/s even with two SSDs.
- Tutti therefore achieves up to 2.08× higher retrieval bandwidth than LMCache-GDS.
- LMCache-DRAM shows instability, including a reported drop to 8.5 GB/s at 16K tokens due to memory fragmentation overhead.
For store bandwidth:
- LMCache-DRAM reaches up to 18.4 GB/s, but it is limited by DRAM capacity and lacks SSD persistence.
- Tutti sustains roughly 10 GB/s persistent write bandwidth, including 9.8 GB/s at 128K tokens.
- LMCache-GDS stays around 7 GB/s in the same dual-SSD configuration.
The key point is that retrieval bandwidth matters more for TTFT because cache hits are on the prefill critical path. Store bandwidth still matters, but writes can often be deferred and scheduled during slack windows.
5.4 PRP vs SGL: Figure 10
Figure 10 validates the descriptor choice. In a single-GPU-thread microbenchmark reading and writing 500 MB per operation:
| Descriptor path | Read bandwidth | Write bandwidth |
|---|---|---|
| PRP | 0.287 GB/s | 0.032 GB/s |
| SGL | 8.891 GB/s | 2.922 GB/s |
That is a 31.0× read improvement and a 91.3× write improvement from switching to SGL.
This is a striking result because it comes from a low-level design decision that many high-level serving discussions would skip. The lesson is that storage command structure can dominate performance when the workload is a flood of medium-sized GPU memory objects.
5.5 TTFT across prefix length: Figure 11
Figure 11 fixes total input length at 128K tokens and varies the cached prefix from 16K to 128K. This tests how the systems behave as prefix reuse increases.
The paper reports:
- At a 112K cached prefix, LMCache-SSD reaches 7.84 s TTFT.
- Tutti achieves 3.43 s at the same prefix, 2.28× faster than LMCache-SSD.
- Compared with LMCache-GDS, Tutti improves TTFT across prefix lengths, from 5.8% at 32K up to 61.4% at 128K.
- For moderate reuse from 16K to 96K, Tutti can match or exceed DRAM performance, with up to 13.4% improvement.
- At very high reuse above 96K, the workload becomes almost purely retrieval-bound, and DRAM can regain a lead; Tutti trails by at most 20.6%.
This result is nuanced and useful. It shows that Tutti is not magically faster than DRAM in all cases. Rather, it can beat DRAM when overlap and capacity effects matter more than raw media latency. When almost all useful work is retrieval and there is little compute left to hide I/O, DRAM's latency advantage reappears.
5.6 Multi-GPU scalability: Figure 12
Figure 12 evaluates GLM-4-9B-Chat-1M across two GPUs and four disks. The paper reports that at a 128K prefix, Tutti achieves 155.743 s TTFT, about 25% lower than LMCache-GDS at 207.12 s.
The more important qualitative result is that LMCache-GDS fails with out-of-memory errors at 512K and 640K prefix lengths. The paper attributes this to GDS's need for GPU memory staging buffers through cufile. Tutti avoids this staging-buffer overhead by directly managing GPU memory through its registered interfaces, so it completes the longer-prefix tests.
I would read this result as architectural evidence rather than only a latency number. Long-context systems are often limited by memory-management edge cases, not just average bandwidth. A storage path that requires extra GPU staging memory can become fragile exactly when the context length is most demanding.
5.7 Bubble time and the crossover point: Figure 13
Figure 13 decomposes latency into computation time and bubble time while varying cache hit rate. The idea is straightforward:
```
if T_compute > T_transfer:  retrieval hides behind prefill compute -> negligible bubble
else:                       the GPU waits on KV retrieval          -> visible bubble time
```
For LMCache-SSD, bubble time is large and cannot be hidden effectively. For Tutti, the paper reports that bubble time is negligible across most of the tested range, averaging 25 ms and dropping to 6 ms at a 93.75% hit rate. Tutti pushes the crossover point to a 98.3% cache hit rate.
This is perhaps the cleanest evidence for the scheduler. The bandwidth numbers show that the I/O path is faster; Figure 13 shows that the scheduler makes the faster path useful by hiding it behind compute.
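An illustrative, deliberately linear toy model of that crossover. The workload constants are made up, and the model ignores the non-linear attention costs that the real measurement reflects; it only shows why a faster retrieval path pushes the crossover hit rate higher.

```python
def crossover_hit_rate(full_prefill_s, full_kv_gb, retrieval_gbps):
    """Hit rate at which retrieving the cached prefix takes as long as recomputing it."""
    full_transfer_s = full_kv_gb / retrieval_gbps
    # Compute shrinks as (1 - h) * full_prefill_s; transfer grows as h * full_transfer_s.
    return full_prefill_s / (full_prefill_s + full_transfer_s)

# Made-up workload: 20 GB of KV for the full prefix, 10 s to prefill it from scratch.
for name, bw_gbps in [("GDS-like, 11.9 GB/s", 11.9), ("Tutti-like, 25.9 GB/s", 25.9)]:
    print(f"{name}: compute-bound up to a {crossover_hit_rate(10.0, 20.0, bw_gbps):.1%} hit rate")
```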
5.8 Cost: Figure 14
Figure 14 evaluates serving cost per 1M tokens. The cost model combines GPU cost, DRAM cost, SSD cost, and throughput:
```
cost per 1M tokens = (GPU cost + memory/storage cost) / throughput × 1,000,000
```
The paper uses typical cloud prices:
- $5/hour per NVIDIA H100;
- $0.0088/GB/hour for DRAM;
- $0.000082/GB/hour for NVMe SSD.
The paper reports that on LooGLE at 0.5 QPS, Tutti reduces serving cost by 66.2% compared with LMCache-SSD and by about 27% compared with LMCache-GDS.
The cost result follows naturally from the latency result. SSD capacity is cheap, but cheap storage is not enough if it leaves the GPU underutilized. Tutti's cost advantage comes from using cheap SSD capacity while keeping the expensive GPU busy.
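Plugging the listed prices into the formula shows why the GPU term dominates; the throughput value below is an assumed placeholder, not a number from the paper.

```python
GPU_PER_HOUR = 5.0          # $/H100/hour
DRAM_PER_GB_HOUR = 0.0088   # $/GB/hour
SSD_PER_GB_HOUR = 0.000082  # $/GB/hour

def cost_per_1m_tokens(tokens_per_sec, dram_gb=0, ssd_gb=0, gpus=1):
    hourly = gpus * GPU_PER_HOUR + dram_gb * DRAM_PER_GB_HOUR + ssd_gb * SSD_PER_GB_HOUR
    tokens_per_hour = tokens_per_sec * 3600
    return hourly / tokens_per_hour * 1_000_000

# Assumed throughput of 2,000 tokens/s on one H100; capacities follow the paper's setup.
print(f"DRAM tier (256 GB): ${cost_per_1m_tokens(2000, dram_gb=256):.3f} per 1M tokens")
print(f"SSD tier (14 TB):   ${cost_per_1m_tokens(2000, ssd_gb=14_000):.3f} per 1M tokens")
```

Even a full 14 TB SSD volume adds only about $1.15/hour next to a $5/hour GPU, which is why keeping the GPU busy matters far more than the storage bill.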
6. Limitations and boundary conditions
6.1 The design is hardware-specific
Tutti relies on high-end GPUs, NVMe SSDs, GPU direct access mechanisms, CUDA features, and careful SM partitioning. The evaluation hardware uses H100 GPUs and enterprise SSDs. The design may not translate directly to older GPUs, consumer SSDs, PCIe topologies with weaker peer-to-peer support, or cloud environments where low-level storage control is restricted.
This does not weaken the paper's contribution, but it narrows the deployment audience. Tutti is a data-center serving-system design.
6.2 Offline profiling is a practical dependency
The slack-aware scheduler depends on offline profiling of per-layer slack windows by input length and prefix length. This is reasonable for stable production deployments, but it adds operational complexity.
Profiles may need regeneration when any of the following changes:
- model architecture;
- tensor-parallel layout;
- GPU type;
- CUDA/runtime version;
- vLLM version;
- attention kernel implementation;
- SSD configuration;
- batch scheduling policy.
If the workload shifts far outside the profiled region, scheduler quality may degrade.
6.3 Distributed remote retrieval is not yet the main strength
The paper describes integration with Mooncake for distributed metadata and placement, but the current remote path still uses CPU-side reading into host memory followed by RDMA transfer. The authors explicitly leave a more direct GPU-driven remote path as future work.
That means the strongest evidence is for local SSD-backed KV reuse. A multi-node production service would still need careful engineering around cross-node placement, replica freshness, remote retrieval latency, and failure recovery.
6.4 The comparison depends on LMCache implementations
The paper compares against LMCache-SSD and LMCache-GDS as state-of-the-art baselines. These are fair baselines, but storage systems evolve quickly. A future LMCache version with deeper GPU-side submission or better scheduling could reduce the gap.
The most durable contribution is therefore not the exact percentage improvement over one baseline version. It is the architectural argument: CPU-centric I/O is a poor match for fragmented KV-cache retrieval at long context lengths.
6.5 Cache hit assumptions matter
Tutti is valuable when there is enough prefix reuse to justify persistent KV storage. If a workload has low reuse, high cache churn, or privacy rules that prevent sharing KV cache across sessions, SSD capacity may not translate into useful hits.
The system also has to decide what to keep, what to evict, and how to avoid retaining stale or low-value prefixes. The paper focuses on making SSD-backed reuse fast; it does not fully solve cache admission policy, tenant isolation, semantic reuse policy, or privacy boundaries.
6.6 The code availability story is not fully clear from the paper
The abstract calls Tutti an open-source SSD-backed KV caching solution, but the paper text available in the PDF does not provide a clear repository URL in the main body. For reproducibility, a reader would need the code, build instructions, compatible vLLM versions, kernel/driver requirements, SSD setup scripts, and profiling workflow.
Without those details, reproducing the full system from the paper alone would be difficult.
7. Reproducibility and practical notes
7.1 What I would need to reproduce the evaluation
To reproduce the paper convincingly, I would want:
- the Tutti C++ and Python code;
- exact vLLM branches for v0.12.0 and v0.17.0 integration;
- LMCache versions matching the paper, including LMCache 0.3.9 and 0.4.1 where shown;
- CUDA, driver, GDS, and NVMe driver versions;
- GeminiFS or equivalent GPU file-system setup;
- SSD RAID-0 setup and queue configuration;
- profiling scripts for slack-window lookup tables;
- workload preprocessing for LEval and LooGLE;
- traffic generator code for Poisson arrivals;
- exact SLO definitions and dropped-data-point rules.
The paper gives enough high-level information to understand the design and results, but not enough by itself to rebuild the system confidently.
7.2 Production rollout checklist
If I were evaluating Tutti-like ideas for production, I would check the following before deployment.
Workload fit
- Is prefix reuse high enough to justify persistent KV cache?
- Are long-context prompts common, or only rare tail events?
- Are privacy and tenant-isolation requirements compatible with KV reuse?
- Does the traffic distribution resemble LEval/LooGLE-style long-context workloads, or something very different?
Hardware fit
- Do GPUs and SSDs sit under a favorable PCIe topology?
- Is peer-to-peer DMA available and stable?
- Are SSDs enterprise-grade enough to sustain mixed read/write load?
- Can the runtime reserve SMs for I/O without hurting compute throughput too much?
Runtime fit
- Which vLLM version is used?
- Does the serving stack expose KVConnector-style integration points?
- Can the team maintain CUDA/C++ storage kernels safely?
- How often will offline slack profiles need to be regenerated?
Operational fit
- What happens when an SSD fails?
- How are cached KV objects evicted, replicated, and invalidated?
- How does the system monitor GPU bubbles, SSD queue depth, and cache hit quality?
- Can the system fall back to DRAM or recomputation if the SSD tier misbehaves?
7.3 The broader systems lesson
The paper's broader lesson is that LLM serving is becoming a full-stack systems problem. The boundary between model execution and storage is disappearing.
Earlier inference systems could treat storage as a supporting component. Load weights, store logs, maybe swap some memory. Long-context serving changes this. KV cache can be larger than model weights, and cache movement can dominate the latency path. At that point, the storage stack must understand the model's execution schedule.
Tutti is a good example of this new design style:
```
model structure        -> layer-wise KV cache objects in the store
GPU execution          -> GPU-issued asynchronous NVMe I/O (gio_uring)
profiled kernel slack  -> slack-aware scheduling of reads and writes
```
The paper is strongest when it shows that no single trick is enough. GDS alone removes a copy but not CPU control overhead. SSD capacity alone increases hit rate but not latency. Asynchronous I/O alone can still interfere with compute. The final gain comes from aligning all pieces around the real dataflow.
8. My takeaways
- SSD-backed KV cache is necessary but not automatically useful. Table 1 shows that SSD capacity can capture far more reusable KV state than HBM or DRAM. But Figure 2 shows that naive SSD retrieval can be slower than recomputation.
- The CPU control path is the bottleneck for fragmented KV I/O. GDS removes a data copy, but if every fragmented I/O still depends on CPU submission, long-context retrieval remains bottlenecked.
- The right abstraction is a KV object, not a generic block. Tutti's object store matters because it aligns storage operations with layer-wise KV movement.
- Scheduling matters as much as bandwidth. Figure 13 is the key evidence: Tutti does not merely improve raw GB/s; it hides I/O behind compute and keeps visible bubble time small.
- The design has real deployment complexity. Tutti requires kernel-level, CUDA-level, and serving-runtime integration. It is powerful, but it is not a drop-in Python cache.
- The paper points toward future LLM serving stacks. I expect more systems to move in this direction: GPU-controlled data movement, model-aware storage layout, and scheduling policies that jointly optimize compute and I/O.
References and follow-up reading
- Qiu et al., Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving, arXiv:2605.03375, 2026.
- Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP 2023.
- Liu et al., CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving, SIGCOMM 2024.
- Qin et al., Mooncake: Kimi's KVCache-centric Architecture for LLM Serving, arXiv:2407.00079, 2024.
- Qiu et al., GeminiFS: A Companion File System for GPUs, FAST 2025.
- Qureshi et al., GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture, ASPLOS 2023.
- Chen et al., IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference, FAST 2025.
- Gao et al., Fast State Restoration in LLM Serving with HCache, EuroSys 2025.
Review written on 2026-05-10.