1. What This Paper Does
Ring Attention solves one of the most stubborn problems in modern deep learning: the memory wall that prevents Transformers from processing long sequences. Even with memory-efficient attention (FlashAttention) and blockwise computation, the output activations of each Transformer layer must be stored, and their size is proportional to the sequence length. For 100 million tokens with hidden size 1024, a single layer's output is roughly 200 GB, and storing these outputs across layers quickly exceeds 1,000 GB — far beyond any single GPU or TPU.
The key insight is elegant: if you compute self-attention in a blockwise fashion (block-by-block), the order in which you process key-value blocks does not matter, as long as you combine the statistics correctly. This permutation invariance means you can place devices in a ring, have each device hold one query block, and rotate key-value blocks around the ring. While a device computes attention against the current key-value block, it simultaneously sends that block to the next device and receives a new block from the previous device. If the computation takes longer than the communication, the communication is completely hidden — zero overhead.
The result: context length scales linearly with the number of devices. On 1024 TPUv4 chips, Ring Attention trains with over 16 million tokens of context. On 512 TPUv3 chips, it achieves 2 million tokens. This is not approximate attention — it computes exact full self-attention. And it composes naturally with existing parallelism strategies (FSDP, tensor parallelism).
This paper is foundational because it removes the memory constraint from the device level to the cluster level, making context length an engineering problem of adding more devices, not a fundamental architectural limitation.
2. Prerequisites: Everything You Need to Understand This Paper
This section covers all the background knowledge needed to understand Ring Attention. If you are already familiar with these topics, skip ahead. If not, read carefully — each concept builds toward the paper's contribution.
2.1 Self-Attention: The Core Mechanism
The Transformer architecture (Vaswani et al., 2017) is built around self-attention, which allows every token in a sequence to attend to (look at) every other token. Given an input sequence of n tokens, each with dimension d, we form three matrices via learned linear projections:
- Queries Q: "What am I looking for?"
- Keys K: "What do I contain?"
- Values V: "What information do I carry?"
The attention computation is:

Attention(Q, K, V) = softmax(Q K^T / √d) V

The matrix Q K^T has shape n × n — every query attends to every key. This is why attention is O(n²) in both time and memory. For n = 1,024, that is about 1 million entries. For n = 100K, it is 10 billion entries. For n = 1M, it is 1 trillion entries. This quadratic scaling is the central bottleneck.
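To make the quadratic object concrete, here is a minimal NumPy sketch of full attention. It is illustrative only (not the paper's implementation), and the `naive_attention` name is ours:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Full self-attention, materializing the entire n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # shape (n, n): the quadratic object
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n, d = 1024, 64
Q, K, V = rng.normal(size=(3, n, d))
out = naive_attention(Q, K, V)                       # shape (n, d)
# The score matrix alone holds n*n = 1,048,576 entries at n = 1024.
```

At n = 1M, that intermediate would hold 10^12 entries, which is why the rest of the paper is about never materializing it.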
2.2 The Feedforward Network
After self-attention, each Transformer layer applies a position-wise feedforward network (FFN) to each token independently:

FFN(x) = W_2 σ(W_1 x + b_1) + b_2

With intermediate dimension 4h (where h is the model's hidden size), the FFN stores intermediate activations of shape (batch, seq, 4h). For long sequences, this is also a significant memory consumer.
2.3 Why Long Context Matters
Many real-world tasks require processing long sequences:
- Books and documents: A novel can be 100K+ tokens
- Code repositories: Understanding a codebase requires seeing many files together
- Video understanding: A 1-hour video at 30fps generates millions of tokens
- Scientific data: Gene sequences, protein structures, climate simulations
- Reinforcement learning: Long trajectories of experience (states, actions, rewards)
Current hardware (GPUs/TPUs with 16–80 GB of high-bandwidth memory) severely limits how long a sequence can be. Most practical training runs cap context at 2K–32K tokens, leaving enormous potential untapped.
2.4 Memory Analysis of Standard Transformers
Let us trace exactly where memory goes in a Transformer layer. Using bfloat16 (2 bytes per element):
- Attention matrix: Q K^T has shape (b, a, s, s) (batch × heads × seq × seq). In bfloat16, this is 2·b·a·s² bytes. For b = 1, a = 32 heads, s = 100K: this is 640 GB just for the attention matrix.
- FFN intermediate: Shape (b, s, 4h). For b = 1, s = 100K, h = 1024: about 0.8 GB.
- Layer output: Shape (b, s, h). For b = 1, s = 100K, h = 1024: about 0.2 GB. This output must be stored because the next layer needs to attend over all of it.
The attention matrix is the primary bottleneck. But even if we eliminate it (which memory-efficient attention does), the layer output still scales linearly with s and must be stored. For s = 100M tokens with h = 1024, the output alone is about 200 GB per layer (over 1,000 GB across a handful of layers). This is the fundamental wall that Ring Attention addresses.
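The accounting in this section can be captured in a few lines. This is a hypothetical helper of our own (decimal GB, bfloat16's 2 bytes per element), not code from the paper:

```python
def transformer_layer_memory_gb(s, h, heads, batch=1, bytes_per_elem=2):
    """Rough per-layer activation sizes in decimal GB (bfloat16 by default)."""
    attn_matrix = bytes_per_elem * batch * heads * s * s   # (b, a, s, s)
    ffn_intermediate = bytes_per_elem * batch * s * 4 * h  # (b, s, 4h)
    layer_output = bytes_per_elem * batch * s * h          # (b, s, h)
    gb = 1e9
    return attn_matrix / gb, ffn_intermediate / gb, layer_output / gb

# s = 100K tokens, h = 1024, 32 heads: the attention matrix dominates.
attn_gb, ffn_gb, out_gb = transformer_layer_memory_gb(s=100_000, h=1024, heads=32)
# attn_gb = 640.0, ffn_gb ~ 0.82, out_gb ~ 0.2
```

Rerunning with s = 100M shows the layer output alone at about 205 GB, the wall described above.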
2.5 Memory-Efficient Attention (Rabe & Staats, 2021)
The key observation by Rabe and Staats is that you do not need to materialize the full attention matrix. Instead, you can compute attention one row (or one block of rows) at a time. For each query position, compute its attention scores against all keys, apply softmax, and accumulate the weighted sum of values. Then move to the next query position.
This reduces the memory from O(s²) to O(s) — the attention matrix is never fully materialized. However, all the key-value pairs must still be accessible (stored in memory) for each query block.
The peak activation memory per layer is then dominated by the layer output (2bsh bytes) plus a small working set proportional to c, the block size used for tiling the computation.
2.6 FlashAttention (Dao et al., 2022)
FlashAttention is the GPU-optimized implementation of memory-efficient attention. It uses tiling to break the computation into blocks that fit in SRAM (the fast on-chip memory of a GPU), avoiding slow reads/writes to HBM (high-bandwidth memory, the GPU's main memory). It achieves the same memory savings as memory-efficient attention but with much better hardware utilization.
Key ideas:
- Online softmax: Compute softmax incrementally across blocks using running max and sum statistics
- Tiled computation: Process blocks of queries and key-value pairs at a time
- No materialization: Never store the full attention matrix in HBM
FlashAttention is a single-device optimization. It makes attention memory-efficient on one GPU but does not solve the cross-device distribution problem. Ring Attention builds on top of it.
2.7 Blockwise Parallel Transformers (BPT) — The Critical Foundation
Blockwise Parallel Transformers (Liu & Abbeel, 2023) go one step further than memory-efficient attention. The insight: not only can attention be computed block-by-block, but the feedforward network can also be fused into the same blockwise loop. Instead of:
- Compute all attention outputs (requires storing the entire output: b·s·h elements)
- Then compute all FFN outputs
BPT does:
- For each query block: compute attention output for that block, immediately apply FFN to that block, then discard
This reduces the maximum activation inside each layer from O(b·a·s²) bytes (vanilla) or O(b·s·h) bytes (memory-efficient attention only) to O(b·c·h) bytes. The block size c is independent of the sequence length s.
This is the foundation Ring Attention builds upon. BPT shrinks the memory used inside each layer, but each layer's output (2bsh bytes) must still be stored, and that grows linearly with s. For 100M tokens, the stored layer outputs still total over 1,000 GB.
2.8 Gradient Checkpointing (Activation Checkpointing)
During training, we need to store intermediate activations for the backward pass. Gradient checkpointing (Chen et al., 2016) trades memory for compute: instead of storing all activations, store only selected checkpoints and recompute the rest during the backward pass.
In the context of Ring Attention, full gradient checkpointing is applied to both attention and feedforward. This means each layer only stores its output for the backward pass, and all internal computations are recomputed as needed. The layer output (2bsh bytes) is the irreducible memory cost.
2.9 Ring Topology in Distributed Computing
A ring topology is a communication pattern where devices are arranged in a circle: device 1 → device 2 → ... → device N → device 1. Each device communicates only with its immediate neighbors (one send, one receive).
This topology is widely used in high-performance computing:
- Ring AllReduce (Baidu, 2017; Horovod): Aggregate gradients across devices by passing partial sums around the ring
- Point-to-point communication: Low-bandwidth overhead since each device only talks to 2 neighbors
- Natural pipeline: Data flows around the ring, visiting each device exactly once
The bandwidth requirement is modest: only the link between adjacent devices matters, not the total bisection bandwidth of the network.
2.10 Communication-Computation Overlap
The holy grail of distributed computing is to hide communication latency behind useful computation. If a device can compute on data it already has while simultaneously sending/receiving data it will need next, the communication is "free."
The condition for perfect overlap is:

time to compute on the current block ≥ time to transfer the next block

Or equivalently, the arithmetic intensity (FLOPs per byte transferred) must be at least the device's FLOPS-to-bandwidth ratio. This is exactly what Ring Attention exploits.
2.11 FSDP (Fully Sharded Data Parallel)
FSDP shards model parameters across devices, so each device holds only a fraction of the parameters. During the forward pass, parameters are gathered from other devices as needed; during backward, they are released. This reduces parameter memory per device from the full parameter count P to P/N (plus transient gather buffers).
FSDP is orthogonal to Ring Attention. Ring Attention distributes the sequence dimension, while FSDP distributes the model parameters. They compose naturally: use FSDP to fit the model, then use Ring Attention to extend the context.
3. The Ring Attention Method
Now we have all the pieces. Let us build Ring Attention step by step.
3.1 Problem Setup
We have:
- N devices (hosts) arranged in a ring
- An input sequence of length s, which is split into N blocks of size c = s/N
- Each device i holds one block of queries Q_i, keys K_i, and values V_i
The goal: compute exact self-attention as if all tokens were on a single device, but without any device ever holding more than a handful of c-token blocks of data.
3.2 Why Blockwise Attention Enables This
Recall that in blockwise attention, we compute, for each query block Q_i:

Attention(Q_i, K, V) = softmax(Q_i K^T / √d) V

The critical property: this computation can be done incrementally. We process one key-value block at a time, maintaining running statistics (max scores and sum of exponentials) to correctly compute the softmax across all blocks. The final result is identical regardless of the order in which key-value blocks are processed.
Formally, for a query block Q_i and key-value block pair (K_j, V_j):
- Compute local attention scores: S_ij = Q_i K_j^T / √d
- Compute local max: m_ij = rowmax(S_ij)
- Compute local exponential sum: ℓ_ij = rowsum(exp(S_ij − m_ij))
- Compute local weighted values: O_ij = exp(S_ij − m_ij) V_j
These local statistics can be combined across blocks using the online softmax trick: with running statistics (m_i, ℓ_i, O_i), set m_new = max(m_i, m_ij), then update ℓ_i ← exp(m_i − m_new)·ℓ_i + exp(m_ij − m_new)·ℓ_ij and O_i ← exp(m_i − m_new)·O_i + exp(m_ij − m_new)·O_ij. After the last block, the attention output is O_i / ℓ_i.
Permutation invariance: Because we rescale all intermediate results relative to the global maximum, the order of processing key-value blocks does not matter. This is the mathematical property that makes Ring Attention possible.
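A small NumPy sketch of the local statistics and the merge rule (the names `block_stats`, `merge`, and `blockwise_attention` are ours, not the paper's). Reversing the key-value block order leaves the output unchanged, which is exactly the permutation invariance the ring relies on:

```python
import numpy as np

def block_stats(Qi, Kj, Vj):
    """Local statistics for one (query block, key-value block) pair."""
    d = Qi.shape[-1]
    S = Qi @ Kj.T / np.sqrt(d)                       # local scores
    m = S.max(axis=-1, keepdims=True)                # local row max
    p = np.exp(S - m)
    return m, p.sum(axis=-1, keepdims=True), p @ Vj  # (max, exp-sum, weighted values)

def merge(m, l, O, m2, l2, O2):
    """Online-softmax combination of running and new block statistics."""
    m_new = np.maximum(m, m2)
    a, b = np.exp(m - m_new), np.exp(m2 - m_new)     # rescale factors
    return m_new, a * l + b * l2, a * O + b * O2

def blockwise_attention(Qi, kv_blocks):
    m = np.full((Qi.shape[0], 1), -np.inf)
    l = np.zeros((Qi.shape[0], 1))
    O = np.zeros_like(Qi)
    for Kj, Vj in kv_blocks:
        m, l, O = merge(m, l, O, *block_stats(Qi, Kj, Vj))
    return O / l                                     # normalize once at the end

rng = np.random.default_rng(1)
Qi = rng.normal(size=(8, 16))
kv_blocks = [(rng.normal(size=(8, 16)), rng.normal(size=(8, 16))) for _ in range(4)]
fwd = blockwise_attention(Qi, kv_blocks)
rev = blockwise_attention(Qi, kv_blocks[::-1])       # reversed processing order
# fwd and rev are identical: block order does not matter.
```

Both orderings also match the output of ordinary full-matrix attention over the concatenated blocks.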
3.3 The Ring Communication Protocol
Here is how Ring Attention works, step by step:
Initialization:
- Split the input sequence into N blocks, one per device
- Device i computes Q_i, K_i, V_i from its local input block (no communication needed)
For each Transformer layer:
- Round 0: Each device i computes blockwise attention between Q_i and its local (K_i, V_i)
- Simultaneously, device i sends (K_i, V_i) to device i+1 and receives (K_{i-1}, V_{i-1}) from device i-1
- Round 1: Each device computes attention between Q_i and the newly received key-value block, updating running statistics
- Simultaneously, it sends the block it just used onward and receives the next block
- ...
- Round N-1: The final key-value block arrives. Computation completes.
- Each device applies the feedforward network to its local attention output (no communication needed)
After N rounds, every device has seen all N key-value blocks. The attention output on each device is identical to what single-device full attention would produce.
3.4 Algorithm (Pseudocode)
- Input: sequence x, split across N hosts; host i computes local Q_i, K_i, V_i from its block
- For r = 0 … N−1: overlap (a) blockwise attention of Q_i against the currently held (K, V), updating the running statistics (m_i, ℓ_i, O_i), with (b) an asynchronous send of that (K, V) to the next host and receive from the previous host
- After N rounds: normalize O_i by ℓ_i, then apply the blockwise FFN to the local output block
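The protocol can be simulated in a single process, with a Python list standing in for the ring of devices. This is an illustrative sketch of ours (the real implementation uses JAX collectives such as `lax.ppermute`, and actually overlaps the transfer with compute):

```python
import numpy as np

def ring_attention_sim(Q_blocks, K_blocks, V_blocks):
    """Single-process simulation of the ring protocol.

    Rotating the lists stands in for the send/receive that the real
    implementation overlaps with computation.
    """
    N = len(Q_blocks)
    d = Q_blocks[0].shape[-1]
    m = [np.full((q.shape[0], 1), -np.inf) for q in Q_blocks]  # running max
    l = [np.zeros((q.shape[0], 1)) for q in Q_blocks]          # running exp-sum
    O = [np.zeros_like(q) for q in Q_blocks]                   # running output
    K_cur, V_cur = list(K_blocks), list(V_blocks)
    for _ in range(N):                       # N rounds around the ring
        for i in range(N):                   # "in parallel" on each device
            S = Q_blocks[i] @ K_cur[i].T / np.sqrt(d)
            m2 = S.max(axis=-1, keepdims=True)
            p = np.exp(S - m2)
            m_new = np.maximum(m[i], m2)
            a, b = np.exp(m[i] - m_new), np.exp(m2 - m_new)
            l[i] = a * l[i] + b * p.sum(axis=-1, keepdims=True)
            O[i] = a * O[i] + b * (p @ V_cur[i])
            m[i] = m_new
        K_cur = K_cur[-1:] + K_cur[:-1]      # device i now holds device i-1's block
        V_cur = V_cur[-1:] + V_cur[:-1]
    return [o / li for o, li in zip(O, l)]

rng = np.random.default_rng(2)
N, c, d = 4, 8, 16
Qb = [rng.normal(size=(c, d)) for _ in range(N)]
Kb = [rng.normal(size=(c, d)) for _ in range(N)]
Vb = [rng.normal(size=(c, d)) for _ in range(N)]
ring_out = np.concatenate(ring_attention_sim(Qb, Kb, Vb))
# ring_out matches full attention over the concatenated 32-token sequence.
```

The concatenated per-device outputs agree with single-device full attention, which is the exactness claim of the paper in miniature.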
3.5 Arithmetic Intensity Analysis — When Is Communication Free?
This is the critical engineering question: under what conditions does computation perfectly overlap communication?
Computation cost per block: Computing attention between a query block of size c × d and a key-value block of size c × d requires:
- 2c²d FLOPs for Q_i K_j^T (matrix multiply: c × d times d × c)
- 2c²d FLOPs for multiplying attention scores by V_j (c × c times c × d)
- Total: 4c²d FLOPs
Communication cost per block: Sending one key block and one value block requires:
- 2cd bytes for K (shape c × d, in bfloat16: 2 bytes per element)
- 2cd bytes for V
- Total: 4cd bytes
Overlap condition: Computation time ≥ Communication time:

4c²d / F ≥ 4cd / B

where F is device FLOPS and B is interconnect bandwidth. Simplifying:

c ≥ F / B

This means: the block size must be at least the ratio of compute FLOPS to communication bandwidth.
Practical numbers (from the paper):
| Device | FLOPS (TFLOPS) | Bandwidth (GB/s) | Min Block Size (c ≥ F/B) | Min Seq Length per Device |
|---|---|---|---|---|
| A100 NVLink | 312 | 300 | 1,000 | 6,200 |
| A100 InfiniBand | 312 | 12.5 | 24,500 | 149,500 |
| TPUv3 | 123 | 112 | 1,100 | 6,600 |
| TPUv4 | 275 | 268 | 1,000 | 6,200 |
| TPUv5e | 196 | 186 | 1,100 | 6,300 |
Key insight: For high-bandwidth interconnects (NVLink, TPU ICI), the minimum block size is only ~1K tokens — very easy to satisfy. For lower-bandwidth InfiniBand, it is ~25K, which is still practical for long-context training.
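The overlap condition c ≥ F/B is easy to check numerically. A hypothetical helper of ours, using the device numbers from the table above:

```python
def min_block_size(peak_tflops, bandwidth_gb_s):
    """Smallest block size c (in tokens) satisfying the overlap condition c >= F/B."""
    return peak_tflops * 1e12 / (bandwidth_gb_s * 1e9)

c_nvlink = min_block_size(312, 300)        # A100 + NVLink: ~1,040 tokens
c_infiniband = min_block_size(312, 12.5)   # A100 + InfiniBand: ~24,960 tokens
c_tpuv4 = min_block_size(275, 268)         # TPUv4 ICI: ~1,030 tokens
```

These reproduce the ~1K and ~25K thresholds quoted above.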
3.6 Memory Analysis
Each device needs to store:
- 1 query block: 2bch bytes
- 2 key-value blocks (current): 4bch bytes
- 2 key-value blocks (receiving buffer): 4bch bytes
- 1 output block: 2bch bytes
Total: 12bch bytes per layer (six c-token blocks).
Including the blockwise feedforward (whose per-block intermediate of shape (b, c, 4h) is 8bch bytes), the total maximum activation size remains a small constant multiple of bch bytes, independent of s.
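A sketch of this per-device accounting (a hypothetical helper of ours: six blocks of shape (b, c, h) at 2 bytes per element for the attention phase):

```python
def ring_attention_activation_bytes(b, c, h, bytes_per_elem=2):
    """Attention-phase blocks resident on one device (bfloat16 by default)."""
    query = b * c * h               # 1 query block
    kv_current = 2 * b * c * h      # current K and V blocks
    kv_buffer = 2 * b * c * h       # receive buffers for incoming K and V
    output = b * c * h              # 1 output block
    return bytes_per_elem * (query + kv_current + kv_buffer + output)  # = 12*b*c*h

# Only the block size c appears: the total sequence length s never does.
mem = ring_attention_activation_bytes(b=1, c=4096, h=4096)   # ~0.2 GB
```

Doubling the cluster size halves c and halves this number, while the total context the cluster can handle stays the same or grows.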
Comparison with prior work (per layer, bfloat16):
| Method | Self-Attention | FFN | Total |
|---|---|---|---|
| Vanilla Transformer | |||
| Memory-efficient attention | |||
| BPT (mem-eff attn + FFN) | |||
| Ring Attention |
The crucial difference: Ring Attention's memory is proportional to c (the block size), not s (the full sequence length). Memory per device is independent of total sequence length — it depends only on the block size, which is the sequence length divided by the number of devices.
3.7 Causal Masking in the Ring
For autoregressive (causal) language models, tokens can only attend to earlier tokens. In Ring Attention, this means that when device i processes key-value block j, some or all entries may be masked:
- If j > i (key-value block is from a later position): the entire block is masked — skip computation entirely
- If j = i (the diagonal block): apply the causal mask within the block
- If j < i: no masking needed — all keys are from earlier positions
This optimization roughly halves the computation for causal models, since about half the key-value blocks are fully masked and can be skipped.
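The per-block decision reduces to comparing block indices. An illustrative sketch of ours (not the paper's code), counting how many of the N² block pairs are skippable:

```python
def block_mask_mode(i, j):
    """How device i treats key-value block j under causal masking."""
    if j > i:
        return "skip"    # all keys are in the future: skip the block entirely
    if j == i:
        return "causal"  # diagonal block: mask within the block
    return "full"        # all keys are in the past: no mask needed

N = 8
modes = [block_mask_mode(i, j) for i in range(N) for j in range(N)]
skipped = modes.count("skip")   # N*(N-1)/2 = 28 of 64 block pairs skipped
```

As N grows, the skipped fraction (N−1)/(2N) approaches one half, matching the roughly 2× saving quoted above.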
3.8 Backward Pass
The backward pass uses the same ring communication pattern. During backpropagation:
- Gradients dQ, dK, dV are computed incrementally, block by block
- Key-value blocks (and their gradient buffers) rotate around the ring
- Communication-computation overlap works identically to the forward pass
The JAX implementation uses `custom_vjp` to define both forward and backward passes, with `lax.ppermute` for ring communication and `lax.scan` for the iterative loop over key-value blocks.
4. Experimental Setup
4.1 Model Configuration
All experiments use the LLaMA architecture (Touvron et al., 2023):
- Model sizes: 3B, 7B, 13B, 30B, 65B parameters
- Standard Transformer with rotary positional encoding, SwiGLU FFN, RMSNorm
4.2 Baselines
Three baselines, representing the state-of-the-art progression:
- Vanilla Transformer: Materializes the full attention matrix
- Memory-efficient attention: Blockwise attention computation (Rabe & Staats, 2021), plus FlashAttention (Dao et al., 2022) on GPU
- BPT (Memory-efficient attention + FFN): Blockwise Parallel Transformer (Liu & Abbeel, 2023) — both attention and feedforward computed blockwise
4.3 Hardware
- GPU: Single DGX A100 (8 GPUs with NVLink); 32 A100 GPUs with InfiniBand
- TPU: TPUv3-512, TPUv4-1024, TPUv5e-256
4.4 Training Configuration
- Full gradient checkpointing on both attention and FFN
- FSDP (Fully Sharded Data Parallel) used for all methods
- Full precision (float32) on GPU; bfloat16 matmul with float32 accumulation on TPU
- No tensor parallelism in context length experiments (to isolate Ring Attention's contribution)
5. Experimental Results
5.1 Maximum Context Length
This is the headline result. Ring Attention extends context length by exactly the device count factor:
8× A100 (NVLink):
| Model | Vanilla | Mem-Eff Attn | BPT | Ring Attention | Improvement |
|---|---|---|---|---|---|
| 3B | 4K | 32K | 64K | 512K | 8× |
| 7B | 2K | 16K | 32K | 256K | 8× |
| 13B | 2K | 4K | 16K | 128K | 8× |
32× A100 (InfiniBand):
| Model | Vanilla | Mem-Eff Attn | BPT | Ring Attention | Improvement |
|---|---|---|---|---|---|
| 7B | 4K | 64K | 128K | 4,096K (4M) | 32× |
| 13B | 4K | 32K | 64K | 2,048K (2M) | 32× |
TPUv3-512:
| Model | Vanilla | Mem-Eff Attn | BPT | Ring Attention | Improvement |
|---|---|---|---|---|---|
| 7B | 1K | 4K | 8K | 2,048K (2M) | 256× |
| 13B | 1K | 2K | 8K | 1,024K (1M) | 128× |
TPUv4-1024:
| Model | Vanilla | Mem-Eff Attn | BPT | Ring Attention | Improvement |
|---|---|---|---|---|---|
| 3B | 8K | 16K | 32K | 16,384K (16M) | 512× |
| 7B | 4K | 8K | 16K | 8,192K (8M) | 512× |
| 13B | 4K | 8K | 16K | 4,096K (4M) | 256× |
| 30B | 2K | 4K | 8K | 2,048K (2M) | 256× |
TPUv5e-256:
| Model | Vanilla | Mem-Eff Attn | BPT | Ring Attention | Improvement |
|---|---|---|---|---|---|
| 3B | 4K | 8K | 32K | 4,096K (4M) | 128× |
| 7B | 2K | 8K | 16K | 2,048K (2M) | 128× |
Key observations:
- Ring Attention consistently achieves an improvement over BPT proportional to the device count, confirming the theoretical linear scaling
- On TPUv4-1024, a 3B model achieves 16 million tokens of context — this was unprecedented at publication time
- The improvement is uniform across model sizes, confirming that Ring Attention is model-size agnostic
- Even on lower-bandwidth InfiniBand (32× A100), the approach works — the minimum block size is larger but still practical
5.2 Model FLOPS Utilization (MFU)
MFU measures what fraction of the hardware's theoretical peak FLOPS is actually used for useful computation. This is the critical efficiency metric.
| Config | Model | Compute | BPT Context | Ring Attention Context | Expected MFU | Actual MFU |
|---|---|---|---|---|---|---|
| 1 | 7B | 8× A100 | 32K | 256K | — | Comparable to BPT |
| 2 | 13B | 8× A100 | 16K | 128K | — | Comparable to BPT |
| 3 | 13B | 32× A100 | 64K | 2,048K | — | Comparable to BPT |
| 4 | 30B | TPUv4-1024 | 16K | 2,048K | — | Comparable to BPT |
| 5 | 65B | TPUv4-1024 | 8K | 1,024K | — | Comparable to BPT |
The paper notes that Ring Attention's MFU is slightly lower than BPT's because the longer context means a higher proportion of FLOPs go to self-attention (which has lower arithmetic intensity than FFN). However, this is inherent to the problem — you are doing more attention because you have more context — not an overhead of Ring Attention itself.
Key finding: Ring Attention adds negligible overhead to MFU. The communication is fully overlapped with computation.
5.3 Reinforcement Learning: ExoRL Benchmark
Ring Attention enables conditioning on longer experience trajectories in offline RL. The experiment uses the ExoRL benchmark with the Agentic Transformer (AT) architecture.
Each trajectory has 4,000 tokens (1,000 timesteps × a 4-token return-state-action-reward tuple). Prior methods could handle 32 trajectories (128K tokens) on a 350M model; Ring Attention enables 128 trajectories (512K tokens).
| Task | BC-10% | DT | AT+ME (32 traj) | AT+BPT (32 traj) | AT+BPT (128 traj) | AT+Ring (128 traj) |
|---|---|---|---|---|---|---|
| Walker Stand | 52.91 | 34.54 | — | 95.45 | OOM | 98.23 |
| Walker Run | 34.81 | 49.82 | — | 105.88 | OOM | 110.45 |
| Walker Walk | 13.53 | 34.94 | — | 78.56 | OOM | 78.95 |
| Cheetah Run | 34.66 | 67.53 | — | 178.75 | OOM | 181.34 |
| Jaco Reach | 23.95 | 18.64 | — | 87.56 | OOM | 89.51 |
| Cartpole Swingup | 56.82 | 67.56 | — | 120.56 | OOM | 123.45 |
| Average | 36.11 | 45.51 | — | 111.13 | OOM | 113.66 |
Key observations:
- BPT with 128 trajectories runs out of memory (OOM) — the sequence is too long for a single device
- Ring Attention enables 128 trajectories, improving average return from 111.13 to 113.66
- The improvement is consistent across all 6 tasks, confirming that more context helps RL
- The gains are modest (2.3%) because 32 trajectories already capture much of the useful experience; the point is that Ring Attention makes this possible where baselines cannot even run
5.4 Long-Context Language Model Evaluation
The authors fine-tuned LLaMA-13B to 512K context length using Ring Attention on 32 A100 GPUs, using the ShareGPT dataset (125K cleaned conversations).
Line Retrieval Test (from LongChat): The model must retrieve a specific number from a long document. This tests precise information retrieval across long contexts.
Results (from Figure 3):
- Ring Attention-13B-512K: Maintains high accuracy (>90%) even at the maximum tested context lengths
- GPT-3.5-turbo-16K: High accuracy up to 16K, then unavailable for longer contexts
- Vicuna-16B-16K: Similar to GPT-3.5, limited to 16K
- Claude-2-100K: Competitive up to 100K but accuracy degrades for longer contexts
Ring Attention-13B-512K is the only model that maintains strong performance across all tested context lengths, demonstrating that the long-context training enabled by Ring Attention translates to genuine long-range retrieval capability.
5.5 Training FLOPs Scaling Analysis
How much extra compute does longer context require? The per-sequence forward FLOPs are approximately:

FLOPs(s) ≈ l · (24·s·h² + 4·s²·h)

where h is the hidden dimension, s is the sequence length, and l is the number of layers (the h² term covers the QKV/output projections and the FFN, the s² term attention; constants are approximate). At a fixed token budget, the per-dataset FLOPs ratio relative to 4K context is then roughly:

r(s) = (6h + s) / (6h + 4096)

Key insight: For large models, the FFN and projections dominate (the 24·s·h² term), so increasing context length adds relatively less overhead. For example:
- LLaMA-7B (h = 4096): 4K → 1M context = ~40× FLOPs increase
- GPT-3 175B (h = 12288): 4K → 10M context = ~162× FLOPs increase despite 2,500× longer context
- A hypothetical 1TB model (with much larger h): 4K → 10M context = only ~46× FLOPs increase
This sub-linear relationship between context length and training cost makes long-context training increasingly practical for larger models.
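The ratio can be computed directly. The constants below are our approximation (per-layer forward FLOPs ≈ 24sh² + 4s²h), so the numbers differ slightly from the paper's reported figures; h = 36,000 is an illustrative stand-in for a much larger model:

```python
def flops_ratio(s, h, s0=4096):
    """Approximate FLOPs-per-token ratio vs. an s0-token baseline.

    Assumes per-layer forward FLOPs ~ 24*s*h**2 (projections + FFN)
    plus 4*s**2*h (attention); constants are our approximation, not the
    paper's exact accounting.
    """
    per_token = lambda length: 24 * h ** 2 + 4 * length * h
    return per_token(s) / per_token(s0)

r_7b = flops_ratio(1_000_000, h=4096)         # ~36x (paper reports ~40x)
r_large = flops_ratio(10_000_000, h=36_000)   # ~46x despite 2,500x longer context
```

The larger the hidden size, the smaller the relative cost of extending context, which is the sub-linear relationship described above.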
6. Comparison with Related Work
6.1 vs. Sequence Parallelism (Li et al., 2023)
Standard sequence parallelism distributes the sequence across devices but requires all-to-all communication for the attention step — every device needs to see all keys/values. This communication is not easily overlapped with computation, introducing significant overhead.
Ring Attention uses the ring topology to overlap communication with blockwise computation, achieving zero overhead when the arithmetic intensity condition is met. It also requires communication only with immediate neighbors, not all-to-all.
6.2 vs. DeepSpeed Ulysses (Jacobs et al., 2023)
DeepSpeed Ulysses shards along the sequence and attention heads, using optimized all-to-all collectives. However, it is limited by the number of attention heads (cannot scale beyond devices for sequence parallelism) and requires gathering the full sequence on each device during attention.
Ring Attention has no such limitation — it can scale to an arbitrary number of devices, limited only by the minimum block size constraint.
6.3 vs. Ring Topology for Self-Attention (Li et al., 2023)
Prior work (Li et al., 2023) also proposed a ring topology for self-attention, but without blockwise computation. The result: communication cannot be fully overlapped because the arithmetic intensity is too low (each device operates on the full local sequence, not blocks).
Ring Attention's innovation is combining the ring topology with blockwise computation, which dramatically increases arithmetic intensity per communication round and enables perfect overlap.
6.4 vs. Approximate Attention Methods
Many methods (Longformer, BigBird, Linformer, etc.) approximate attention to reduce the quadratic cost. These introduce approximation errors and often degrade model quality.
Ring Attention computes exact full self-attention. There is no approximation — the output is bit-identical to what you would get on a single device with unlimited memory.
7. Limitations and Boundary Conditions
7.1 Minimum Sequence Length
Ring Attention requires a minimum sequence length per device of roughly 6 · F/B tokens, where F is device FLOPS and B is interconnect bandwidth. For A100 with InfiniBand, this is ~150K tokens per device. With 32 devices, the total sequence must be at least ~4.8M tokens. This means Ring Attention is most beneficial for genuinely long sequences, not for short sequences distributed across many devices.
7.2 Interconnect Bandwidth Dependency
The method's efficiency depends critically on interconnect bandwidth. On high-bandwidth interconnects (NVLink, TPU ICI), the minimum block size is ~1K tokens — trivially satisfied. On lower-bandwidth InfiniBand, it is ~25K tokens, which constrains the setup.
7.3 Compute Cost Still Grows Quadratically
Ring Attention eliminates the memory bottleneck but not the compute bottleneck. Self-attention is still O(s²) in FLOPs. For very long sequences, the total compute grows quadratically. The paper shows this is manageable for large models (where the FFN dominates), but for small models with very long contexts, compute can become the bottleneck.
7.4 No Attention Pattern Optimization
Because Ring Attention computes full attention, it does not benefit from sparsity patterns (local attention, sliding window, etc.). Methods that exploit attention sparsity can be more efficient for tasks where sparse attention suffices.
7.5 JAX/XLA-Centric Implementation
The provided implementation is in JAX, using lax.ppermute and lax.scan. Porting to PyTorch/CUDA requires re-implementing the ring communication primitives, which is non-trivial (though has since been done by the community).
7.6 Training Data for Long Context
Ring Attention enables training on long sequences, but obtaining training data with genuinely long-range dependencies remains a challenge. The ShareGPT dataset used for fine-tuning has conversations that are much shorter than 512K tokens; the model is padded/concatenated to fill the context window. Whether this produces the same quality as training on naturally long documents is an open question.
8. Reproducibility
8.1 Code
Open-source JAX implementation available at: https://github.com/lhao499/llm_large_context
The core Ring Attention logic is ~60 lines of JAX code (shown in Appendix A of the paper), using:
- `jax.custom_vjp` for forward/backward pass definition
- `lax.ppermute` for ring send/receive
- `lax.scan` for iterating over key-value blocks
8.2 Hardware Requirements
To reproduce the headline results:
- TPUv4-1024: requires access to Google Cloud TPU pods (not widely available)
- 32× A100: more accessible but still requires a significant GPU cluster
- 8× A100 (single DGX): most reproducible; demonstrates 8× context extension
8.3 Key Configuration Notes
- FSDP is used for all experiments to shard model parameters
- Full gradient checkpointing with the `nothing_saveable` policy
- No tensor parallelism in context length evaluation (added only for MFU evaluation)
- Batch size: 2M tokens on GPU, 4M tokens on TPU
- Fine-tuning for line retrieval: LLaMA-13B on 32× A100, ShareGPT data (125K conversations after cleaning)
9. Significance and Impact
Ring Attention is a systems-level breakthrough that changes how we think about context length. Before Ring Attention, context length was fundamentally limited by single-device memory. After Ring Attention, context length is limited only by the number of devices and the compute budget.
This has profound implications:
- Video understanding: Video at 30fps generates thousands of tokens per second; million-token context makes direct video processing feasible
- Repository-level code understanding: Large codebases can be processed as a single sequence
- Scientific applications: Gene sequences, protein structures, and climate data can be modeled at full length
- Reinforcement learning: Agents can condition on much longer histories of experience
- Multi-modal models: Combining text, image, audio, and video in a single context window
The paper has influenced subsequent work including LongRoPE, YaRN, and various long-context model training efforts. The core idea — overlapping ring communication with blockwise computation — has become a standard technique in large-scale distributed training.
10. Final Assessment
Ring Attention is one of those rare papers where the core idea is simple, the execution is clean, and the impact is transformative. The permutation invariance of blockwise attention → ring communication → zero-overhead overlap chain is elegant and fundamental. The paper does not introduce approximations, does not require architectural changes, and composes naturally with existing parallelism strategies.
Strengths:
- Achieves theoretical optimal scaling (device count × context length)
- Zero communication overhead when arithmetic intensity condition is met
- Exact attention — no approximation
- Clean implementation (~60 lines of core logic)
- Comprehensive evaluation across GPUs, TPUs, model sizes, and applications
Weaknesses:
- Compute cost still quadratic in sequence length (memory-only solution)
- InfiniBand bandwidth creates a practical floor on minimum sequence length
- Line retrieval evaluation is relatively simple; more sophisticated long-range reasoning benchmarks would strengthen the claims
- JAX-centric; PyTorch support requires community effort
Overall: Ring Attention is essential reading for anyone working on long-context models, distributed training systems, or efficient ML. It removes the single-device memory wall and replaces it with a scaling law: more devices = proportionally more context. This is a fundamental contribution to the field.
Reviewed by Zhongzhu Zhou, March 29, 2026.