ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — In-Depth Technical Review (English)
Author: Zhongzhu Zhou
Paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (SC 2020 / arXiv 2019)
ArXiv: https://arxiv.org/abs/1910.02054
Authors: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He (Microsoft)
Abstract
ZeRO (Zero Redundancy Optimizer) is one of the most important systems papers in the large-model training era. It fundamentally rethinks how memory is consumed during distributed deep learning training and proposes a family of optimizations that eliminate redundant memory storage across data-parallel processes — without sacrificing computational efficiency. The result is staggering: ZeRO enables training of models with over 100 billion parameters on 400 GPUs with super-linear speedup, achieving 15 PetaFlops of throughput. This represents an 8× increase in trainable model size and 10× improvement in throughput over the state-of-the-art at the time. Perhaps most importantly, ZeRO democratizes large model training: it allows data scientists to train models with up to 13 billion parameters using nothing more than standard data parallelism — no model parallelism, no pipeline parallelism, no model refactoring required. ZeRO is the backbone of Microsoft's DeepSpeed library and powered Turing-NLG, which at the time was the world's largest language model (17B parameters).
1. Prerequisites: What You Need to Know Before Reading This Paper
1.1 Data Parallelism (DP) — The Default Distributed Training Strategy
If you have ever trained a deep learning model on more than one GPU, you have almost certainly used data parallelism. The idea is straightforward: every GPU holds a complete copy of the model (parameters, optimizer states, gradients — everything). Each GPU processes a different mini-batch of data. After the backward pass, the gradients are averaged across all GPUs using an all-reduce collective communication operation, and then every GPU updates its local copy of the model identically.
Why it works well: Each GPU does the same amount of computation on its own data slice, so compute utilization is high. The only communication is the gradient all-reduce, and modern implementations (like NCCL's ring all-reduce) are highly optimized.
Why it breaks down for large models: The critical weakness is that every single GPU must hold the entire model state — parameters, gradients, and optimizer states. For a 1.5B parameter model using mixed-precision Adam, this is already 24 GB. For a 10B model it is 160 GB — far beyond what fits on a single GPU. Standard data parallelism simply cannot train large models because it replicates everything.
1.2 Model Parallelism (MP) — Splitting the Model Across Devices
When a model is too large for one GPU, model parallelism splits it vertically — different parts of each layer's computation are placed on different GPUs. For example, Megatron-LM splits the attention heads and MLP weight matrices across GPUs within a transformer layer.
The upside: Memory is shared; each GPU only holds a fraction of the parameters, gradients, and optimizer states for each layer.
The downside: Communication is required within every layer (all-reduce after each GEMM operation), which means MP only works efficiently when GPUs are connected by very fast interconnects (like NVLink within a single DGX-2 node, with 300 GB/s bandwidth). Cross-node MP over InfiniBand (12.5 GB/s) is disastrously slow. Megatron-LM's throughput drops to under 5 TFlops/GPU (below 5% of peak) when using MP across two DGX-2 nodes with a 40B parameter model.
Additionally, MP requires model developers to rewrite their model code for distributed execution, limiting its accessibility.
1.3 Pipeline Parallelism (PP) — Splitting by Layers
Pipeline parallelism splits the model horizontally — different layers run on different GPUs. GPipe and PipeDream are notable examples. The model is divided into stages, and micro-batches flow through the pipeline.
Problems: PP introduces "pipeline bubbles" (idle time), requires batch sizes proportional to the number of pipeline stages, and complicates features like tied weights and batch normalization. PipeDream keeps stale parameter copies to reduce bubble time, at the cost of convergence guarantees.
1.4 Mixed-Precision Training and Adam Optimizer Memory
Modern large model training uses mixed-precision training (fp16/fp32): forward and backward passes use fp16 for speed (leveraging GPU tensor cores), but the optimizer maintains fp32 copies for numerical stability.
For a model with Ψ parameters trained with Adam:
- fp16 parameters: 2Ψ bytes
- fp16 gradients: 2Ψ bytes
- fp32 parameter copy (optimizer): 4Ψ bytes
- fp32 momentum (Adam): 4Ψ bytes
- fp32 variance (Adam): 4Ψ bytes
- Total: 2Ψ + 2Ψ + 12Ψ = 16Ψ bytes
The memory multiplier K for Adam optimizer states is 12 (the fp32 copy + momentum + variance). This means a 1.5B parameter model needs 24 GB just for model states — and a 1.5B model's fp16 parameters alone are only 3 GB! The optimizer states dominate memory consumption.
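The accounting above fits in a few lines. A minimal sketch (byte counts follow the paper's K=12 mixed-precision Adam breakdown):

```python
def adam_mixed_precision_bytes(psi):
    """Total model-state bytes for a psi-parameter model with fp16/fp32 Adam."""
    fp16_params = 2 * psi      # forward/backward working copy
    fp16_grads = 2 * psi
    fp32_params = 4 * psi      # optimizer master copy
    fp32_momentum = 4 * psi    # Adam first moment
    fp32_variance = 4 * psi    # Adam second moment
    return fp16_params + fp16_grads + fp32_params + fp32_momentum + fp32_variance

# 1.5B parameters -> 24 GB of model states, of which fp16 weights are only 3 GB
print(adam_mixed_precision_bytes(1.5e9) / 1e9)  # 24.0
```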
1.5 Activation Memory and Checkpointing
Activations are the intermediate tensors saved during the forward pass for use in the backward pass. For transformer models, activation memory scales as: batch_size × seq_length × hidden_dim × num_layers. A 1.5B GPT-2 model with batch size 32 and sequence length 1024 requires ~60 GB of activation memory.
Activation checkpointing (also called gradient checkpointing or rematerialization) saves only a subset of activations (typically one per transformer layer) and recomputes the rest during the backward pass. This reduces activation memory to roughly the square root of the uncheckpointed total, at the cost of ~33% extra computation (one additional forward pass). For the 1.5B model, this brings activation memory from 60 GB down to ~8 GB. But for 100B+ models, even with checkpointing, activation memory can still reach 60 GB.
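The square-root behavior follows from a simple cost model (Chen et al., 2016): checkpointing every k of L layers keeps L/k checkpoints plus the k activations recomputed for the current segment during backward, which is minimized near k = √L. A toy sketch, counting memory in units of "one layer's activations":

```python
import math

def peak_activation_units(n_layers, ckpt_every):
    """Simplified cost model: stored checkpoints plus one segment's
    recomputed activations held live during the backward pass."""
    return math.ceil(n_layers / ckpt_every) + ckpt_every

n = 64
best = min(range(1, n + 1), key=lambda k: peak_activation_units(n, k))
print(best, peak_activation_units(n, best))  # 8 16
```

With 64 layers the optimum checkpoint interval is 8 = √64, giving a peak of 16 layer-units instead of 64.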
1.6 Communication Collectives: All-Reduce, Reduce-Scatter, All-Gather
Understanding ZeRO requires knowing these collective operations:
- All-reduce: Every process ends up with the sum (or average) of all processes' data. Implemented as reduce-scatter + all-gather. Communication volume: ≈2Ψ per process (≈Ψ in each of the two phases).
- Reduce-scatter: Different parts of the data are reduced to different processes. Communication volume: Ψ per process.
- All-gather: Each process sends its partition to all others so everyone has the full data. Communication volume: Ψ per process.
These volumes matter because ZeRO's key claim is that it achieves massive memory reduction with minimal (or zero) communication overhead.
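A single-process toy simulation (not a real NCCL ring, just the decomposition and its per-process volume accounting) illustrates why all-reduce = reduce-scatter + all-gather with ≈2Ψ elements sent per process:

```python
def all_reduce_via_rs_ag(local_data):
    """local_data: list of per-process vectors, length divisible by N.
    Returns the all-reduced result and per-process elements sent."""
    n = len(local_data)
    psi = len(local_data[0])
    chunk = psi // n
    sent = [0] * n

    # Reduce-scatter: process i ends up owning the sum of chunk i.
    owned = []
    for i in range(n):
        acc = [0.0] * chunk
        for p in range(n):
            seg = local_data[p][i * chunk:(i + 1) * chunk]
            acc = [a + s for a, s in zip(acc, seg)]
            if p != i:
                sent[p] += chunk  # p ships its chunk-i segment toward owner i
        owned.append(acc)

    # All-gather: every process receives every reduced chunk.
    result = [x for part in owned for x in part]
    for i in range(n):
        sent[i] += chunk * (n - 1)  # owner i sends its chunk to the others
    return result, sent

data = [[float(p + 1)] * 8 for p in range(4)]  # 4 processes, psi = 8
summed, sent = all_reduce_via_rs_ag(data)
print(summed[0])  # 1+2+3+4 = 10.0
print(sent)       # each process sends 2 * psi * (n-1)/n = 12 elements
```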
2. What This Paper Does (The Core Idea)
The Problem: Memory Redundancy in Data Parallelism
The fundamental observation is breathtakingly simple: in standard data parallelism, every GPU stores an identical copy of the entire model state. With 64 GPUs, you have 64 identical copies of the optimizer states, 64 copies of the gradients, and 64 copies of the parameters. This is pure waste — the aggregate memory across all GPUs is enormous, but 63/64ths of it stores redundant information.
The Key Insight: Partition Instead of Replicate
ZeRO's core insight: partition the model states across data-parallel processes instead of replicating them. Each GPU stores only 1/N_d of the optimizer states, gradients, and/or parameters (where N_d is the data-parallel degree). When a process needs data it does not own, it requests it from the responsible process through communication.
The genius is that this can be done with zero or minimal increase in communication volume compared to standard DP, because the existing all-reduce in DP can be decomposed into exactly the communication primitives ZeRO needs.
Two Complementary Optimization Families
- ZeRO-DP (Data Parallelism optimizations): Three progressive stages that partition optimizer states, gradients, and parameters respectively, reducing model state memory from 16Ψ bytes down to 16Ψ/N_d bytes.
- ZeRO-R (Residual memory optimizations): Addresses the "leftover" memory consumers — activation memory, temporary buffers, and memory fragmentation — that become the bottleneck once model state memory is optimized.
Why This is Different from Model Parallelism
MP also partitions model states, but it does so by splitting the computation within each layer, which fragments the work into small pieces and requires expensive intra-layer communication. ZeRO-DP partitions the storage while keeping the computation unchanged — each GPU still processes a full forward/backward pass on its data (just like standard DP), and communication happens at well-defined, coarse-grained boundaries. This is why ZeRO retains DP's superior scaling efficiency.
3. Method Details
3.1 ZeRO-DP Stage 1: Optimizer State Partitioning (P_os)
Mechanism: Divide the optimizer states (momentum, variance, and fp32 parameter copy for Adam) into N_d equal partitions. GPU i only stores and updates partition i.
Training flow:
- Forward pass: identical on all GPUs (all hold full fp16 parameters).
- Backward pass: compute full gradients on all GPUs.
- Gradient reduction: reduce-scatter to accumulate gradients — GPU i receives the reduced gradient for partition i only.
- Optimizer step: GPU i updates only partition i's parameters using its local optimizer states.
- All-gather: collect the updated fp16 parameters from all GPUs so every GPU has the full updated model.
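The flow above can be sketched in miniature. SGD stands in for Adam to keep the code short, and the "ranks" are simulated in a single process, but the key property survives: partitioned updates followed by an all-gather reproduce the plain data-parallel step exactly.

```python
def zero1_step(params, per_rank_grads, lr=0.1):
    """Toy Stage-1 step: each of n ranks updates only its own partition."""
    n = len(per_rank_grads)
    psi = len(params)
    chunk = psi // n
    updated = list(params)
    for i in range(n):
        lo, hi = i * chunk, (i + 1) * chunk
        # Reduce-scatter: rank i receives the averaged gradient for partition i only.
        avg = [sum(g[j] for g in per_rank_grads) / n for j in range(lo, hi)]
        # Optimizer step: rank i updates only the parameters it owns.
        for j, g in zip(range(lo, hi), avg):
            updated[j] -= lr * g
    # All-gather: `updated` is the concatenation of every rank's partition.
    return updated

params = [1.0, 2.0, 3.0, 4.0]
grads = [[0.4, 0.4, 0.4, 0.4], [0.0, 0.0, 0.0, 0.0]]  # two ranks
print(zero1_step(params, grads))  # one SGD step on the averaged gradient
```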
Memory savings: Model state memory drops from (4Ψ + KΨ) = 16Ψ to (4Ψ + KΨ/N_d). For large N_d: 16Ψ → ~4Ψ, a 4× reduction.
Communication volume: The reduce-scatter (Ψ) + all-gather (Ψ) = 2Ψ, which is identical to the standard DP all-reduce. Zero additional communication!
This is a beautiful result: by simply rearranging which GPU stores which optimizer state partition, we get 4× memory savings for free.
3.2 ZeRO-DP Stage 2: Gradient Partitioning (P_os+g)
Mechanism: Since GPU i only needs gradients for partition i (to update its optimizer states), we only reduce gradients to their owning process. As soon as a gradient bucket is reduced and consumed, its memory is freed.
Implementation detail: Gradients are organized into buckets aligned with the parameter partitions. During the backward pass, as gradients become available, a reduce (not all-reduce) is performed to the responsible process. This is effectively a reduce-scatter, and after the reduce, non-owning processes can immediately release the gradient memory.
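A toy accounting of live gradient memory shows the effect of bucketed reduce-scatter. The bucket size and counts are illustrative, not DeepSpeed's actual defaults:

```python
def peak_grad_elements(psi, n_dp, bucket, partitioned):
    """Peak live gradient elements on one rank during the backward pass."""
    live_peak = 0
    kept = 0  # reduced gradients this rank must retain
    for start in range(0, psi, bucket):
        size = min(bucket, psi - start)
        live_peak = max(live_peak, size + kept)  # bucket materializes, then reduces
        # Stage 2 frees non-owned gradients right after the bucket is reduced.
        kept += size // n_dp if partitioned else size
    return live_peak

print(peak_grad_elements(1_000_000, 64, 25_000, True))   # 40210
print(peak_grad_elements(1_000_000, 64, 25_000, False))  # 1000000
```

With partitioning, the peak stays near one bucket plus this rank's 1/N_d share; without it, the full gradient tensor is live at the end of backward.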
Memory savings: Gradient memory drops from 2Ψ to 2Ψ/N_d. Combined with P_os: model state memory = 2Ψ + 14Ψ/N_d ≈ 2Ψ for large N_d, an 8× reduction.
Communication volume: Still 2Ψ — the reduce-scatter (Ψ) + all-gather (Ψ) is the same as before. Zero additional cost.
3.3 ZeRO-DP Stage 3: Parameter Partitioning (P_os+g+p)
Mechanism: Each GPU only stores 1/N_d of the parameters. During forward and backward propagation, parameters are collected via all-gather on demand — just before they are needed for a layer's computation — and discarded immediately after.
Training flow:
- Forward pass: for each layer, all-gather the parameters from all processes, compute, then discard the non-local parameters. This is pipelined — the all-gather for layer k+1 can overlap with computation for layer k.
- Backward pass: same all-gather pattern in reverse order.
- Gradient reduction: reduce-scatter as before.
- Optimizer step: local update only.
Memory savings: All model states reduced to 16Ψ/N_d. With N_d = 64, a 7.5B model goes from 120 GB to 1.9 GB.
Communication volume: Forward all-gather (Ψ) + backward all-gather (Ψ) + gradient reduce-scatter (Ψ) = 3Ψ, which is 1.5× the baseline DP volume. This is the only stage with a communication increase, and it is modest.
Implication: With P_os+g+p, the trainable model size scales linearly with the number of GPUs. On 1024 GPUs, a 1 trillion parameter model needs only 16 GB per GPU — well within a 32 GB V100's capacity.
3.4 Memory Savings Summary (Table 1 from the Paper)
For a 7.5B parameter model with mixed-precision Adam (K=12):
| DP Degree (N_d) | Standard DP | P_os | P_os+g | P_os+g+p |
|---|---|---|---|---|
| 1 | 120 GB | 120 GB | 120 GB | 120 GB |
| 64 | 120 GB | 31.4 GB | 16.6 GB | 1.88 GB |
| 1024 | 120 GB | 30.1 GB | 15.1 GB | 0.12 GB |
For a 1 Trillion parameter model:
| DP Degree (N_d) | Standard DP | P_os | P_os+g | P_os+g+p |
|---|---|---|---|---|
| 1 | 16,000 GB | 16,000 GB | 16,000 GB | 16,000 GB |
| 1024 | 16,000 GB | 4,011 GB | 2,013 GB | 15.6 GB |
The progression from 16 TB to 15.6 GB per device is nothing short of extraordinary.
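The tables follow directly from the per-stage formulas. A small calculator (decimal GB, K=12) reproduces the rows above:

```python
def model_state_gb(psi, n_dp, k=12):
    """Per-GPU model-state memory in decimal GB for mixed-precision Adam.

    fp16 params + fp16 grads = 4*psi bytes; optimizer states = k*psi bytes.
    """
    gb = 1e9
    return {
        "baseline": (4 + k) * psi / gb,                     # everything replicated
        "P_os":     (4 * psi + k * psi / n_dp) / gb,        # optimizer states sharded
        "P_os+g":   (2 * psi + (2 + k) * psi / n_dp) / gb,  # + gradients sharded
        "P_os+g+p": ((4 + k) * psi / n_dp) / gb,            # everything sharded
    }

print(model_state_gb(7.5e9, 64))   # reproduces the 64-GPU row of Table 1
print(model_state_gb(1e12, 1024))  # 1T model: ~15.6 GB/GPU with full partitioning
```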
3.5 ZeRO-R: Partitioned Activation Checkpointing (P_a)
Problem: Model parallelism (when used alongside ZeRO-DP) replicates activations across MP processes. For a 100B model with MP=16 and batch size 32, activation checkpoints alone consume ~33 GB per GPU.
Solution: Partition the activation checkpoints across MP processes and reconstruct them via all-gather only when needed for backward recomputation.
Memory saving: Reduces activation memory by a factor of the MP degree (up to 16× on a DGX-2).
Communication overhead: Only one all-gather per transformer block (communication volume = seq_length × hidden_dim), which is less than 10% of the existing MP communication. For large models, the arithmetic intensity is so high (≥10K) that the data movement can be fully overlapped.
CPU offloading variant (P_a+cpu): For extremely large models, the partitioned checkpoints can be moved to CPU memory, reducing activation memory to nearly zero. This adds 2× CPU data movement compared to P_a but is only activated when beneficial.
3.6 ZeRO-R: Constant-Size Buffers (C_B)
High-performance libraries fuse all parameters/gradients into a single large buffer for operations like all-reduce (larger messages achieve higher bandwidth). But for a 3B parameter model, a 32-bit fused buffer requires 12 GB.
ZeRO-R uses fixed-size buffers — large enough for good communication efficiency but independent of model size. This prevents buffer memory from growing unboundedly with model scale.
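A sketch of the idea: stream the flattened data through a fixed-size staging buffer rather than one model-sized fused buffer. The `communicate` callback stands in for the real fused collective, and the buffer size is illustrative:

```python
def fused_collective(flat, buffer_elems, communicate):
    """Send `flat` through a bounded staging buffer, one message at a time."""
    for start in range(0, len(flat), buffer_elems):
        communicate(flat[start:start + buffer_elems])  # message size <= buffer_elems

calls = []
fused_collective(list(range(10)), 4, calls.append)
print([len(c) for c in calls])  # [4, 4, 2] -- buffer memory bounded at 4 elements
```

Buffer memory stays constant as the model grows; only the number of messages increases.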
3.7 ZeRO-R: Memory Defragmentation (M_D)
Problem: During training, short-lived tensors (intermediate activations, activation gradients) interleave with long-lived tensors (checkpointed activations, parameter gradients) in memory. This fragmentation can cause OOM errors even when 30%+ of memory is technically free — just not contiguous.
Solution: Pre-allocate contiguous memory chunks for long-lived data (activation checkpoints and parameter gradients) and copy data into these pre-allocated buffers on-the-fly. This eliminates fragmentation and also improves the memory allocator's efficiency (less time searching for contiguous free blocks).
4. Experiment Setup
4.1 Hardware
- Cluster: 400 NVIDIA V100 GPUs (25 DGX-2 nodes)
- Intra-node interconnect: NVSwitch (300 GB/s per link)
- Inter-node interconnect: InfiniBand EDR (12.5 GB/s per link, 800 Gbps aggregate bandwidth)
- GPU memory: 32 GB per V100
4.2 Implementation: ZeRO-100B
The full ZeRO (all three stages of ZeRO-DP + ZeRO-R) could theoretically train 1T+ parameter models, but the compute capacity of 400 V100s would make training impractically slow. Therefore, the evaluation focuses on ZeRO-100B: P_os+g (stages 1+2) of ZeRO-DP combined with all of ZeRO-R.
This subset enables training of models up to ~170B parameters — an order of magnitude beyond the state of the art.
4.3 Models
All models are GPT-2-like transformer architectures with varying hidden dimensions and number of layers:
- Sizes from 1.5B to 170B parameters
- Hidden dimensions: 1600 to 8192
- Layers: 24 to 212
4.4 Baselines
- Without MP: PyTorch Distributed Data Parallel (DDP)
- With MP: Megatron-LM (open-source version, September 2019) — the state-of-the-art for large model training
4.5 Evaluation Metrics
- Per-GPU throughput (TFlops/GPU)
- Aggregate throughput (PetaFlops)
- Maximum trainable model size
- Scalability (speedup vs. GPU count)
5. Results & Analysis
5.1 Throughput and Model Size (Figure 2)
ZeRO-100B achieves sustained 15+ PetaFlops (38+ TFlops/GPU, over 30% of V100 peak) for models from 8B to 100B parameters on 400 GPUs.
Megatron-LM (baseline) degrades catastrophically beyond ~40B parameters. At 40B, Megatron requires MP across two nodes, and throughput drops to ~5 TFlops/GPU (<5% of peak). This is because cross-node MP uses InfiniBand (12.5 GB/s) instead of NVSwitch (300 GB/s) — a 24× bandwidth reduction.
Concrete comparison for 100B model: ZeRO-100B achieves ~38 TFlops/GPU. Megatron cannot even run a 100B model. This is a 10× improvement over what Megatron achieves at its maximum scale.
Maximum model size: ZeRO-100B can efficiently train up to 170B parameters on 400 GPUs. Megatron maxes out at ~20B with acceptable efficiency.
5.2 Super-Linear Scalability (Figure 3)
This is one of the most remarkable results: for a 60B parameter model, ZeRO-100B shows super-linear speedup when scaling from 64 to 400 GPUs. Doubling GPUs more than doubles throughput.
Why: ZeRO's P_os+g reduces per-GPU memory consumption as N_d increases. Less memory used for model states → more memory available for larger batch sizes → higher arithmetic intensity → better GPU utilization → higher per-GPU throughput.
This is the opposite of typical scaling behavior, where communication overhead causes sub-linear scaling. ZeRO turns the scaling game on its head.
5.3 Democratizing Large Model Training (Figure 4)
Without ZeRO, the largest model trainable with standard DP (no MP) is 1.4B parameters, limited by GPU memory. ZeRO-100B pushes this to 13B parameters — larger than T5 (11B) and Megatron-LM (8.3B), the largest published models at the time.
Throughput exceeds 40 TFlops/GPU for these models, and since no MP is needed, there is no requirement for expensive NVLink/NVSwitch interconnects. This opens large model training to clusters with commodity networking.
5.4 Memory and Performance Analysis (Figures 6-8)
The paper systematically evaluates five ZeRO configurations (C1-C5), combining different ZeRO-DP stages with ZeRO-R options:
Maximum model size progression:
- C1 (P_os + C_B + M_D): 40B
- C2 (P_os + C_B + M_D + P_a): 60B (activation partitioning frees 16× activation memory)
- C4 (P_os+g + C_B + M_D + P_a): 140B (gradient partitioning halves model state memory)
- C5 (P_os+g + C_B + M_D + P_a+cpu): 170B (CPU offloading of activations removes the last memory bottleneck)
Performance insight: Lower memory consumption → larger possible batch size → higher throughput. The exception is C5 vs C4 for 60B models: CPU offloading adds data movement overhead that outweighs the benefit unless the model is so large that it cannot run without CPU offloading.
5.5 Turing-NLG (17B Parameters)
ZeRO-100B powered the training of Turing-NLG, which at the time (February 2020) was the world's largest language model at 17.17 billion parameters. It achieved:
- Perplexity: 10.21 on WikiText-103 (new state-of-the-art)
- Throughput: 41.4 TFlops/GPU sustained
- Significantly outperforming the previous SOTA (Megatron-LM 8.3B)
5.6 Theoretical vs. Measured Model Sizes (Table 2)
The paper validates its memory analysis by comparing theoretical maximum model sizes with measured values:
| MP | GPUs | Theoretical Max (P_os) | Measured (P_os) |
|---|---|---|---|
| 1 | 64 | 7.6B | 6.2B |
| 4 | 256 | 30.4B | 25B |
| 16 | 1024 | 121.6B | 100B |
The measured values closely match theoretical predictions (within ~20%), confirming that the memory analysis provides realistic upper bounds.
6. Limitations & Boundary Conditions
6.1 Communication Volume Increase in Stage 3
P_os+g+p increases communication by 1.5× over baseline DP. For bandwidth-constrained clusters (e.g., low inter-node bandwidth), this overhead may not be negligible. The paper only evaluates up to P_os+g in practice — the full P_os+g+p was not implemented at publication time.
6.2 Compute Capacity Gap for Truly Trillion-Parameter Models
The paper honestly acknowledges that while ZeRO can fit a 1T model on 1024 GPUs, the compute required to train it is staggering. A 1T model has ~3000× more computation per sample than BERT-Large. On the 1024-GPU cluster that trains BERT-Large in 67 minutes, a 1T model would take over 140 days — and likely over a year with realistic sequence lengths and dataset sizes. Exa-scale compute is needed.
6.3 No Stage 3 Implementation at Publication
The paper presents Stage 3 (parameter partitioning) theoretically but does not implement or evaluate it. The published implementation (ZeRO-100B) only includes Stages 1-2 plus ZeRO-R. (Stage 3 was later released in DeepSpeed.)
6.4 Batch Size Constraints and Convergence
ZeRO's super-linear scaling partially relies on increasing batch sizes as more GPUs reduce per-GPU memory pressure. However, there exists a "critical batch size" beyond which larger batches slow convergence. The paper acknowledges this but notes that for their evaluation regime, batch sizes remain below this threshold. For extremely large GPU counts, this could become a real concern.
6.5 Single Optimizer Focus
The analysis is centered on Adam with mixed-precision training (K=12). While the framework generalizes to other optimizers, the specific memory savings depend on K. For SGD (K≈0), the optimizer state savings from Stage 1 would be minimal.
6.6 GPT-2 Architecture Only
All experiments use GPT-2-like transformer models. While the memory analysis is general, performance results (throughput, scaling) may differ for other architectures (e.g., encoder-decoder models, mixture-of-experts).
6.7 ZeRO-R Requires Model Parallelism
The activation partitioning optimization (P_a) only helps when model parallelism is in use (since it removes the activation replication across MP processes). For DP-only setups, P_a provides no benefit.
7. Reproducibility & Practical Notes
7.1 Open Source: DeepSpeed
ZeRO is fully open-sourced as part of Microsoft's DeepSpeed library (https://github.com/microsoft/deepspeed). It has become one of the most widely used distributed training frameworks, powering models like BLOOM-176B, various GPT-NeoX models, and many industry training runs.
7.2 Ease of Use
This is ZeRO's killer advantage: no model code changes required. Users wrap their model with DeepSpeed's interface (compatible with torch.nn.Module) and get ZeRO-DP for free. This is in stark contrast to model parallelism (which requires model rewrites) and pipeline parallelism (which requires careful stage assignment and micro-batching).
Typical integration:
```python
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config="ds_config.json")
```
ZeRO stages are selected via a JSON config file — switching between Stage 1, 2, or 3 requires changing a single number.
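A minimal configuration along these lines might look like the following (keys follow DeepSpeed's JSON schema; the values are illustrative):

```json
{
  "train_batch_size": 32,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 2 }
}
```

Changing `"stage"` to 1 or 3 selects the corresponding ZeRO-DP stage.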
7.3 Practical Hardware Recommendations
- Stage 1 (P_os): Use when model fits in memory with DP but you want to train larger batch sizes or slightly larger models. Minimal code change, zero communication overhead.
- Stage 2 (P_os+g): The sweet spot for most large model training. 8× memory reduction with zero communication overhead. Use this by default for 10B+ models.
- Stage 3 (P_os+g+p): Use when model states still do not fit with Stage 2. Accepts 1.5× communication overhead for N_d × memory reduction. Essential for 100B+ models with limited GPU count.
7.4 Estimated Compute Requirements
Based on the paper's results, training a 100B model to convergence on 400 V100 GPUs at 38 TFlops/GPU would take roughly:
- ~15 PFlops aggregate throughput
- Training time depends on dataset and tokens, but expect weeks to months
For modern hardware (A100/H100 clusters), ZeRO Stage 2-3 remains highly relevant. The specific throughput numbers will be higher, but the memory optimization principles are hardware-agnostic.
7.5 Subsequent Evolution
Since publication, ZeRO has evolved significantly:
- ZeRO-Offload: Offloads optimizer states and computation to CPU (not just activations)
- ZeRO-Infinity: Extends offloading to NVMe SSDs for even larger models
- ZeRO++: Optimizes communication with quantized gradients and hierarchical partitioning
- FSDP (Fully Sharded Data Parallelism): PyTorch's native implementation of ZeRO-like principles, heavily inspired by this work
7.6 Tips for Practitioners
- Start with ZeRO Stage 2 — it is the best default for large models (8× memory reduction, zero communication overhead).
- Enable activation checkpointing alongside ZeRO for additional memory savings.
- Monitor batch size — ZeRO frees memory that can be used for larger batches, improving throughput, but watch for convergence degradation.
- Stage 3 with CPU offloading is a last resort — the throughput penalty is real. Only use it when you truly cannot fit the model otherwise.
- Combine with MP only when needed for activation memory or batch size constraints, not for model state memory (ZeRO handles that better).
8. Deep Dive: Communication Analysis and Why ZeRO is Theoretically Optimal
8.1 Why Standard DP Communication is 2Ψ
In standard data parallelism, after the backward pass, every GPU has locally computed gradients that must be averaged across all N_d processes. The state-of-the-art implementation of all-reduce uses a two-step pipeline:
Step 1 — Reduce-scatter: The gradient tensor (Ψ elements) is divided into N_d chunks. Each chunk is reduced (summed) to a different process. Communication volume per process: Ψ elements (each process sends N_d-1 chunks of size Ψ/N_d each in a ring, totaling ~Ψ).
Step 2 — All-gather: Each process now holds one reduced chunk. The all-gather distributes all chunks to all processes. Communication volume per process: Ψ elements.
Total: 2Ψ. This is the communication cost baseline that ZeRO must match or closely approximate.
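The per-process volume of the two-phase decomposition, as arithmetic:

```python
def ring_allreduce_volume(psi, n):
    """Elements sent per process in a ring all-reduce of a psi-element tensor."""
    reduce_scatter = (n - 1) * (psi / n)  # ship n-1 chunks of size psi/n
    all_gather = (n - 1) * (psi / n)      # forward the other n-1 reduced chunks
    return reduce_scatter + all_gather    # = 2 * psi * (n-1)/n, approaching 2*psi

print(ring_allreduce_volume(1e9, 64) / 1e9)  # 1.96875, i.e. ~2Ψ per process
```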
8.2 ZeRO Stage 1-2 Achieves Exactly 2Ψ
ZeRO's insight is that the reduce-scatter step of a standard all-reduce already distributes different gradient chunks to different processes. In standard DP, this is followed by an all-gather to ensure every process has all the reduced gradients. But with ZeRO, each process only needs the gradients for its partition — so the reduce-scatter is sufficient for the gradient phase.
After the optimizer step, each process has updated its parameter partition. Now an all-gather distributes the updated parameters to all processes (cost: Ψ). So:
- Reduce-scatter of gradients: Ψ (same as the first half of standard all-reduce)
- All-gather of updated parameters: Ψ (replaces the second half of standard all-reduce)
- Total: 2Ψ — identical to standard DP
This is why Stages 1 and 2 are "free lunch" — the communication pattern is mathematically equivalent to standard DP's all-reduce, just reinterpreted to enable memory partitioning.
8.3 Stage 3's 1.5× Overhead is the Theoretical Minimum
With parameter partitioning, each process only stores 1/N_d of the parameters. During forward and backward passes, it needs the full parameters, so additional all-gathers are required:
- Forward all-gather: Ψ (collect all parameters before each layer)
- Backward all-gather: Ψ (collect parameters again for backward computation)
- Gradient reduce-scatter: Ψ
Total: 3Ψ = 1.5 × baseline. This is the minimum possible communication: you must gather the parameters you do not own for both forward and backward passes, and you must reduce the gradients. No further reduction is possible without sacrificing correctness.
8.4 Communication-Memory Trade-off Space
The elegance of ZeRO's design becomes apparent when we map out the full trade-off space:
| Configuration | Memory per GPU | Communication | Memory Reduction |
|---|---|---|---|
| Standard DP | 16Ψ | 2Ψ | 1× |
| ZeRO Stage 1 (P_os) | ~4Ψ | 2Ψ | 4× |
| ZeRO Stage 1+2 (P_os+g) | ~2Ψ | 2Ψ | 8× |
| ZeRO Stage 1+2+3 (P_os+g+p) | 16Ψ/N_d | 3Ψ | N_d × |
The first two stages achieve massive memory reduction (4× and 8×) at zero communication cost. Only the third stage introduces overhead, and it is modest (50%) in exchange for linear memory scaling.
9. The Broader Impact: How ZeRO Changed the ML Training Landscape
9.1 Before ZeRO: The Scaling Wall
Before ZeRO, scaling model training required one of three painful approaches:
- Model parallelism (Megatron-LM): Required expert knowledge to partition models, only worked efficiently within single nodes, and hit a wall at ~20B parameters due to cross-node communication costs.
- Pipeline parallelism (GPipe, PipeDream): Introduced pipeline bubbles, convergence complications, and required batch sizes proportional to pipeline depth.
- CPU offloading: Sacrificed up to 50% of training time on GPU-CPU data transfers.
Each approach forced practitioners to make painful trade-offs between memory efficiency, compute efficiency, communication overhead, usability, and convergence guarantees.
9.2 After ZeRO: The New Paradigm
ZeRO demonstrated that these trade-offs were not fundamental — they were artifacts of suboptimal system design. By recognizing that data parallelism's memory redundancy could be eliminated through smart partitioning without changing the underlying parallelization paradigm, ZeRO achieved the "best of both worlds":
- Memory efficiency of model parallelism (or better)
- Compute efficiency of data parallelism
- Communication efficiency of data parallelism
- Usability of data parallelism (zero model code changes)
9.3 The Legacy
The ideas in ZeRO have become foundational:
- DeepSpeed (Microsoft's open-source library) has been used to train thousands of models, including BLOOM-176B, the largest open-source model trained by the BigScience collaboration.
- FSDP (Fully Sharded Data Parallelism) in PyTorch implements ZeRO-like principles natively, bringing these optimizations to the broader PyTorch ecosystem.
- Megatron-DeepSpeed combines Megatron-LM's tensor/pipeline parallelism with ZeRO's data-parallel memory optimization, enabling the training of the largest models (GPT-3-scale and beyond).
- The concept of "partition storage, communicate on demand" has become a standard pattern in systems design for large-scale ML.
9.4 What ZeRO Got Right That Others Missed
The deepest insight in the paper is perhaps this: redundancy in distributed computing is usually seen as a feature (for fault tolerance), but in data-parallel training, it is pure overhead. No process ever needs all the optimizer states or all the gradients at the same time. By recognizing the temporal locality of state access during training (parameters are only needed during their layer's forward/backward pass, gradients are only needed during the reduction and update), ZeRO converts spatial redundancy into temporal communication, trading cheap bandwidth for expensive memory.
This "lazy materialization" principle — don't store what you can reconstruct on demand — is the core of ZeRO and has influenced system design well beyond just training optimization.
References
- Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC 2020 / arXiv:1910.02054.
- Shoeybi, M. et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053.
- Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015.
- Chen, T. et al. (2016). Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174.
- Micikevicius, P. et al. (2017). Mixed Precision Training. arXiv:1710.03740.
- Huang, Y. et al. (2019). GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. NeurIPS 2019.
- Narayanan, D. et al. (2019). PipeDream: Generalized Pipeline Parallelism for DNN Training. SOSP 2019.
- Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5). JMLR 2020.
- Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. (GPT-2).
Review written on 2026-03-19.