Review date: 2026-05-14
Review author: Zhongzhu Zhou
Paper reviewed: DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
Paper authors: Zhichen Zeng, Chi-Chih Chang, Jiayi Wang, Zezhou Wang, Ningxin Zheng, Zheng Zhong, Cesar A. Stuardo, Dongyang Wang, Mohamed S. Abdelfattah, Haibin Lin, Banghua Zhu, Ang Li, Ziheng Jiang (ByteDance Seed; University of Washington; Cornell University)
arXiv: 2605.11005v1, 2026-05-10
Venue/status: arXiv preprint
Short answer
DisagMoE is a Mixture-of-Experts (MoE) training system that asks a structural question: if attention and feed-forward layers have fundamentally different compute-to-communication ratios, why are we still running them on the same set of GPUs with the same network budget? The proposed answer is to split the transformer by component type. One pool of GPUs runs only the attention layers (data-parallel across the pool). Another pool runs only the feed-forward (FFN) layers (expert-parallel inside the pool). Between the two pools, the system exchanges hidden states using a many-to-many primitive that the authors call M2N/N2M, and the whole thing is scheduled as a pipeline named AF-Pipe in which forward and backward work on different micro-batches is constantly in flight across both pools. A roofline model is used to decide how many GPUs and how many NICs each pool should get, so that neither pool stalls waiting for the other.
The headline numbers are simple. On 8 to 16 nodes of 8×H800 GPUs, evaluating DeepSeek-MoE, GPT-OSS-120B, and Qwen3-235B at sequence lengths from 4K to 32K, DisagMoE delivers up to 1.81× higher training throughput than the widely deployed Megatron-LM 1F1B interleaved pipeline, and up to 1.34× higher throughput than the strongest published MoE communication-overlap systems (Tutel, Comet, DualPipe). Their iteration-time breakdown shows that compared to Tutel, DisagMoE removes about 88% of the non-overlapped communication time; compared to Comet, about 75%; compared to DualPipe, about 45%. The MoE all-to-all share of a training step on 8 nodes — which on Megatron sits near 78% of layer time — collapses to a much smaller residual once attention and FFN are decoupled.
The deeper observation behind the design is that the two halves of a transformer scale differently as sequence length grows. Attention FLOPs grow quadratically in s while FFN FLOPs and all-to-all traffic grow only linearly. With a single, symmetric (attention + FFN) device group sitting under the same network roof, the FFN side gets starved of bandwidth and stays communication-bound, while the attention side moves toward the compute roof but cannot help its FFN sibling. Disaggregation breaks the symmetry: each side gets the ratio of compute to bandwidth that suits it, and a single MILP-plus-local-search routine finds the right (m, n) GPU split and the right NIC budget per side. Once both sides clock in at roughly the same stage time, the M2N exchanges between them hide cleanly behind compute and the pipeline runs near peak.
My main takeaway is that DisagMoE follows the same playbook that DistServe (prefill/decode disaggregation) and MegaScale-Infer/StepFun (attention/FFN disaggregation for serving) brought to LLM inference, and ports it carefully to training. The interesting engineering twist is that training, unlike serving, is symmetric in time (forward + backward, gradients flow back, optimizer states live somewhere), so the disaggregation has to be done not only spatially but also schedule-wise: AF-Pipe is the schedule that makes the AF disaggregation work as a producer-consumer ring across both directions. This is the part of the paper that I think is most worth reading carefully if you have already accepted the disaggregation idea on the serving side.
1. Prerequisites
This section is written for someone who has trained a transformer with data parallelism at some point, knows what a forward and backward pass is, and has heard the phrases "MoE", "expert parallelism", "pipeline parallelism", and "all-to-all" without necessarily having implemented them. If you already work daily on Megatron-LM or DeepSpeed-style training, feel free to jump to section 2.
1.1 MoE in one paragraph
A Mixture-of-Experts model replaces a single feed-forward block at some transformer layers by E parallel feed-forward blocks (the experts) plus a small gate network. For each token, the gate picks the top-k experts (typical k is 2 to 8 out of dozens to hundreds of experts), runs the token through only those k, and combines their outputs with a weighted sum. The point of this design is that total parameter count grows with E while activated parameter count per token only grows with k. DeepSeek-V4-Pro, used as the running example in the paper, has 1.6T total parameters but activates only 49B per token because it engages 6 out of 384 experts. MoE is how you fit "trillion-parameter" into "we still want to actually train this."
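To make the routing concrete, here is a minimal top-k MoE block in PyTorch. The class name and shapes are mine, the dense per-expert loop stands in for the GroupGEMM a real system would use, and nothing here is taken from the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-k MoE FFN: a gate picks k experts per token and mixes their outputs."""
    def __init__(self, hidden: int, ffn_hidden: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn_hidden), nn.GELU(), nn.Linear(ffn_hidden, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: [tokens, hidden]
        scores = self.gate(x)                    # [tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # weights over the k chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # dense loop; real systems dispatch + GroupGEMM
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE(hidden=64, ffn_hidden=128, num_experts=8, top_k=2)
y = moe(torch.randn(16, 64))                     # 16 tokens, each activates 2 of 8 experts
```

The total parameter count scales with num_experts, but each token only pays for top_k expert forwards, which is the whole economic argument for MoE.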
1.2 Expert parallelism (EP)
Because there are now many experts and they are large, you cannot keep all of them on one GPU. Expert parallelism spreads the experts across GPUs: each GPU holds some fraction of the experts. The natural unit is the EP group, a set of GPUs that together hold one complete set of experts for one MoE layer. When the gate decides "this token wants expert 42 and expert 17," those tokens have to travel from whatever GPU currently holds them (their micro-batch GPU) to whichever GPUs hold experts 42 and 17. That motion is exactly what an all-to-all primitive does: every GPU sends a partial buffer to every other GPU in the group.
Two all-to-all calls happen per MoE layer per forward pass:
- Dispatch. Reorder tokens by destination expert and ship them to the right GPU.
- Combine. After the experts finish, ship the results back to wherever the next attention layer wants the token to live.
The backward pass adds two more, mirrored. So four all-to-all calls per MoE layer per training step, each transferring close to the full micro-batch worth of activations.
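As a sketch of what dispatch looks like in code, the snippet below uses torch.distributed.all_to_all_single inside an already-initialized EP process group (NCCL assumed). The bucketing of tokens by destination rank is taken as given, and the function name and return convention are mine.

```python
import torch
import torch.distributed as dist

def moe_dispatch(tokens_by_dest, ep_group):
    """Schematic dispatch for one MoE layer inside an EP group.

    tokens_by_dest[r] holds the tokens this rank routes to EP rank r (already
    grouped by destination expert). Returns the received tokens plus the split
    sizes needed to run the mirrored "combine" transfer afterwards.
    """
    device = tokens_by_dest[0].device
    hidden = tokens_by_dest[0].shape[-1]

    # 1. Tell every peer how many tokens it will receive from us.
    in_splits = torch.tensor([t.shape[0] for t in tokens_by_dest], device=device)
    out_splits = torch.empty_like(in_splits)
    dist.all_to_all_single(out_splits, in_splits, group=ep_group)

    # 2. Ship the tokens themselves (the "dispatch" all-to-all).
    send_buf = torch.cat(tokens_by_dest, dim=0)
    recv_buf = torch.empty(int(out_splits.sum()), hidden,
                           dtype=send_buf.dtype, device=device)
    dist.all_to_all_single(recv_buf, send_buf,
                           output_split_sizes=out_splits.tolist(),
                           input_split_sizes=in_splits.tolist(),
                           group=ep_group)
    return recv_buf, out_splits.tolist(), in_splits.tolist()

# "Combine" is the same call with the split sizes swapped, and the backward
# pass mirrors both transfers, giving the four all-to-alls per MoE layer per step.
```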
1.3 Why all-to-all is the bottleneck
GPUs in the same node talk through NVLink at several hundred GB/s (roughly 900 GB/s on an H100, 400 GB/s on the H800s used here). GPUs in different nodes talk through InfiniBand (IB) or RoCE Ethernet at roughly 100 GB/s, several times slower. As models grow, one MoE layer's experts no longer fit in a single 8-GPU node, so the EP group has to span multiple nodes and the all-to-all is forced onto the slow inter-node fabric. The paper's profiling on 64 H800 GPUs shows the all-to-all share of a forward step rising from 22% on a single node to almost 78% on 8 nodes. Once you sit at 78% communication, the compute units are idle most of the time and adding more GPUs does almost nothing.
1.4 Pipeline parallelism, 1F1B, and "bubbles"
When you cannot fit one full forward pass on one GPU, pipeline parallelism chops the model along the depth axis: stage 0 owns layers 0…k, stage 1 owns layers k+1…2k, and so on. A micro-batch flows from stage 0 to stage S–1 in the forward direction, then back from S–1 to 0 in the backward direction. The classical schedule is 1F1B (one-forward-one-backward): each worker alternates between forwarding micro-batch i+1 and backwarding micro-batch i, which keeps memory usage roughly constant per worker. Megatron also uses an interleaved variant where each worker owns multiple non-contiguous "virtual stages" so it can overlap inter-stage point-to-point (P2P) communication with compute.
A bubble is a stretch of time during which a stage is idle because its input has not arrived yet. Warm-up bubbles happen at the start of an iteration (the last stage has nothing to do until the first stage has produced S micro-batches), and cool-down bubbles happen at the end. Reducing bubbles is a major target of pipeline-parallel research.
1.5 Roofline analysis in 60 seconds
A roofline plot has arithmetic intensity (FLOPs per byte of memory traffic) on the x-axis and achieved performance (FLOPs/s) on the y-axis. There is a bandwidth slope rising from the origin (you cannot exceed bandwidth × intensity) and a compute roof at the top (you cannot exceed peak FLOPs). The kink point is the arithmetic-intensity threshold Î = peak FLOPs / peak bandwidth. Workloads with intensity above Î are compute-bound; below Î, they are bandwidth-bound. The paper uses this exact picture, except it substitutes network bandwidth (IB) for the usual memory bandwidth — the question becomes whether each MoE layer's compute can hide its all-to-all communication. The crucial insight is then that attention and FFN live on different sides of the kink as sequence length grows.
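The kink-point arithmetic is a one-liner; the helper below is generic and the numbers in the example calls are dimensionless toys, not H800 or paper values. The only thing that changes between the classic roofline and the paper's version is which bandwidth you plug in (HBM bytes versus bytes on the network).

```python
def turning_point(peak_flops_per_s: float, bytes_per_s: float) -> float:
    """I-hat = peak FLOPs / peak bandwidth, in FLOPs per byte."""
    return peak_flops_per_s / bytes_per_s

def classify(intensity: float, peak_flops_per_s: float, bytes_per_s: float) -> str:
    """Above I-hat a workload is compute-bound; below it is bandwidth-bound."""
    i_hat = turning_point(peak_flops_per_s, bytes_per_s)
    return "compute-bound" if intensity >= i_hat else "bandwidth-bound"

# Toy numbers: a device with 100 FLOP/s of compute behind a 50 B/s link has I-hat = 2,
# so a kernel doing 1 FLOP per byte moved is stuck on the bandwidth slope.
print(classify(intensity=1.0, peak_flops_per_s=100.0, bytes_per_s=50.0))  # bandwidth-bound
print(classify(intensity=4.0, peak_flops_per_s=100.0, bytes_per_s=50.0))  # compute-bound
```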
1.6 Disaggregation, briefly
"Disaggregation" in modern LLM systems means splitting a model's components onto different hardware pools so that each pool can be sized, networked, and scheduled for its own workload profile. The well-known examples on the serving side are:
- DistServe (OSDI 2024): split prefill (compute-bound, parallelism-friendly) and decode (memory-bound, latency-sensitive) onto different GPU pools.
- MegaScale-Infer / StepFun / StepMesh: split attention layers (memory-bound during decode) from FFN/MoE layers (compute-bound during decode) onto heterogeneous hardware.
- HeterMoE: brings attention-FFN disaggregation (AFD) to training, but only at single-layer granularity, which runs out of memory (OOMs) at large parameter counts.
DisagMoE is the natural next step: AFD-for-training that scales to hundreds of billions of parameters by combining AFD placement with a depth-interleaved pipeline.
2. The problem DisagMoE attacks
2.1 The compute-communication imbalance
Section 3 of the paper makes the structural argument very explicit. Inside one EP group, with sequence length s and hidden dimensions H, De, group size g, batch b, and top-k k, the relevant scaling laws are:
- Attention FLOPs per sequence: roughly α₁ s² + α₂ s (the s² term is QK^T and the softmax-times-V; the α₂ s term is the QKV and output projections).
- FFN FLOPs per sequence: roughly β s (two GroupGEMMs through the up- and down-projections of the k selected experts, applied to each of the s tokens).
- All-to-all volume per sequence: roughly γ s (proportional to top-k times hidden size, in either direction).
Translating this to arithmetic intensity (FLOPs per communicated byte) inside a single EP group:
- I_attn = (H(2 + 2/g) + 4s) / (2k) — grows linearly with s.
- I_ffn = 2 De — independent of s.
So as sequence length grows, attention's intensity rises (its compute grows faster than its share of communication), but FFN's intensity stays put. Concretely, reading the numbers from Figure 6a of the paper, for Qwen3 across 8 nodes:
| Sequence length | Attention share of compute | Attention I (FLOPs/byte) | FFN I (FLOPs/byte) |
|---|---|---|---|
| 4K | 28.4% | 1.08 | 0.73 |
| 32K | 50.3% | 2.78 | 0.77 |
The "system turning point" Î (peak FLOPs / peak bandwidth) on H800 sits somewhere between these two values, and the consequence is that under a single shared network budget, attention is already above Î and saturating its compute roof, while FFN is below Î and communication-bound. Overall throughput is dragged down by the slower stage. Adding more bandwidth helps FFN but is wasted on attention; adding more compute helps attention but is wasted on FFN.
2.2 Why per-operator overlap is not enough
The earlier MoE-overlap line of work (Tutel, Comet, FasterMoE, etc.) is operator-level: they chunk the FFN GroupGEMM into tiles and pipeline tile-level GEMMs with chunked dispatch/combine. The window for hiding communication inside FFN compute is then bounded by the FFN compute time itself, which we just saw can be much smaller than the all-to-all time in the cross-node regime. So a tail of un-hidden communication always sticks out. Lancet and DualPipe pipeline across micro-batches instead, hiding one micro-batch's communication behind another's compute, but they cannot fix the structural imbalance: attention computation grows quadratically in s while FFN and all-to-all grow linearly, so perfect overlap stays out of reach across realistic shapes.
Figure 5 in the paper makes this empirically clean: on DeepSeek-MoE and gpt-oss with 8K sequence length, Tutel, Comet, and Megatron all show a clearly visible orange ("non-overlapped communication") strip on top of the green ("overlapped") strip. The strip is exactly the residual all-to-all time that operator-level chunking cannot absorb.
2.3 Why AFD-for-serving does not transplant cleanly to training
MegaScale-Infer and StepFun put attention on older/lower-bandwidth GPUs (because attention during decode is memory-bound and tolerant of weaker compute) and FFN on newer GPUs (because FFN during decode is compute-bound). In training:
- The forward + backward pass means each worker must persist parameters, gradients, and optimizer states for every layer of its component type that it owns, across the whole depth of the model.
- For the 235B Qwen3, an attention pool that is not split along depth would have to store gradients and Adam moments for all 94 attention layers (Table 2 of the paper). That blows past the 80 GB of an H800.
HeterMoE proves the basic AFD-for-training idea on small models (~4.3B) but OOMs at the scale that production MoE training cares about. DisagMoE has to solve this scaling problem.
3. The DisagMoE design
The DisagMoE architecture is best read as the answer to three questions, in order:
- Spatially, which GPUs hold which layers? (Section 3.1 — AF disaggregated placement.)
- Temporally, in what order do those GPUs do their work, and how do they overlap compute with communication? (Section 3.2 — AF-Pipe schedule.)
- Sizing, how many GPUs and how many NICs does each group get? (Section 3.3 — adaptive worker allocation guided by a roofline MILP.)
3.1 AF disaggregated placement with depth interleaving
Let the model have L transformer layers. Pick a pipeline depth p shared between the two component types. The attention-pool is split into p groups A₀, A₁, …, A_{p-1}, and the FFN-pool is split into p groups F₀, F₁, …, F_{p-1}. Layer assignment is interleaved by depth:
Group g of component c gets layers {g + kp | k = 0, 1, …, v-1}, where v = L/p is the number of virtual stages per group.
So with L = 8 and p = 2, attention group A₀ owns layers {0, 2, 4, 6} and A₁ owns layers {1, 3, 5, 7}, and the FFN groups follow the same pattern. The output embedding lives with whichever F-group owns layer L-1.
Why this layout and not single-layer-per-group? Because a group only has to store parameters for v layers of its component type, not all L. That is the trick that turns HeterMoE's "works at ~4.3B, OOMs beyond" into "fine at 235B": fewer parameters per group, more groups in the pipeline.
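The interleaved assignment rule is small enough to write out; the helper below just evaluates {g + kp | k = 0 … v-1} and the naming is mine, not the paper's.

```python
def interleaved_layers(group_id: int, p: int, L: int) -> list[int]:
    """Layers owned by one group under depth-interleaved placement.

    p is the pipeline depth (groups per component type); v = L // p is the
    number of virtual stages, i.e. layers of its own component type per group.
    """
    assert L % p == 0, "assumes L is divisible by p, as in the paper's examples"
    v = L // p
    return [group_id + k * p for k in range(v)]

# The L = 8, p = 2 example from the text:
print(interleaved_layers(0, p=2, L=8))   # [0, 2, 4, 6]  -> A0 (and F0)
print(interleaved_layers(1, p=2, L=8))   # [1, 3, 5, 7]  -> A1 (and F1)
```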
Inside each group:
- Attention groups (A-Workers). GPUs are replicated under data parallelism (DP). Every GPU in an A-group holds a full copy of the assigned attention layers, processes a slice of the micro-batch independently, and synchronizes gradients with all-reduce inside the group.
- FFN groups (F-Workers). GPUs shard the experts under expert parallelism (EP). All tokens in the micro-batch enter the group; routing inside the group spreads them across the GPUs that hold the selected experts.
The number of GPUs M in an A-group and N in an F-group are not required to be equal. This asymmetry is the lever the roofline model in section 3.3 will pull.
3.2 AF-Pipe: the schedule
Once attention and FFN are physically separated, you have a producer-consumer relationship between A-Workers and F-Workers, in both directions:
- Forward: A produces hidden states h, F consumes them and returns expert-mixed h′, A consumes h′ and produces the next layer's h, and so on.
- Backward: gradients flow the other way.
If you train one micro-batch at a time, half the system idles at any moment. The fix is the classical pipeline-parallel one: shove many micro-batches into flight, so the producer is always producing and the consumer is always consuming. AF-Pipe does this, with two important refinements over a vanilla Megatron 1F1B:
1. The MoE all-to-all becomes a first-class pipeline stage. In a Megatron 1F1B pipeline, dispatch/combine are baked into a layer's forward and have to be hidden inside whatever overlap the chunked GEMM allows. In AF-Pipe, they become explicit M2N/N2M stages between the A-Worker and F-Worker. The bubble analysis is cleaner because of this.
The paper makes the bubble comparison explicit. With PP pipeline stages, layers per stage L, v virtual stages per group, attention/FFN/A2A per-layer times T_a, T_f, T_a2a and P2P time T_p2p:
- Baseline 1F1B warm-up: B_base = ((PP-1)/v) · [L(T_a + T_f + 2 T_a2a) + T_p2p].
- AF-Pipe warm-up: B_AF = ((PP-1)/(2v)) · [max(T_a, T_f) + T_M2N].
When L = 1 (per-layer staging) and T_M2N ≈ T_a2a, AF-Pipe's bubble is roughly one-fourth of the baseline.
2. M2N replaces a P2P-then-all-to-all chain with one full-duplex transfer. In a Megatron interleaved pipeline, the inter-stage exchange is P2P (point-to-point activation transfer) and the intra-stage exchange is all-to-all (dispatch/combine). DisagMoE fuses them into a single many-to-many primitive — every A-Worker GPU sends to every F-Worker GPU, and vice versa — operated full-duplex so the inbound and outbound directions saturate the NIC independently. The paper claims this reduces overall communication cost by roughly 1/k and removes the redundant data movement step. (A conceptual sketch of this exchange pattern follows the implementation note below.)
3. Multi-stream asynchronous execution. Inside one worker, three CUDA streams run concurrently: a forward-compute stream, a backward-compute stream, and a send/receive communication stream. The schedule keeps all three saturated. The published Figure 8(c) shows an F-worker simultaneously sending the gradients for F⁶₁ (backward output), running forward on F⁶₆ (steady-state compute), and receiving backward output from the next A-worker A²₇. If the resource allocator in section 3.3 has done its job well, the three stages take roughly the same wall-clock time and the pipeline runs at peak.
The implementation is on PyTorch v2.6 over Megatron-LM, in about 6K lines of Python and 2K of C++. The M-to-N/N-to-M primitives ride on top of GPUDirect and GPUCopy, in the same spirit as MegaScale-Infer's StepMesh.
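The M2N primitive itself is not released, but the exchange pattern of item 2 can be approximated with batched non-blocking point-to-point calls in torch.distributed; the sketch below is a conceptual stand-in (one tensor per A-to-F peer, no GPUDirect or copy-engine tuning), with function and argument names of my own.

```python
import torch
import torch.distributed as dist

def m2n_exchange(shards_out, peer_ranks):
    """Every A-worker GPU sends one shard to every F-worker GPU and receives one back.

    shards_out[j] is the tensor this rank sends to peer_ranks[j]; the returned list
    holds the tensors received from the same peers. The F-worker side runs the
    mirror-image call with its own peer list. Assumes an initialized NCCL backend.
    """
    recv_bufs = [torch.empty_like(t) for t in shards_out]
    ops = []
    for shard, buf, dst in zip(shards_out, recv_bufs, peer_ranks):
        ops.append(dist.P2POp(dist.isend, shard, dst))
        ops.append(dist.P2POp(dist.irecv, buf, dst))
    # Posting all sends and receives together lets both directions of the NIC be
    # busy at once, which is the "full duplex" property the schedule relies on.
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_bufs
```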
3.3 Adaptive worker allocation with a roofline MILP
The disaggregation only pays off if the two pools are sized correctly: too many GPUs in attention means FFN starves on compute, and vice versa. The paper formalises sizing as a small Mixed-Integer Linear Program.
Let W be the total number of GPUs and M_tot the total number of NICs; M GPUs go to the A-pool, N = W - M to the F-pool, and the NICs split similarly into M_a and M_f = M_tot - M_a. Each side's iteration latency is the max of its compute and communication time:
- T_a = max(C_a / (M · P), V / (M_a · B_IB))
- T_f = max(C_f / (N · P), V / (M_f · B_IB))
where C_a, C_f are per-iteration FLOPs (the attention QKV/self-attention/output GEMMs and the FFN up/down-projection GroupGEMMs), P is per-GPU compute, B_IB is per-NIC bandwidth, and V is the inter-group transfer volume.
Since AF-Pipe couples the two sides as a producer-consumer, the iteration time is max(T_a, T_f) and any imbalance becomes a bubble. The optimisation is therefore two-stage:
- Phase 1. Minimise the bottleneck stage: find T* = min max(T_a, T_f) by sweeping feasible (M, M_a).
- Phase 2 — tie-break by MFU. Among configurations that achieve T*, pick the one that maximises MFU = (C_a + C_f) / (T* · W · P), which under fixed T* is equivalent to maximising I_attn + I_ffn.
The MILP returns a seed (M₀, M_{a,0}). A short profile-guided local search (Algorithm 1 in the paper) perturbs the seed within a radius r for K trials and keeps the best measured allocation. The whole procedure is a one-shot static solve — pretraining shapes are known in advance, so there is no need for online reallocation.
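The search space is small (two integer splits), so Phases 1 and 2 can be reproduced by brute force rather than a solver; the cost model below is exactly the two max() expressions above, and the constants in the example call are placeholders to be replaced with profiled per-iteration numbers.

```python
from itertools import product

def allocate(W, nics, C_a, C_f, V, P, B_ib):
    """Phase 1: minimise max(T_a, T_f) over (M, M_a); Phase 2: break ties by MFU."""
    best_key, best_cfg = None, None
    for M, M_a in product(range(1, W), range(1, nics)):
        N, M_f = W - M, nics - M_a
        T_a = max(C_a / (M * P), V / (M_a * B_ib))
        T_f = max(C_f / (N * P), V / (M_f * B_ib))
        T = max(T_a, T_f)
        mfu = (C_a + C_f) / (T * W * P)
        key = (T, -mfu)                       # smaller bottleneck time first, then larger MFU
        if best_key is None or key < best_key:
            best_key, best_cfg = key, dict(M=M, N=N, M_a=M_a, M_f=M_f, T=T, MFU=round(mfu, 3))
    return best_cfg

# Placeholder numbers, not the paper's: 64 GPUs, 64 NICs, attention with twice the FLOPs of FFN.
print(allocate(W=64, nics=64, C_a=2.0e18, C_f=1.0e18, V=1.0e13, P=1.0e15, B_ib=5.0e10))
# Algorithm 1 in the paper then perturbs the resulting (M, M_a) seed within a small
# radius and keeps the best configuration as measured on the real cluster.
```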
The roofline diagram (Figure 9) makes the intuition crisp. In the aggregated baseline, attention and FFN share the same network slope and the FFN side is communication-bound. After disaggregation, the A-pool gets fewer NICs and is pushed further to the right (toward the compute roof), while the F-pool gets more NICs (steeper slope) and moves up toward its own compute roof. The system reaches a higher overall point than either side alone could in the aggregated layout.
4. Experiments and what they actually show
4.1 Setup
- Hardware. 16 nodes, each with 8 H800 (80 GB) GPUs, 168 CPUs, 8× 400 GbE ConnectX-7 NICs. NVLink at 400 GB/s intra-node.
- Models. DeepSeek-MoE (28 layers, 64 experts, top-4, 2048 hidden / 1408 MoE hidden), GPT-OSS-120B (36 layers, 128 experts, top-4, 2880 / 2880), Qwen3-235B-A22B (94 layers, 128 experts, top-8, 4096 / 1536). The Qwen3 model is the most demanding in both depth and width.
- Baselines. Megatron-LM 1F1B-interleaved (3D parallelism), Tutel (operator-level overlap), Comet (fine-grained overlap library), DualPipe (DeepSeek-V4's bidirectional pipeline overlap).
- Sweep. Sequence lengths 4K, 8K, 16K, 32K. EP=16 across two nodes for all baselines, with pipeline/virtual-stage configurations tuned per baseline to avoid OOM.
The local (per-device) batch size is fixed across baselines so total batch scales linearly with GPU count.
4.2 End-to-end throughput
Figure 10 plots speedup against Megatron baseline across the three models and four sequence lengths. The patterns are robust:
- Vs Megatron-1F1B: 1.59–1.81×. The improvement is largest at long sequences, where the imbalance between attention and FFN is widest and the AFD payoff is highest.
- Vs Tutel and Comet: 1.2–1.5×. Operator-level overlap can only hide the all-to-all inside the FFN's own compute window, so the residual all-to-all sticks out.
- Vs DualPipe: 1.05–1.13×. DualPipe already hides a lot of the all-to-all through micro-batch pipelining, so the remaining win comes from the AFD sizing and the M2N stage fusion rather than from larger blocks of hidden communication. The authors note that DualPipe's warm-up and cool-down are still Megatron-like (large all-to-all on the critical path), and its steady state still carries the structural imbalance between attention compute and FFN compute-to-communication ratios — DisagMoE absorbs both.
4.3 Iteration-time decomposition
Figure 11 of the paper breaks one training step into three colours: overlapped compute (green) on the left, overlapped communication (purple) in the middle, and non-overlapped communication (orange) on the right. Reading the orange bar:
- Megatron-1F1B: very large orange bar (about a third of the step).
- Tutel: smaller, but DisagMoE removes about 88% of the residual.
- Comet: about 75% removed.
- DualPipe: about 45% removed.
This is the clearest single picture in the paper, because it isolates the contribution that disaggregation specifically buys you: it barely changes the green region (compute), but it shrinks the orange (non-overlapped communication) directly.
4.4 Resource-allocation ablation
Figure 12 varies the A:F GPU split with NICs fixed at 16:16. Three patterns to note:
- At 4K sequence, the optimal A:F ratio is roughly balanced (16:16) — attention and FFN compute FLOPs are close.
- At 16K sequence, the optimal A:F shifts to 16:10 — attention is doing more work and deserves more compute, FFN can run with fewer GPUs because its work has not grown as fast.
- Forcing very asymmetric splits (16:8 or 8:16) re-introduces stage imbalance and the speedup drops back to near 1×.
The take-away is that the MILP-plus-local-search allocator is not optional — sequence length actually shifts the optimum by a meaningful amount, and a hand-tuned guess will mis-size at the wrong workload.
4.5 Top-k and EP-size ablation
Figure 13 shows top-k swept over {2, 4, 8, 16} with total experts fixed at 64. As top-k grows, both communication volume V and FFN compute C_f grow linearly, so total step time grows. DisagMoE keeps a 1.08–1.92× speedup band across the sweep. The widest gap is at the most stressful configuration (top-k = 16, large EP), where the all-to-all bottleneck is worst for the baselines.
4.6 Virtual-stage ablation
Figure 14 sweeps the number of virtual stages v per group, at fixed p. Throughput grows roughly in proportion to v up to v = 16, where memory is exhausted (each group has to store parameters and activations for all 16 layers it owns). This is exactly the failure mode that HeterMoE hits at single-layer granularity — DisagMoE's depth interleaving is what lets you climb the v axis at all.
5. Limitations and the parts I would push on
The paper is honest about two real limitations and there are a couple more I would add from the engineering side.
5.1 Pretraining-only shape assumption
The MILP-plus-local-search routine assumes fixed sequence length, micro-batch size, and global batch shape. That is true for pretraining: you set up the schedule, run for weeks, and never change shape. It is not true for RL fine-tuning (e.g., GRPO, DAPO-style rollouts) where prompt and response lengths vary per sample, nor for SFT runs that pack mixed-length sequences. Bringing DisagMoE to those workloads would require online reallocation of (M, M_a). The paper flags this and leaves it as future work.
5.2 Symmetric pipeline depth
Currently p is shared between attention and FFN groups. That keeps the math clean but throws away an obvious tuning knob: if FFN is the bottleneck, you could imagine p_F > p_A (more FFN groups, each with fewer layers per group) so that the bottleneck side gets more concurrent stages. The paper marks asymmetric (p_A, p_F) as a promising extension whose scheduling complexity has not been worked out.
5.3 Things I would have liked to see
- Loss-curve evidence. All the reported numbers are throughput. The paper does not show a side-by-side training-loss curve to convince me that the disaggregation does not subtly alter gradients (it shouldn't — gradients are mathematically the same — but I would still want a final-loss table). For a system going into production this is something I would ask the authors for.
- Failure-recovery behaviour. With p groups of attention and p groups of FFN running concurrently, a single straggler GPU on the slower side will pipeline-block everyone. The paper does not benchmark how DisagMoE behaves under straggler injection, and at multi-hundred-billion-parameter scale you will see stragglers.
- Network topology sensitivity. The evaluation is on a uniform 8× 400 GbE per-node cluster. Real production clusters have rail-aligned topology, sometimes with a 2:1 over-subscription at the spine. The M2N primitive presumably interacts non-trivially with rail alignment; the paper does not characterise that.
5.4 What carries over to other models?
The two key ratios that drive whether DisagMoE will help you are:
- I_attn / I_ffn — how lopsided is your model's per-layer compute-to-comm intensity? Long sequences amplify this gap; MoE with low top-k amplifies it further.
- T_a2a / T_compute — how big is your all-to-all relative to your compute? Cross-node EP amplifies this. Single-node training does not need DisagMoE — operator-level overlap suffices.
So DisagMoE's sweet spot is large MoE, cross-node EP, long sequence, which is exactly where production MoE training lives in 2026.
6. How DisagMoE fits in the lineage
Reading this paper in 2026, it sits inside an unmistakable family tree of "disaggregation in LLM systems":
- Inference disaggregation: DistServe (2024, prefill/decode), MegaScale-Infer / StepFun / StepMesh (attention/FFN for decode), Mooncake (KVCache disaggregation).
- Training overlap (without disaggregation): Tutel, FasterMoE, Comet, Lancet, DualPipe (DeepSeek-V4's bidirectional pipeline).
- Training disaggregation predecessors: HeterMoE (single-layer AFD for training, OOMs at scale).
- DisagMoE (this paper): depth-interleaved AFD for training, scales to 235B+ on 8–16 H800 nodes.
The intellectual move that makes this paper interesting is that DisagMoE takes the inference-style disaggregation seriously enough to confront the training-specific obstacles (optimizer states blowing up memory, gradients flowing backward, pipeline bubbles in both warm-up and cool-down), and solves them with a schedule (AF-Pipe) and a sizing model (the roofline MILP) rather than with new hardware.
7. Pseudocode and a worked numerical example
A reader who wants to implement DisagMoE on top of Megatron-LM should think of the loop in three nested layers. The pseudocode below is at the conceptual level — the production version handles tensor parallelism inside each worker, gradient accumulation, optimizer-state sharding, and so on, which the paper inherits from Megatron-LM unchanged.
Below is a minimal sketch of that top-level loop. The worker-group handles and function names (training_step, pipeline_schedule, worker.launch) are mine, chosen so the schedule description that follows reads naturally; they are not the paper's API.
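```python
# Top-level training step (simplified); conceptual pseudocode, worker internals elided.

def training_step(micro_batches, a_groups, f_groups, schedule):
    """One DisagMoE step: AF-Pipe drives both pools through all micro-batches."""
    # Spatial placement (done once at startup, repeated here for orientation):
    #   a_groups[g] holds attention layers {g, g+p, g+2p, ...} under data parallelism,
    #   f_groups[g] holds the matching MoE/FFN layers under expert parallelism.
    pipeline_schedule(micro_batches, a_groups, f_groups, schedule)
    # After the pipeline drains: A-groups all-reduce attention gradients, F-groups
    # reduce expert gradients inside their EP group, then both pools step the optimizer.
    for group in a_groups + f_groups:
        group.reduce_gradients()
        group.optimizer_step()

def pipeline_schedule(micro_batches, a_groups, f_groups, schedule):
    """Warm-up, steady-state interleaving, cool-down (described in the text below)."""
    for tick in schedule:
        for worker, action, mb in tick:       # action in {FWD, BWD, M2N_SEND, N2M_RECV}
            worker.launch(action, mb)         # each action lands on its own CUDA stream
        # No global barrier: streams synchronise only through M2N/N2M dependencies.
```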
Inside pipeline_schedule, the AF-Pipe warm-up dispatches p × v micro-batches in forward order, then transitions to steady state where every tick at every worker advances one (forward compute) + one (backward compute) + one (communication) in parallel. M2N transfers happen full-duplex on the NICs.
Worked example: how big is the speedup, really?
Take Qwen3-235B at s = 16K, on 8 nodes (64 H800 GPUs total). From the paper's profiling:
- Megatron-1F1B step time on this configuration (normalized to 1.0 unit), roughly:
- Attention compute: 0.42
- FFN compute: 0.18
- Non-overlapped all-to-all + P2P: 0.40
- Total step ≈ 1.00.
DisagMoE at the same configuration:
- A:F GPU split of 16:10 (from Figure 12).
- T_a and T_f roughly balanced at max(0.45, 0.43) = 0.45.
- M2N transfer fully hidden, residual orange bar collapses to roughly 0.07.
- Total step ≈ 1.00 / 1.81 ≈ 0.55.
That is the speedup the headline number is summarising. The lesson for an implementer is that most of the win comes from collapsing the orange (non-overlapped communication) bar, not from speeding up compute. Attention and FFN compute remain unchanged.
8. Notation cheat sheet
Variables that recur in the paper:
| Symbol | Meaning |
|---|---|
| L | Total transformer layers |
| E | Total experts in an MoE layer |
| k | Top-k experts each token is routed to |
| H | Hidden size |
| De | FFN hidden size (per expert) |
| s | Input sequence length |
| b | Micro-batch size |
| W | Total GPU world size |
| m, M | Attention-pool nodes / GPUs |
| n, N | FFN-pool nodes / GPUs |
| EP | Expert-parallel size |
| P | Per-GPU compute density (FLOPs/s) |
| B_NV, B_IB | NVLink and InfiniBand bandwidth (B/s) |
| p | Pipeline groups per component type (depth) |
| v | Virtual stages per group |
| I_attn, I_ffn | Arithmetic intensities (FLOPs / byte) |
| Î | System turning point = peak FLOPs / peak bandwidth |
| T_a, T_f | Attention / FFN per-stage time |
| T_a2a | All-to-all (dispatch + combine) time per layer |
| T_M2N | Many-to-many inter-group transfer time |
| T_p2p | Point-to-point inter-stage transfer time |
| B_base, B_AF | Warm-up bubbles for Megatron-1F1B and AF-Pipe |
9. Reproducibility notes
- Code. The paper does not yet link an open-source release. The text says "implemented in ~6K lines of Python and ~2K of C++ on top of Megatron-LM with PyTorch v2.6", which means a reasonable Megatron contributor could re-implement the placement and the M2N primitive in a few weeks, but the exact scheduling and the MILP allocator would be the time-consuming pieces.
- Hardware. The reported numbers require ConnectX-7 8×400 GbE NICs and 8×H800 nodes. With slower NICs (200 G or single-NIC nodes) the M2N stage becomes the bottleneck and the speedups are likely smaller.
- Models. The three evaluation models (DeepSeek-MoE, GPT-OSS-120B, Qwen3-235B) are public; the training recipes are not, but training-step timing is what is being benchmarked, so any reasonable MoE checkpoint of similar size should reproduce the throughput ratios.
- Roofline MILP. A direct port of Algorithm 1 of the paper is trivial: any open-source MILP solver (e.g., PuLP, OR-Tools) handles the integer split, and the local-search step needs only profiled per-stage times.
10. My closing read
I find DisagMoE convincing for the same reason I found DistServe convincing two years ago: it identifies a structural asymmetry inside one widely deployed primitive (MoE training with cross-node EP), points out that the asymmetry is wasting hardware budget, and offers a sizing-plus-scheduling solution rather than a "let's invent a new architecture" solution. The roofline argument in section 3 is the load-bearing intellectual contribution; the AF-Pipe schedule and the M2N primitive are the engineering scaffolding that makes the argument deliver in practice.
If I were a production MoE training team in 2026, I would treat this paper as a strong nudge to do three concrete things this quarter:
- Profile the per-layer compute-to-communication ratio of your current MoE training run at your real sequence length. If I_attn and I_ffn sit on opposite sides of your network's Î, the disaggregation case is open-and-shut.
- Try a small-scale AF-Pipe prototype before committing to a full port. Even just running A and F as two separate worker groups in PyTorch — without the M2N primitive — should already show whether the schedule fits your model.
- Plan for the asymmetric-p extension and the dynamic-shape extension early. Anyone going to RL post-training (everyone, in 2026) will hit limitation 5.1 quickly, and that is where the next paper in this line will be written.
What I would not read this paper as is: a general-purpose replacement for Megatron. It is a specialised training system for a specific (and very common) regime — large MoE, multi-node EP, long sequence. Inside that regime it is, on this evidence, the best published technique today.
This review reflects my reading of the v1 preprint on 2026-05-10 and the analyses I ran against the prerequisites and figures listed above. Any quantitative errors or misreadings are mine and not the authors'.