1. Mixture of Experts (MoE)

  • Each token routes to a small subset of expert networks via learned top-k routing (a minimal gating sketch follows this list)
  • Only the routed experts are computed; others are silent
  • Enables scaling to trillion-parameter models without proportional compute growth

2. Expert Parallelism (EP)

  • Distribute experts across devices: each device holds a disjoint subset
  • Token load per device depends on router decisions, which vary per micro-batch
  • Uneven load = some GPUs idle while waiting for others

3. NVLink 4.0 & the Hopper Copy Engine

  • NVLink 4.0: 900 GB/s bidirectional bandwidth between intra-node GPUs
  • Hopper GPU: includes a dedicated Copy Engine (DMA-like unit)
  • Copy Engine operates independently from SMs—zero compute interference
  • Key capability: can move data while compute kernels run

4. Pipeline Parallelism (PP) & Synchronous Training

  • Multiple pipeline stages, all synchronized by slowest device
  • Imbalance directly impacts wall-clock time

5. Grouped GEMM

  • Specialized batched matrix multiply for MoE dispatch
  • Sensitive to batch size distribution; splitting tokens can degrade perf

The Problem: Load Imbalance in MoE

Root Cause

In expert parallelism, the router is learned end-to-end without constraints. It assigns tokens based on learned affinities to experts, not fairness. Result: per-device token counts vary randomly across micro-batches, even during stable training. This variation is data-dependent—different training corpora produce different routing distributions.

Why This Matters

Figure 1(b) from the paper shows the waste:

  • Token straggler: The slowest device gets more tokens than average.
  • GEMM straggler: Wall-clock time difference between slowest and average device.
  • Quantified waste: 18.6% of GPU time per MoE layer is lost to synchronization overhead.

At scale (128 experts, up to 16 H100s), this is enormous—hours of wasted compute per day.

Why Prior Work Fails

Three main approaches tried to fix this:

  1. Coarse-grained mitigation (auxiliary losses)

    • Force the router to produce balanced assignments
    • Constraints degrade model quality and expressiveness
    • Doesn't fully eliminate imbalance
  2. Dynamic scheduling with overhead (FasterMoE, Tutel, SmartMoE)

    • FasterMoE: replicate "hot" experts (shadow experts) and pipeline dispatch
    • Problem: splitting communication into pipelined stages adds total transfer volume; it does not merely hide latency
    • When routing shifts unpredictably, its load prediction degrades
    • Tutel: switches between parallel modes, but partitions weights → extra communication
  3. SM-based communication overlap (Triton Distributed)

    • Fuse computation and communication kernels
    • Problem: kernels consume SM resources, reducing available compute
    • Reduces efficiency instead of improving it

The deeper issue: Specialized MoE backends (DeepEP, FUSCO) do bulk transfers without staged delivery. You can't split their communication into fine-grained pipelined stages without paying an extra volume penalty.


Design Principle: Orthogonal Parallelism

FEPLB's central insight is resource-level separation:

  • EP & PP use: RDMA NICs (inter-node) + GPU SMs (compute)
  • FEPLB uses: NVLink Copy Engine (intra-node, no SMs) + CPU scheduler

Because these resource sets don't overlap, FEPLB is a true new parallel dimension, not a scheduling trick. It coexists with existing parallelism without reconfiguration.

Two-Phase Dispatch

Phase 1 (EP routing):

  • Tokens route to their assigned devices via standard EP backend (e.g., DeepEP)
  • Static experts process normally on assigned devices
  • Dynamic expert tokens are collected into the NVLink domain (local node) for rebalancing
  • This phase uses the standard EP communication path (inter-node RDMA plus intra-node NVLink; the ~50 GB/s figure applies to the inter-node links)

Phase 2 (NVLink CE rebalancing):

  • CPU load balancer (running on dedicated thread) analyzes actual token distribution
  • Decides which dynamic expert weights to copy and where
  • NVLink Copy Engine redistributes both tokens and expert weights intra-node only
  • Operates at 900 GB/s, completely SM-free (no compute interference)
  • Happens concurrently with static expert computation on SMs

The Timeline Trick

Static experts serve dual purpose:

  1. Contribute to model output
  2. Provide a time window (their computation) during which CPU scheduler and weight copying finish

This window is usually sufficient—the CPU load balancer runs in ~50 µs on a single core, so it is easily hidden.

Load Balancing Algorithm

Greedy, per-micro-batch (a code sketch follows this list):

  • Repeatedly select busiest dynamic expert on most overloaded device
  • Copy entire expert weights (not token-level splitting) to most underloaded device
  • Threshold τ prevents copying experts with too few tokens
  • Migrating whole experts preserves Grouped GEMM efficiency (batch size sensitivity)
  • Fully deterministic: same routing → same copy plan across all devices (no coordination needed)

Memory Overhead

Minimal:

  • Allocates a buffer sized for max_num_dyn expert weights per device
  • For GLM-5 (72 MiB per expert, dyn=8): 576 MiB per device
  • ~0.7% of 80 GB HBM3—negligible

Key Results: From 18.6% Waste to 51-70% Improvement

Experimental Setup

  • Hardware: Up to 16 NVIDIA H100 SXM5 GPUs (80 GB HBM3), NVLink 4.0 (900 GB/s)
  • Model: Reduced GLM-5 (18 layers, preserving MoE architecture: 128 routed experts, top-k routing, no auxiliary loss)
  • Configurations: Three PP/EP settings:
    • PP=4, EP=2 (8 GPUs, 64 experts per device)
    • PP=4, EP=4 (16 GPUs, 32 experts per device)
    • PP=2, EP=8 (16 GPUs, 16 experts per device)

Performance Metrics

Two stragglers directly measure load imbalance:

  1. Token straggler: max(tokens_per_device) - mean(tokens)
    • Measures excess token count on slowest device
  2. GEMM straggler: max(GEMM_time) - mean(GEMM_time)
    • Measures wall-clock wasted time in compute

Per-Layer Execution Time (Table 2)

PP/EP (fwd/bwd)   Before LB   FasterMoE   Triton Dist.   Tutel      FEPLB
4/2               8.2/14.9    7.9/14.0    13.1/22.8      8.0/17.1   7.9/14.4
4/4               7.3/13.2    6.9/12.2    15.3/24.0      7.2/15.2   6.8/12.1
2/8               6.9/12.5    6.3/11.1    22.8/30.0      6.8/14.5   6.0/10.6

Key observations:

  • Triton Distributed is 1.6-3.3× slower (fused kernels consume SMs)
  • Tutel adds 15-16% backward overhead (weight partitioning)
  • FEPLB consistently matches or outperforms all baselines
  • At high EP (2/8), FEPLB achieves strongest speedup

Load Balance Quality (Figures 5 & 6)

Token straggler reduction:

  • PP=4, EP=2: 51% reduction (FEPLB) vs 55% (FasterMoE)
  • PP=4, EP=4: 63% reduction (FEPLB) vs 47% (FasterMoE)
  • PP=2, EP=8: 70% reduction (FEPLB) vs 39% (FasterMoE)

GEMM straggler reduction:

  • PP=4, EP=2: 50% (FEPLB)
  • PP=4, EP=4: 62% (FEPLB)
  • PP=2, EP=8: 68% (FEPLB)

Critical insight: FasterMoE's advantage decreases with EP degree because prediction accuracy degrades under sparse, unpredictable routing. FEPLB's reactive approach improves—at EP=8, achieves 2× lower token straggler than FasterMoE.

Orthogonality Verification: EP Communication Overhead (Figure 4)

Does FEPLB interfere with EP communication?

  • Before LB: baseline (100%)
  • FEPLB: <1% overhead
  • FasterMoE pipe=1: ~0% (matches baseline)
  • FasterMoE pipe=2: +46.8% dispatch, +40.2% combine (breaks orthogonality!)

This validates the paper's claim: staged pipelining on DeepEP adds volume. FEPLB avoids this by operating on a separate hardware path.

Sensitivity to Dynamic Expert Count (Figure 6)

Parameter dyn controls how many experts per device are eligible for rebalancing.

  • dyn=2: substantial reduction already
  • dyn=2→4: +1-3 pp improvement
  • dyn=4→8: +1-3 pp more (diminishing returns)
  • Practical default: dyn=4

Technical Deep Dive: Why This Works

1. Hardware Resource Separation

FEPLB's elegance is achieving orthogonality by construction:

Dimension   Scope        Communication   Compute
EP          Inter-node   RDMA/NVLink     GPU SMs
PP          Inter-node   RDMA/NVLink     GPU SMs
FEPLB       Intra-node   NVLink CE       CPU

No overlap = no interference. FEPLB doesn't compete with EP for NICs or with PP/EP for compute.

2. Why Whole-Expert Migration?

You might ask: why copy entire experts instead of moving individual tokens?

Answer: Grouped GEMM is sensitive to per-expert batch size. Under the roofline model:

  • Small batch → memory-bound, lower flops/cycle
  • Splitting tokens into smaller batches degrades efficiency
  • Migrating whole experts preserves batch size → maintains high efficiency

Trade-off: coarser granularity at low EP (e.g., 64 experts per device), but still wins overall.

3. Deterministic Load Balancing

The greedy algorithm is run independently on each device:

  • Input: routing decisions (same on all devices)
  • Output: same weight copy plan (no inter-device coordination)

This is crucial—avoids synchronization barriers during the critical micro-batch path.

4. Scaling with EP Degree

FasterMoE assumes stable, predictable routing. But at high EP:

  • Each device sees sparser token distribution
  • Router behavior becomes less predictable per device
  • Prediction-based approach degrades

FEPLB is reactive, not predictive:

  • Observes actual token distribution each micro-batch
  • Adapts immediately
  • No prediction, no degradation
  • Improves as EP degree grows

MoE Frameworks

  • GShard, Switch Transformer: First large-scale MoE models; used auxiliary losses
  • DeepSeek-V3, GLM-5: Trillion-parameter MoE systems; no auxiliary loss, live with imbalance
  • Megatron-LM, DeepSpeed-MoE: Industry infrastructure

Dynamic Scheduling

  1. Tutel: Adaptive EP/DP switching

    • Pro: simple
    • Con: partitions weights, adds communication
  2. FasterMoE: Shadow experts + pipelined dispatch

    • Pro: good at low EP
    • Con: degrades at high EP, prediction-based
  3. SmartMoE: Pre-computed strategy menu

    • Pro: offline planning
    • Con: not reactive, less flexible
  4. Triton Distributed: TP-parallel MoE with fused kernels

    • Pro: explores communication-compute overlap
    • Con: consumes SMs, reduces efficiency

Communication Libraries

  • DeepEP, FUSCO: Specialized bulk transfer
    • Design assumption: communication is not split into stages (staging adds volume rather than reducing latency)
    • FEPLB is compatible with both

Deployment Guide: How to Use FEPLB

1. Hardware Requirements

  • NVIDIA Hopper GPUs (H100, H200) with NVLink 4.0
  • Intra-node connectivity (all-to-all NVLink)
  • Supporting frameworks: Megatron-LM with DeepEP dispatch

2. Configuration

Essential parameters:

dyn = 4          # default: 4 dynamic experts per device
τ = 0            # minimum token threshold (0 = no threshold)
max_num_dyn = 8  # max dynamic experts per device

Set these once; they're stable across runs.

3. Integration Points

  • Router predictor: Periodically optimize expert-to-device assignment (at checkpoints)
  • Per-micro-batch dispatch:
    • Phase 1: Standard EP routing (unmodified)
    • Phase 2: CPU scheduler analyzes tokens, issues NVLink CE copy commands (sketched after this list)
    • Combine: Return results to source devices

4. Implementation Checklist

  • [ ] Baseline: measure synchronization waste on your model (the paper reports 18.6% per MoE layer)
  • [ ] Add CPU load balancer thread
  • [ ] Integrate NVLink CE copy commands (CUDA streams)
  • [ ] Tune dyn parameter for your EP degree
  • [ ] Verify EP communication overhead <1%
  • [ ] Validate load balance improvement (50%+ reduction expected)

5. No Auxiliary Loss Required

Major advantage: FEPLB works without auxiliary balancing losses. This means:

  • Better model quality (router isn't constrained)
  • Simpler training code
  • Works with any existing MoE architecture

Open Questions & Limitations

1. Coarse Granularity at Low EP

  • Whole-expert migration limits rebalancing flexibility when each device has many experts
  • At EP=2 (64 experts per device): improvement is smaller (~51% vs ~70% at EP=8)
  • Mitigated by tuning dyn, but fundamental trade-off remains

2. Intra-Node Scope

  • Current implementation: intra-node rebalancing only
  • Future: With all-to-all NVLink (e.g., NVIDIA SuperPod GB200 NVL72), Phase 2 could rebalance across 72 GPUs
  • Question: How much further can improvements go?

3. Router Predictor Adaptation

  • Paper mentions a "Router Predictor" for periodic expert-to-device assignment
  • Limited details on how this adapts to changing routing patterns
  • How often does it need to run? What's the overhead?

4. Interaction with Other Optimizations

  • How does FEPLB interact with speculative decoding, KV cache optimizations, or other recent MoE improvements?
  • Can improvements stack?

5. Generalization Beyond Top-k Routing

  • Paper focuses on top-k routing with learned router
  • What about other routing schemes (random, expert choice, etc.)?
  • Intuition: FEPLB should work, but not explicitly validated

6. Cross-Node Imbalance

  • FEPLB addresses intra-node imbalance only
  • What about imbalance across nodes (EP domain)?
  • Is this a separate problem requiring a different approach?

Why This Matters

For LLM Training Teams

  1. Immediate 50-70% straggler reduction without model changes
  2. No auxiliary losses → better model quality
  3. Drop-in replacement for existing EP infrastructure
  4. Scales with EP degree → better at larger clusters

For Chip Designers

  • Copy Engine on Hopper was previously underutilized for MoE workloads
  • This paper shows one valuable use case
  • Suggests future hardware optimizations specifically for dynamic compute patterns

For Systems Researchers

  • Demonstrates that resource-level separation is a powerful design principle
  • Shows how to exploit idle hardware resources
  • Provides a blueprint for future orthogonal optimizations

Reproducibility Notes

Code Availability

  • Paper doesn't mention open-source release (yet)
  • Implementation details sufficient for reproduction:
    • CPU load balancer: greedy algorithm, ~50 µs per micro-batch
    • NVLink CE commands: standard CUDA stream API
    • Integration: on top of Megatron-LM + DeepEP

Experiment Reproducibility

  • Hardware clearly specified (H100 SXM5 + NVLink 4.0)
  • Model: reduced GLM-5 (18 layers)
  • Metrics: token straggler, GEMM straggler (easy to implement)
  • Baselines: FasterMoE, Triton Distributed, Tutel (all publicly available or reimplemented fairly)

Concern: Paper uses a reduced-layer GLM-5 for efficiency. Results on full-scale 78-layer model would be valuable.


Final Thoughts

FEPLB is a rare systems paper: it identifies a real problem (18.6% waste), explains why prior work fails, and proposes an elegant solution that exploits overlooked hardware.

The key insight—use NVLink Copy Engine as a new parallel dimension—is simple but consequential. By maintaining orthogonality through resource-level separation, FEPLB coexists peacefully with existing parallelism and provides immediate, measurable gains.

The results are compelling:

  • 51-70% token straggler reduction
  • 50-68% GEMM straggler reduction
  • Zero EP communication overhead
  • Works at scale (16 H100s)
  • No model changes required
  • No auxiliary losses needed

For any team training trillion-parameter MoE models on Hopper-class hardware, FEPLB is a must-implement optimization. It's the kind of work that becomes industry standard quickly.

Publication timing is notable: April 2026, during the era of DeepSeek-V3 and GLM-5 scaling. MoE load balancing is an active area, and this solution is timely and impactful.

Implementation Details & Engineering Insights

1. CPU Load Balancer Implementation

The CPU scheduler is surprisingly simple yet effective:

for each micro-batch:
    1. Wait for routing decisions from all GPUs
    2. For each device d:
           - Calculate per-expert token counts
           - Identify dynamic experts eligible for copy
    3. Greedy selection loop:
           while devices_unbalanced():
               - Find busiest dynamic expert on most-loaded device
               - Find least-loaded device
               - If token_count(expert) > threshold τ:
                     - Issue NVLink CE copy command
                     - Update bookkeeping
    4. Wait for all copies to complete
    5. Signal GPU to begin dynamic expert computation

Runtime: ~50 µs on a single CPU core—completely hidden in the static expert computation window (~2-5 ms).

2. Memory Management Strategy

Pre-allocation + reuse pattern:

Per-device buffer allocation:
    weight_buffer = [max_num_dyn experts] × [expert_size]
    For GLM-5: 8 × 72 MiB = 576 MiB

Reuse across all MoE layers (no per-layer allocation)
Freed only at training end

This is key—avoids allocation/deallocation overhead per layer.

3. Determinism & Reproducibility

One subtle strength: FEPLB's greedy algorithm is fully deterministic.

Given identical routing decisions, every GPU independently derives the same weight copy plan. No consensus protocol, no synchronization barriers—just local computation. This means:

  • Same training run produces identical results every time
  • Easy debugging
  • Reproducible experiments across sites

Compare to stochastic load balancing (random expert selection)—much harder to debug.

4. Integration with Megatron-LM

Minimum changes required:

  • Add CPU thread in training loop
  • Hook into DeepEP's dispatch path (Phase 1 unmodified)
  • Issue NVLink CE commands on GPU after routing
  • Synchronize before dynamic expert kernel launch

No changes to:

  • Router network
  • Loss computation
  • Backward pass
  • Gradient aggregation

Performance Analysis: Why FEPLB Wins at Scale

Why FEPLB Scales Better with EP

FasterMoE's prediction problem (detailed):

At low EP (e.g., EP=2):

  • Each device sees 64 experts
  • Routing is relatively stable per-device across micro-batches
  • Historical routing statistics are predictive
  • FasterMoE shadow expert replication works well

At high EP (e.g., EP=8):

  • Each device sees only 16 experts
  • Token assignment per expert becomes sparser and noisier
  • Historical statistics become less predictive
  • Prediction-based expert selection fails

FEPLB's reactive advantage:

  • Doesn't predict—observes actual distribution each micro-batch
  • Adapts immediately to changing patterns
  • Gets better at high EP because there is more expert diversity per node
  • At EP=8: 2× advantage over FasterMoE

This is a fundamental insight: for sparse, unpredictable distributions, reactive algorithms win.

Communication Efficiency Deep Dive

Why does FEPLB's <1% EP overhead matter so much?

Typical MoE layer on 16 GPUs:

  • Static expert dispatch: 100-200 µs
  • Dynamic token redistribution (FEPLB Phase 2): 50-100 µs (hidden in Phase 1)
  • Combine phase: 50-100 µs

Total: ~300 µs of communication per layer, with FEPLB's Phase 2 hidden behind Phase 1 rather than added to the critical path. Compare to:

  • FasterMoE pipelined dispatch: +46.8% on dispatch ≈ an additional 50-100 µs

In a model with 50 MoE layers, FEPLB saves:

  • 50 × 50 µs = 2.5 ms per forward pass
  • Per training run (1M steps): 2.5 ms × 1M = 2,500 seconds = 42 minutes saved

Failure Cases & When FEPLB Doesn't Help

1. When Load Imbalance Isn't the Bottleneck

FEPLB is orthogonal to other bottlenecks:

  • Memory-bound compute: If Grouped GEMM is already memory-bound, load imbalance doesn't matter much
  • Communication-bound training: If inter-node EP communication dominates, intra-node FEPLB doesn't help
  • Pipeline bubble: If pipeline imbalance is worse than MoE routing imbalance, FEPLB is secondary

2. Low EP Degrees

At EP=2 (8 GPUs, 64 experts/device):

  • FEPLB still wins but advantage is smaller (~51% vs ~55%)
  • Coarser granularity: fewer experts to migrate
  • Other optimizations might be better

3. Perfect Routing (Hypothetical)

If you had a perfect router (all devices get the same token count):

  • FEPLB reduces to no-op (zero stragglers to fix)
  • But in practice, routing is never perfectly balanced due to:
    • Learning dynamics (router isn't optimized for balance)
    • Data diversity (different batches → different routing patterns)
    • Numerical stability (no perfect tie-breaking in routing)

4. Non-Hopper Hardware

  • Older GPUs lack efficient Copy Engine
  • Newer but different-architecture GPUs may have different bottlenecks
  • Results probably don't transfer directly

Cross-System Comparison Table

Aspect                      FasterMoE                   Tutel                Triton Dist.         FEPLB
Token straggler @ EP=8      4,036                       4,356                N/A                  2,021
GEMM straggler @ EP=8       0.625 ms                    0.743 ms             1.4+ ms              0.352 ms
EP overhead                 <1%                         <1%                  Low                  <1%
Auxiliary loss needed       No                          No                   No                   No
Per-layer complexity        Shadow expert replication   Adaptive switching   Fused kernels        NVLink CE copy
Scales with EP              Degrades                    Degrades             Increases overhead   Improves
Implementation difficulty   Medium                      Medium               High                 Low
SM consumption              None                        None                 High                 None

Interesting Edge Cases

Case 1: Router Collapse

What happens if the router learns to send all tokens to one expert?

FEPLB handles gracefully:

  • Greedy algorithm detects this expert as busiest
  • Copies its weights to all other devices (up to dyn limit)
  • Routes remaining tokens across all devices
  • Immediate load balancing without retraining

Compare to FasterMoE: prediction-based approach would be completely off.

Case 2: Data-Dependent Routing Changes

Training data changes → routing pattern changes.

Example: Switch from English to Chinese mid-training (e.g., in a multilingual model).

  • Token affinities shift per expert
  • Expert load distribution changes
  • FasterMoE's historical statistics become stale
  • FEPLB re-adapts within one micro-batch

This is realistic in continuous-training scenarios (e.g., online model updates).

Case 3: Micro-Batch Size Variation

What if batch size changes?

FEPLB remains effective:

  • Larger batch = more tokens per expert
  • Small imbalances remain
  • FEPLB rebalances at new scale

Token straggler reduction likely stays similar because imbalance is typically data-dependent, not scale-dependent.


Future Research Directions

1. Token-Level vs. Expert-Level Migration

Current: whole-expert migration for Grouped GEMM efficiency.

Future: hybrid approach—split tokens for a few high-frequency experts, keep others whole. Requires:

  • Grouped GEMM variant that handles variable batch sizes
  • More sophisticated load balancer algorithm
  • Potential for even finer-grained balancing

2. Cross-Node Balancing

Current scope: intra-node (limited by NVLink topology).

When NVIDIA releases all-to-all NVLink hardware (e.g., 72-GPU SuperPod):

  • Phase 2 could rebalance across entire cluster
  • Remove the constraint on bounding box
  • Potentially reach near-perfect balance

Question: Would this improve beyond 70% GEMM straggler reduction?

3. Interaction with TP-MoE

Tensor parallelism within MoE experts (combination of TP + EP).

FEPLB currently focuses on EP. How to optimize TP-MoE load balance?

  • May require cooperative scheduling between TP and EP dimensions
  • Interesting systems research question

4. Learned Load Balancer

Rather than greedy algorithm, could train a small neural network to predict optimal copy assignments.

Trade-offs:

  • More complex
  • Potentially better decisions in complex regimes
  • But loses determinism, reproducibility

The Broader Trend: Idle Hardware

This paper highlights an important trend: specialized hardware (Copy Engine) goes unused in general-purpose frameworks.

Similar examples:

  • Tensor cores (before cuBLAS optimization)
  • Async memory copy engines
  • Hardware accelerators for collective operations

FEPLB is one of many papers that will likely unlock previously-idle hardware for better performance. This suggests:

  1. For practitioners: Check your favorite hardware specs—there might be untapped resources
  2. For chip designers: Document hidden capabilities; they may unlock surprising optimizations
  3. For systems researchers: Hardware-software codesign is increasingly valuable

Reproducibility Roadmap

If you want to implement FEPLB:

Phase 0: Setup (1 day)

  • Obtain 4+ Hopper GPUs with NVLink
  • Install Megatron-LM with DeepEP
  • Establish baseline metrics on your model

Phase 1: Static Framework (1 week)

  • Implement CPU load balancer
  • Add NVLink CE copy commands
  • Verify EP communication overhead <1%

Phase 2: Dynamic Tuning (1 week)

  • Tune dyn parameter for your EP configuration
  • Measure token/GEMM straggler reduction
  • Validate no accuracy degradation

Phase 3: Production Deployment (2 weeks)

  • Integrate Router Predictor for periodic rebalancing
  • Run long-duration training (100K+ steps)
  • Compare wall-clock time vs. baseline

Total effort: ~4-5 weeks for experienced systems engineers.

