
Technical Review: How Pipeline Parallelism Makes Training Giant Neural Networks Possible

Author: Zhongzhu Zhou
Date: 2026-04-02
Paper Title: GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Original Authors: Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen
Affiliation: Google Brain
Published at: NeurIPS 2019
ArXiv ID: 1811.06965


Executive Summary & Core Contributions

In 2019, the Google Brain team published GPipe, addressing a critical engineering problem in deep learning: when a neural network is too large to fit on a single GPU/TPU, how can it be trained efficiently across multiple accelerators?

Before GPipe, training a large model across devices meant writing architecture-specific distributed code for every network, a process that was both time-consuming and error-prone. GPipe proposes a general-purpose pipeline parallelism scheme that works for any neural network expressible as a sequence of layers, fundamentally lowering the engineering barrier to training giant models.

GPipe's core contributions include:

  1. Micro-batch pipeline parallelism: each mini-batch is split into smaller micro-batches that flow through multiple accelerators as a pipeline, yielding near-linear speedup.
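The effect of micro-batching on pipeline utilization can be made concrete with the standard bubble-fraction formula (a back-of-the-envelope model with K stages and M micro-batches, not the paper's exact accounting):

```python
def bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Fraction of idle time in a K-stage pipeline fed M micro-batches.

    A forward pass occupies M + K - 1 time steps, of which K - 1 per
    stage are fill/drain "bubbles" where the stage sits idle.
    """
    k, m = num_stages, num_micro_batches
    return (k - 1) / (m + k - 1)

# Without micro-batching (M = 1), a 4-stage pipeline is idle 75% of the time.
print(bubble_fraction(4, 1))   # 0.75
# With M = 32 micro-batches the bubble shrinks below 9%.
print(bubble_fraction(4, 32))
```

As M grows relative to K, the bubble term vanishes, which is why GPipe's speedup approaches linear in the number of accelerators.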
Read more »

Technical Review: Pipeline Parallelism for Training Giant Neural Networks

Author: Zhongzhu Zhou
Date: 2026-04-02
Paper Title: GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Original Authors: Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen
Affiliation: Google Brain
Published at: NeurIPS 2019
ArXiv ID: 1811.06965


Executive Summary & Core Contributions

GPipe is a foundational library for pipeline parallelism that solved one of the hardest practical problems in deep learning in 2019: how do you train a neural network that is too large to fit on a single accelerator, without writing architecture-specific distributed code? Before GPipe, scaling a model across multiple GPUs or TPUs required custom engineering for each architecture. GPipe changed this by providing a general-purpose pipeline parallelism framework applicable to any network expressible as a sequence of layers.

Read more »

Technical Review: "The Unreasonable Ineffectiveness of the Deeper Layers"

Author: Zhongzhu Zhou
Date: 2026-04-01
Paper Title: The Unreasonable Ineffectiveness of the Deeper Layers
Original Authors: Andrey Gromov, Paolo Glorioso, Kushal Tirumala, Hassan Shapourian, Daniel A. Roberts
Published at: ICLR 2025
ArXiv ID: 2403.17887


Executive Summary & Core Contributions

This paper presents a striking empirical finding: up to 50% of the deeper layers in large language models like LLaMA-2-70B can be removed with minimal degradation in downstream task performance. Through systematic layer-pruning experiments, the authors challenge the widely held assumption that all layers in a deep neural network contribute meaningfully to the model's decision-making process.
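The pruning criterion is easy to state concretely: compare the activations entering and leaving each candidate block of layers, and drop the block that changes them least. A toy sketch of that selection step (in practice the angular distance is measured on real hidden states over a calibration set, and the pruned model is then "healed" with a small amount of parameter-efficient fine-tuning):

```python
import numpy as np

def best_block_to_prune(hidden_states, n):
    """Score each block of n consecutive layers by the angular distance
    between its input x_l and output x_{l+n} activations, and return the
    start index of the block whose removal changes representations least."""
    def angular(a, b):
        cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
        return float(np.arccos(np.clip(cos, -1.0, 1.0)).mean()) / np.pi
    num_layers = len(hidden_states) - 1
    dists = [angular(hidden_states[l], hidden_states[l + n])
             for l in range(num_layers - n + 1)]
    return int(np.argmin(dists))

# Toy activations: layers 2..4 barely change the representation,
# so the 2-layer block starting at index 2 is the best candidate to drop.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
states = [rng.standard_normal((16, 32)), rng.standard_normal((16, 32)), x, x, x]
print(best_block_to_prune(states, 2))  # 2
```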

Read more »

1. What This Paper Does

Imagine you want to teach a child to behave well. The traditional approach is to have a human supervisor watch the child's every action and correct them—but this is incredibly expensive and doesn't scale. What if instead, you could write down a set of rules (a "constitution") and train the child to self-correct based on those rules, with another well-behaved child helping to evaluate behavior? That is essentially what Constitutional AI (CAI) does for language models.

Bai et al. from Anthropic introduce Constitutional AI, a method for training AI assistants to be helpful, harmless, and honest (HHH) without requiring any human feedback labels for harmlessness. Instead of relying on tens of thousands of human annotations to identify harmful content, CAI uses a small set of natural language principles—a "constitution"—to guide the model's self-improvement. The process has two stages: (1) a supervised learning (SL) stage where the model critiques and revises its own harmful responses, and (2) a reinforcement learning (RL) stage where the model generates its own preference labels for harmlessness (called "RL from AI Feedback" or RLAIF), which are then used to train a reward model and fine-tune the policy.
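The SL stage's critique-revise loop can be sketched as follows. This is a toy stand-in, not Anthropic's pipeline: `toy_model` replaces a real LM call, and the two principles are illustrative paraphrases of constitutional principles.

```python
# Illustrative principles; the real constitution has many, sampled per pass.
CRITIQUE_PRINCIPLE = "Identify ways the response is harmful, unethical, or dishonest."
REVISION_PRINCIPLE = "Rewrite the response to remove any harmful content."

def toy_model(prompt: str) -> str:
    """Stand-in for a real LM call; keys off the prompt suffix only."""
    if prompt.rstrip().endswith(CRITIQUE_PRINCIPLE):
        return "The response assists with a harmful request."
    if prompt.rstrip().endswith(REVISION_PRINCIPLE):
        return "I can't help with that, but here is a safer alternative."
    return "Sure, here is exactly how to do that harmful thing."

def critique_and_revise(red_team_prompt: str, n_rounds: int = 1) -> str:
    """SL-CAI stage: sample an (often harmful) response to a red-team prompt,
    then alternate critique and revision passes guided by the constitution.
    The revised responses become supervised fine-tuning targets."""
    response = toy_model(red_team_prompt)
    for _ in range(n_rounds):
        critique = toy_model(f"{red_team_prompt}\n{response}\n{CRITIQUE_PRINCIPLE}")
        response = toy_model(f"{red_team_prompt}\n{response}\n{critique}\n{REVISION_PRINCIPLE}")
    return response

print(critique_and_revise("Tell me how to do something harmful."))
```

The RL stage works analogously, except the model compares response pairs against a principle to produce the AI-generated preference labels used to train the reward model.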

Read more »

1. What This Paper Does

Imagine you ask a very smart assistant to solve a math word problem. If you just say "answer this," the assistant might blurt out a number without thinking it through—and often get it wrong. But if you first show the assistant a few examples of how to think step by step, suddenly it can solve much harder problems. That is the core insight of this paper.

Wei et al. introduce chain-of-thought (CoT) prompting, a remarkably simple technique: instead of giving a language model plain input-output examples in a few-shot prompt, you include intermediate reasoning steps—a "chain of thought"—in each example. The model then learns to produce its own chain of thought before arriving at an answer. No fine-tuning, no new training data, no architectural changes—just a different way of writing your prompt.
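Since the whole technique is a prompt format, it fits in a few lines. The exemplar below is in the style of the paper's worked tennis-ball example; the trailing "The answer is N." sentence gives a fixed pattern for extracting the final answer.

```python
def make_cot_prompt(exemplars, question):
    """Assemble a few-shot prompt whose "A:" fields contain full reasoning
    chains rather than bare answers -- the only change CoT prompting makes."""
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in exemplars)
    return shots + f"Q: {question}\nA:"

exemplars = [(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis "
    "balls. 5 + 6 = 11. The answer is 11.",
)]
prompt = make_cot_prompt(
    exemplars,
    "A juggler has 16 balls. Half of the balls are golf balls. "
    "How many golf balls are there?",
)
print(prompt)
```

Standard few-shot prompting would put only "The answer is 11." after "A:"; including the reasoning chain is the entire intervention.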

Read more »

1. What This Paper Does

Ring Attention solves one of the most stubborn problems in modern deep learning: the memory wall that prevents Transformers from processing long sequences. Even with memory-efficient attention (FlashAttention) and blockwise computation, the output activations of each Transformer layer must be stored and have size proportional to the sequence length. For 100 million tokens with hidden size 1024, that alone exceeds 1,000 GB — far beyond any single GPU or TPU.

The key insight is elegant: if you compute self-attention in a blockwise fashion (block-by-block), the order in which you process key-value blocks does not matter, as long as you combine the statistics correctly. This permutation invariance means you can place devices in a ring, have each device hold one query block, and rotate key-value blocks around the ring. While a device computes attention against the current key-value block, it simultaneously sends that block to the next device and receives a new block from the previous device. If the computation takes longer than the communication, the communication is completely hidden — zero overhead.
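The "combine the statistics correctly" step is the standard online-softmax recurrence. A minimal single-process sketch of the blockwise computation (no actual ring communication, which in the real system overlaps with this compute):

```python
import numpy as np

def blockwise_attention(q, kv_blocks):
    """Numerically stable attention over a stream of (K, V) blocks.

    Maintains a running max, running softmax denominator, and running
    unnormalized output, so blocks can arrive in any order -- the
    permutation invariance that lets Ring Attention rotate KV blocks
    around a ring of devices."""
    m = np.full(q.shape[0], -np.inf)          # running max of logits
    l = np.zeros(q.shape[0])                  # running softmax denominator
    o = np.zeros_like(q)                      # running unnormalized output
    for k, v in kv_blocks:
        s = q @ k.T                           # this block's attention logits
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)             # rescale old statistics
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        o = o * scale[:, None] + p @ v
        m = m_new
    return o / l[:, None]

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
blocks = [(rng.standard_normal((5, 8)), rng.standard_normal((5, 8)))
          for _ in range(3)]
out1 = blockwise_attention(q, blocks)
out2 = blockwise_attention(q, blocks[::-1])   # reversed block order
assert np.allclose(out1, out2)                # order does not matter
```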

Read more »

1. What This Paper Does

Mamba introduces a selective state space model (selective SSM) that, for the first time, achieves Transformer-quality performance on language modeling while scaling linearly in sequence length — not quadratically like standard attention. The key insight is deceptively simple: make the parameters of a state space model depend on the input, so the model can choose what to remember and what to forget at each timestep. This seemingly small change has profound consequences: it breaks the mathematical equivalence between SSMs and convolutions that prior work relied on for efficiency, forcing the authors to invent a new hardware-aware parallel algorithm. The result is Mamba, a clean architecture with no attention and no MLP blocks, that runs at 5× the inference throughput of Transformers and matches or exceeds their quality across language, audio, and genomics.
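The selectivity mechanism can be sketched as a sequential reference recurrence (a toy sketch; the paper's hardware-aware kernel computes this scan in parallel on-chip, and the real model parameterizes B, C, and dt with learned projections of the input):

```python
import numpy as np

def selective_scan(u, A, make_B, make_C, make_dt):
    """Minimal selective SSM recurrence: B, C, and the step size dt are
    functions of the input u_t, which is what lets the model choose what
    to remember and what to forget at each timestep."""
    x = np.zeros(A.shape[0])                      # hidden state
    ys = []
    for u_t in u:
        dt = make_dt(u_t)                         # input-dependent step size
        A_bar = np.exp(dt * A)                    # discretize diagonal A
        B_bar = dt * make_B(u_t)                  # input-dependent write gate
        x = A_bar * x + B_bar * u_t               # selective state update
        ys.append(make_C(u_t) @ x)                # input-dependent readout
    return np.array(ys)

# Selectivity in action: a token that maps to dt = 0 is ignored entirely
# (A_bar = 1, B_bar = 0), so neither the state nor the output changes.
A = -np.ones(4)
make_B = make_C = lambda u: np.ones(4)
make_dt = lambda u: 0.0 if u == 0.0 else 0.5
y = selective_scan(np.array([1.0, 0.0, 2.0]), A, make_B, make_C, make_dt)
assert y[0] == y[1]                               # the zero token was skipped
```

Because dt depends on u_t, the recurrence no longer reduces to a fixed convolution kernel, which is exactly why the convolutional training trick of earlier SSMs stops applying.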

Read more »

Executive Summary

This paper addresses one of the most critical bottlenecks in modern large language model training: optimizer state memory consumption. While most practitioners focus on reducing parameter count through methods like LoRA, GaLore takes a different approach by attacking the actual memory-dominant term—the first and second moment estimates maintained by optimizers like AdamW.

The key innovation is elegant: instead of forcing weights into low-rank spaces (which constrains model expressivity), GaLore exploits the observation that gradient matrices naturally exhibit low-rank structure during training. By decomposing gradients, performing optimizer updates in the compressed rank-r space, and projecting updates back to full rank, the method achieves:
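The project/update/project-back loop described above can be sketched in a few lines. This is an illustrative sketch, not the reference implementation: bias correction and the periodic projector refresh are omitted, and all names are ours.

```python
import numpy as np

def galore_step(W, grad, state, rank=4, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One GaLore-style update: project the gradient into a rank-r subspace,
    run Adam-style moments there, then project the update back to full rank."""
    if state.get("P") is None:
        U, _, _ = np.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                  # top-r left singular vectors
        state["m"] = np.zeros((rank, W.shape[1])) # moments live in the small
        state["v"] = np.zeros((rank, W.shape[1])) # space: O(rn), not O(mn)
    P = state["P"]
    g = P.T @ grad                                # compress: (m x n) -> (r x n)
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g**2
    update = state["m"] / (np.sqrt(state["v"]) + eps)
    return W - lr * (P @ update)                  # decompress to full rank

rng = np.random.default_rng(0)
W, state = rng.standard_normal((8, 6)), {}        # state persists across steps
W = galore_step(W, rng.standard_normal((8, 6)), state)
```

Note that W itself stays full-rank throughout; only the optimizer state is compressed, which is the difference from LoRA-style weight factorization.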

Read more »

Abstract

Training extremely large deep learning models — with billions or even hundreds of billions of parameters — on distributed GPU clusters is one of the most challenging engineering problems in modern machine learning. Today, achieving high throughput requires manually combining multiple forms of parallelism (data, tensor/operator, pipeline) in ways that are specific to both the model architecture and the cluster topology. This manual process demands deep systems expertise and does not generalize across different models or hardware setups. Alpa is a compiler system that automates this entire process. It introduces a new hierarchical view of parallelism — distinguishing between intra-operator and inter-operator parallelism — and uses a combination of Integer Linear Programming (ILP) and Dynamic Programming (DP) to automatically generate near-optimal execution plans. Evaluated on GPT-3, GShard MoE, and Wide-ResNet at up to 64 GPUs, Alpa matches or outperforms hand-tuned systems like Megatron-LM and DeepSpeed, and generalizes to models that have no manually-designed parallelization strategies at all.
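The flavor of the inter-operator DP can be conveyed with a drastically simplified stand-in: split a layer sequence into contiguous pipeline stages so the slowest stage is as fast as possible. Alpa's real DP is richer, since each candidate stage's cost comes from an ILP that chooses the best intra-operator sharding for that stage on its device mesh.

```python
import functools

def best_partition(layer_costs, n_stages):
    """Toy inter-operator DP: minimize the bottleneck (max) cost over all
    splits of a layer sequence into contiguous pipeline stages."""
    @functools.lru_cache(maxsize=None)
    def dp(i, s):
        if s == 1:                                # last stage takes the rest
            return sum(layer_costs[i:])
        return min(
            max(sum(layer_costs[i:j]), dp(j, s - 1))
            for j in range(i + 1, len(layer_costs) - s + 2)
        )
    return dp(0, n_stages)

print(best_partition([1, 2, 3, 4], 2))  # 6: stages [1, 2, 3] and [4]
```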

Read more »

Table of Contents

  1. Introduction: Why Model Quantization Matters
  2. Prerequisites: Background Knowledge You Need
  3. The GPTQ Method: Three Key Insights
  4. The Full Algorithm
  5. Experimental Results and Analysis
  6. Practical Speedups and Deployment
  7. Extreme Quantization and Grouping
  8. Limitations and Discussion
  9. Conclusion and Impact

1. Introduction: Why Model Quantization Matters

Consider the practical challenge of deploying a state-of-the-art large language model. GPT-3, with its 175 billion parameters, requires 326 GB of memory when stored in the compact FP16 (16-bit floating point) format. This exceeds the capacity of even the most powerful single GPU available (NVIDIA A100 with 80 GB), meaning you need at least 5 GPUs just for inference — not training, just running the model to generate text.
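The arithmetic behind these numbers is worth making explicit (the GiB convention is assumed, which is how 175B FP16 parameters come out to 326 rather than 350):

```python
import math

def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GiB (activations, KV cache excluded)."""
    return n_params * bits_per_param / 8 / 2**30

GPT3_PARAMS = 175e9
fp16 = weight_memory_gib(GPT3_PARAMS, 16)      # ~326 GiB: the figure above
int4 = weight_memory_gib(GPT3_PARAMS, 4)       # ~81 GiB after 4-bit quantization
print(round(fp16), math.ceil(fp16 / 80))       # -> 326 GiB, 5 A100-80GB GPUs
print(round(int4), math.ceil(int4 / 80))       # -> 81 GiB, 2 A100-80GB GPUs
```

Quantizing to 4 bits cuts the weight footprint 4x, which is the deployment motivation for GPTQ in the first place.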

Read more »

Abstract

Reinforcement learning (RL) has become a cornerstone of modern AI — from training robots to walk, to aligning large language models with human preferences via RLHF. At the heart of many of these breakthroughs lies a deceptively simple algorithm called Proximal Policy Optimization (PPO), introduced by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov at OpenAI in 2017.

PPO proposes a new family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent. Unlike standard policy gradient methods that perform one gradient update per data sample, PPO introduces a novel clipped objective function that enables multiple epochs of minibatch updates without catastrophically large policy changes. The result is an algorithm that inherits the stability of Trust Region Policy Optimization (TRPO) while being dramatically simpler to implement — requiring only a few lines of code change to a vanilla policy gradient implementation.
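The clipped objective itself is tiny; a minimal sketch using the paper's L^CLIP form:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: E[min(r*A, clip(r, 1-eps, 1+eps)*A)], where
    r = pi_new/pi_old. Taking the min makes this a pessimistic bound, so
    there is no incentive to push r outside [1-eps, 1+eps]."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

# With positive advantage, pushing the ratio past 1 + eps gains nothing:
assert ppo_clip_objective(np.array([1.5]), np.array([1.0])) == 1.2
# With negative advantage, shrinking the ratio below 1 - eps gains nothing:
assert ppo_clip_objective(np.array([0.5]), np.array([-1.0])) == -0.8
```

Because the objective flattens outside the trust interval, the same batch can safely be reused for several epochs of minibatch gradient ascent, which is where PPO's sample-efficiency advantage over vanilla policy gradients comes from.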

Read more »

Abstract

Reinforcement learning (RL) has become a cornerstone of modern AI, from training robots to walk to aligning large language models with human preferences via RLHF. At the heart of these breakthroughs lies Proximal Policy Optimization (PPO), a deceptively simple yet far-reaching algorithm introduced by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov at OpenAI in 2017.

PPO proposes a new family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function via stochastic gradient ascent. Unlike standard policy gradient methods, which perform one gradient update per data sample, PPO introduces a novel clipped objective that allows multiple epochs of minibatch updates on the same batch without catastrophically large policy changes. The result: PPO inherits the stability of Trust Region Policy Optimization (TRPO) while being dramatically simpler, requiring only a few lines of code change over a vanilla policy gradient implementation.

Experiments show that PPO performs strongly across a wide range of benchmarks: continuous control tasks (MuJoCo), complex 3D humanoid locomotion (Roboschool), and Atari games. PPO strikes an excellent balance between sample efficiency, simplicity, and wall-clock time.

Why does this paper still matter today? PPO is arguably the most influential RL algorithm of the deep learning era. It became the default algorithm for robot training and game AI, and, most critically, it is the core optimization engine of the RLHF pipeline: the human-preference alignment of large language models such as ChatGPT, Claude, and Gemini depends on it. Understanding PPO is essential background for anyone working on AI alignment, LLM training, or modern RL systems.

Read more »