
1. Why This Paper Matters

If you only remember one sentence from this review, I want it to be this:

SmoothQuant is important because it turns a seemingly annoying numerical issue—activation outliers—into a clean systems trick that real hardware can actually use.

Large language models are expensive for two reasons:

  1. they store a huge number of weights, and
  2. they repeatedly move those weights and activations through matrix multiplications.

That means memory footprint, memory bandwidth, and integer-kernel friendliness are not side details. They are central engineering constraints.
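The outlier-migration trick at the heart of the paper can be sketched in a few lines of numpy. The scale formula (per-channel, with migration strength α) follows the paper; the shapes, seed, and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))           # activations: (tokens, in_features)
X[:, 3] *= 50.0                        # one outlier channel, as observed in LLMs
W = rng.normal(size=(8, 4))            # weights: (in_features, out_features)

alpha = 0.5                            # migration strength
s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))

X_smooth = X / s                       # activations become easier to quantize
W_smooth = W * s[:, None]              # weights absorb the outlier scale

# mathematically equivalent: (X diag(1/s)) (diag(s) W) == X W
assert np.allclose(X_smooth @ W_smooth, X @ W)
```

The point is that the product is unchanged, so the smoothing can be folded into the weights offline, while the activation range that an INT8 kernel must cover shrinks dramatically.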

Read more »

1. What This Paper Does

Let me start in the simplest possible way.

Imagine you are teaching a student to answer questions politely. You show the student two answers to the same question:

  • one answer is good, helpful, and safe;
  • the other answer is bad, vague, or annoying.

Now you want the student to learn both of these lessons at the same time:

  1. Please imitate the good answer.
  2. Please stop sounding like the bad answer.

A surprising amount of alignment work does not do these two things in one clean step. The standard modern pipeline often looks like this:

  • start from a pretrained model;
  • run supervised fine-tuning (SFT) so it learns the target domain;
  • train a reward model or hold a reference model;
  • then run a second-stage preference optimization method such as RLHF or DPO.

This paper asks a very practical question:

Can we collapse that multi-stage pipeline into one simpler training objective, with no separate reference model, and without losing the benefit of preference learning?

The answer proposed by the paper is ORPO: Odds Ratio Preference Optimization.
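The two lessons above can be written as a single loss. A minimal numpy sketch of the odds-ratio objective, where `logp_*` are mean per-token log-probabilities of each answer; λ = 0.1 and all names are illustrative, not the paper's reference implementation:

```python
import numpy as np

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """SFT loss on the chosen answer plus an odds-ratio penalty, in one objective."""
    def log_odds(logp):
        p = np.exp(logp)                 # length-normalized sequence prob in (0, 1)
        return logp - np.log(1.0 - p)    # log(p / (1 - p))
    margin = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)
    l_sft = -logp_chosen                            # ordinary NLL / SFT term
    return l_sft + lam * l_or

# separating chosen from rejected lowers the loss
hard = orpo_loss(-0.5, -0.4)    # barely separated pair
easy = orpo_loss(-0.5, -3.0)    # well separated pair
assert easy < hard
```

Note that no reference model appears anywhere: both terms are computed from the policy's own log-probabilities.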

Read more »

1. What This Paper Does

Imagine you have a massive library with thousands of specialist librarians, each an expert in a different topic. When you walk in with a question about, say, ancient Roman aqueducts, instead of asking every single librarian (which would take forever), you are instantly directed to the one librarian who knows the most about Roman engineering. You get an expert-quality answer in the time it would take to ask just one person. That is the core idea behind the Switch Transformer.

In the world of deep learning, conventional models use all of their parameters to process every input. A model with 11 billion parameters applies all 11 billion weights to every single word it reads. The Switch Transformer flips this on its head: it has an enormous number of parameters—up to 1.6 trillion—but for any given input token, only a small fraction of those parameters are activated. The rest sit idle, waiting for inputs that need their particular expertise.

The paper, written by William Fedus, Barret Zoph, and Noam Shazeer at Google, introduces the Switch Transformer architecture as a simplification of earlier Mixture-of-Experts (MoE) approaches. The key change is routing each token to just one expert instead of two or more. That seemingly small change yields large practical benefits: lower compute cost, a simpler implementation, less communication overhead, and better training stability.
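The single-librarian dispatch can be sketched as top-1 routing over a set of expert networks. Dimensions, weights, and the gate-scaling detail below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 8, 4, 5
tokens = rng.normal(size=(n_tokens, d_model))
w_router = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

logits = tokens @ w_router
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)     # softmax gate over experts
chosen = probs.argmax(-1)                 # top-1: one expert per token

# each token is processed by exactly one expert, scaled by its gate value
out = np.stack([probs[i, e] * (tokens[i] @ experts[e])
                for i, e in enumerate(chosen)])
```

Only one of the `n_experts` weight matrices is touched per token, which is why parameter count and per-token compute decouple.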

Read more »

1. The Problem: LLMs Are Too Big for Your Phone

Imagine you want to run ChatGPT-level AI on your phone, completely offline — no internet, no cloud server, total privacy. Sounds great, right? The problem is that modern large language models (LLMs) are enormous.

GPT-3, released in 2020, has 175 billion parameters. Each parameter is typically stored as a 16-bit floating-point number (FP16), occupying 2 bytes, so the full model needs 175 billion × 2 bytes = 350 GB of memory. The best consumer GPUs (such as the NVIDIA RTX 4090) have 24 GB of memory, and your phone has maybe 8-12 GB of total RAM. Even a "smaller" model like LLaMA-2-7B (7 billion parameters) needs about 14 GB in FP16: it barely fits on a high-end GPU and is completely out of reach for a phone. Clearly, something has to give.
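The arithmetic is worth spelling out (byte counts only, with 1 GB meaning 10^9 bytes):

```python
def fp16_gigabytes(n_params):
    """Memory needed to store n_params parameters at 2 bytes (FP16) each."""
    return n_params * 2 / 1e9

print(fp16_gigabytes(175e9))   # GPT-3: 350.0 GB
print(fp16_gigabytes(7e9))     # a 7B model: 14.0 GB
```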

Read more »

Technical Review: Pipeline Parallelism for Training Giant Neural Networks

Author: Zhongzhu Zhou
Date: 2026-04-02
Paper Title: GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Original Authors: Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen
Affiliation: Google Brain
Published at: NeurIPS 2019
ArXiv ID: 1811.06965


Executive Summary & Core Contributions

GPipe is a foundational library for pipeline parallelism that solved one of the hardest practical problems in deep learning in 2019: how do you train a neural network that is too large to fit on a single accelerator, without writing architecture-specific distributed code? Before GPipe, scaling a model across multiple GPUs or TPUs required custom engineering for each architecture. GPipe changed this by providing a general-purpose pipeline parallelism framework applicable to any network expressible as a sequence of layers. Its core algorithmic idea is micro-batch pipelining: each mini-batch is split into smaller micro-batches that flow through the accelerators as a pipeline, achieving near-linear speedup.
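Why micro-batches help can be seen from a simple step count: with K pipeline stages and M micro-batches, a forward sweep takes K + M - 1 pipeline steps, of which any single stage is busy for only M, so the idle "bubble" shrinks as M grows. This is a toy model that ignores backward passes and communication:

```python
def bubble_fraction(n_stages, n_microbatches):
    """Fraction of a stage's time slots left idle in a forward-only pipeline:
    total steps = n_stages + n_microbatches - 1, of which n_microbatches are busy."""
    total_steps = n_stages + n_microbatches - 1
    return (n_stages - 1) / total_steps

for m in (1, 4, 32):
    print(f"4 stages, {m} micro-batches -> bubble {bubble_fraction(4, m):.2f}")
```

With one micro-batch (plain model parallelism) a 4-stage pipeline idles 75% of the time; with 32 micro-batches the bubble drops below 10%, which is the "near-linear speedup" regime.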

Read more »

Technical Review: "The Unreasonable Ineffectiveness of the Deeper Layers"

Author: Zhongzhu Zhou
Date: 2026-04-01
Paper Title: The Unreasonable Ineffectiveness of the Deeper Layers
Original Authors: Andrey Gromov, Paolo Glorioso, Kushal Tirumala, Hassan Shapourian, Daniel A. Roberts
Published at: ICLR 2025
ArXiv ID: 2403.17887


Executive Summary & Core Contributions

This paper presents a striking empirical finding: up to 50% of the deeper layers in large language models like LLaMA-2-70B can be removed with minimal degradation in downstream task performance. Through systematic layer-pruning experiments, the authors challenge the widely held assumption that every layer in a deep neural network contributes meaningfully to the model's decision-making process.
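The paper selects which consecutive block of layers to drop by finding the block whose input and output representations are most similar. A minimal sketch of that selection, assuming you already have a representative hidden state entering each layer; the angular-distance measure follows the paper, everything else (shapes, the toy data) is illustrative:

```python
import numpy as np

def best_block_to_prune(hidden, n_drop):
    """hidden[l]: representation entering layer l (hidden[-1] = final output).
    Returns the start index of the n_drop-layer block whose removal perturbs
    the residual stream least, by angular distance between block input/output."""
    def angular(a, b):
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
    n_layers = len(hidden) - 1
    dists = [angular(hidden[l], hidden[l + n_drop])
             for l in range(n_layers - n_drop + 1)]
    return int(np.argmin(dists))

# toy residual stream where layers 5..8 barely change the representation
rng = np.random.default_rng(0)
hidden = [rng.normal(size=16)]
for l in range(12):
    step = 0.01 if 5 <= l < 9 else 1.0
    hidden.append(hidden[-1] + step * rng.normal(size=16))
assert best_block_to_prune(hidden, 4) == 5
```

In the paper this score is averaged over data, and the pruned model is optionally "healed" with a small amount of fine-tuning.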

Read more »

1. What This Paper Does

Imagine you want to teach a child to behave well. The traditional approach is to have a human supervisor watch the child's every action and correct them—but this is incredibly expensive and doesn't scale. What if instead, you could write down a set of rules (a "constitution") and train the child to self-correct based on those rules, with another well-behaved child helping to evaluate behavior? That is essentially what Constitutional AI (CAI) does for language models.

Bai et al. from Anthropic introduce Constitutional AI, a method for training AI assistants to be helpful, harmless, and honest (HHH) without requiring any human feedback labels for harmlessness. Instead of relying on tens of thousands of human annotations to identify harmful content, CAI uses a small set of natural language principles—a "constitution"—to guide the model's self-improvement. The process has two stages: (1) a supervised learning (SL) stage where the model critiques and revises its own harmful responses, and (2) a reinforcement learning (RL) stage where the model generates its own preference labels for harmlessness (called "RL from AI Feedback" or RLAIF), which are then used to train a reward model and fine-tune the policy.
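The SL stage's critique-and-revision loop can be sketched as plain control flow. Here `ask` stands in for any LLM completion call, and the prompt wording is purely illustrative, not Anthropic's actual prompts:

```python
def critique_and_revise(ask, prompt, principles):
    """ask(text) -> model completion. Runs one critique/revision pass per
    constitutional principle and returns the final revised response."""
    response = ask(prompt)
    for principle in principles:
        critique = ask(
            f"Principle: {principle}\n"
            f"Critique the response below for violations of this principle.\n"
            f"Response: {response}"
        )
        response = ask(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\n"
            f"Response: {response}"
        )
    return response  # revised responses become SFT targets
```

The revised responses are then used as supervised fine-tuning data, and the same constitution later drives the AI-generated preference labels in the RL stage.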

Read more »