
1. Why This Paper Matters

If you only remember one sentence from this review, I want it to be this:

SmoothQuant is important because it turns a seemingly annoying numerical issue—activation outliers—into a clean systems trick that real hardware can actually use.

Large language models are expensive for two reasons:

  1. they store a huge amount of weights, and
  2. they repeatedly move those weights and activations through matrix multiplications.

That means memory footprint, memory bandwidth, and integer-kernel friendliness are not side details. They are central engineering constraints.
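The smoothing trick itself fits in a few lines. Below is a minimal NumPy sketch (function name and test data are mine; `alpha` is the paper's migration-strength hyperparameter, default 0.5): each activation channel is divided by a scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha) and the matching weight row is multiplied by it, so the product X @ W is mathematically unchanged while the outlier channels shrink to a quantization-friendly range.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Migrate quantization difficulty from activations to weights.

    Per-input-channel scale: s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1-alpha).
    Dividing X by s and multiplying W by s leaves X @ W exactly unchanged.
    """
    act_max = np.abs(X).max(axis=0)   # per-channel activation range
    w_max = np.abs(W).max(axis=1)     # per-channel weight range
    s = act_max**alpha / w_max**(1 - alpha)
    return X / s, W * s[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50.0                       # inject an outlier activation channel
W = rng.normal(size=(8, 16))
Xs, Ws = smooth(X, W)
assert np.allclose(X @ W, Xs @ Ws)    # the matmul result is preserved
print(np.abs(X).max(), np.abs(Xs).max())  # outlier magnitude is reduced
```

Because the scales can be folded into the previous layer's weights offline, the smoothed model runs on plain INT8 kernels with no extra work at inference time.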

Read more »

1. What This Paper Does

Let me start in the simplest possible way.

Imagine you are teaching a student to answer questions politely. You show the student two answers to the same question:

  • one answer is good, helpful, and safe;
  • the other answer is bad, vague, or annoying.

Now you want the student to learn both of these lessons at the same time:

  1. Please imitate the good answer.
  2. Please stop sounding like the bad answer.

A surprising amount of alignment work does not do these two things in one clean step. The standard modern pipeline often looks like this:

  • start from a pretrained model;
  • run supervised fine-tuning (SFT) so it learns the target domain;
  • train a reward model or hold a reference model;
  • then run a second-stage preference optimization method such as RLHF or DPO.

This paper asks a very practical question:

Can we collapse that multi-stage pipeline into one simpler training objective without losing the benefit of preference learning?

The answer proposed by the paper is ORPO: Odds Ratio Preference Optimization.
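On scalar (length-normalized) sequence probabilities, the objective can be sketched in a few lines. This is a hedged illustration rather than the paper's implementation: `lam` stands in for the paper's λ weighting, and real training works on per-token log-probabilities from the model.

```python
import math

def odds(p):
    # odds of a sequence probability p: how much likelier than not
    return p / (1.0 - p)

def orpo_loss(p_chosen, p_rejected, lam=0.1):
    """One-step ORPO objective on two sequence probabilities.

    The NLL term imitates the chosen answer (the SFT part); the odds-ratio
    term pushes the chosen answer's odds above the rejected answer's,
    all in a single objective with no reference model.
    """
    nll = -math.log(p_chosen)
    log_odds_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    sigmoid = 1.0 / (1.0 + math.exp(-log_odds_ratio))
    return nll + lam * -math.log(sigmoid)
```

The loss falls as the model assigns more probability to the chosen answer and less to the rejected one, which is exactly the "imitate this, stop sounding like that" pair of lessons in one step.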

Read more »

1. What This Paper Does

Imagine you have a massive library with thousands of specialist librarians, each an expert in a different topic. When you walk in with a question about, say, ancient Roman aqueducts, instead of asking every single librarian (which would take forever), you are instantly directed to the one librarian who knows the most about Roman engineering. You get an expert-quality answer in the time it would take to ask just one person. That is the core idea behind the Switch Transformer.

In the world of deep learning, conventional models use all of their parameters to process every input. A model with 11 billion parameters applies all 11 billion weights to every single word it reads. The Switch Transformer flips this on its head: it has an enormous number of parameters—up to 1.6 trillion—but for any given input token, only a small fraction of those parameters are activated. The rest sit idle, waiting for inputs that need their particular expertise.
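A toy top-1 routing layer makes the librarian metaphor concrete. This is a sketch with made-up shapes and names, not the production implementation: a router scores each token, only the winning expert's weights are applied, and the router's gate value scales the output so routing stays differentiable.

```python
import numpy as np

def switch_layer(x, W_router, experts):
    """Route each token to its single best expert (top-1 routing).

    x: (tokens, d); W_router: (d, n_experts); experts: list of (d, d) matrices.
    Each token touches exactly one expert, so per-token compute stays
    constant no matter how many experts (parameters) the layer holds.
    """
    logits = x @ W_router
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    top1 = probs.argmax(axis=-1)               # the chosen librarian
    out = np.empty_like(x)
    for e, W in enumerate(experts):
        mask = top1 == e
        # gate value scales the expert output, keeping routing trainable
        out[mask] = (x[mask] @ W) * probs[mask, e][:, None]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
W_router = rng.normal(size=(4, 3))
experts = [rng.normal(size=(4, 4)) for _ in range(3)]
y = switch_layer(x, W_router, experts)
```

Adding more experts grows the parameter count without changing the work done per token, which is how the model reaches 1.6 trillion parameters at a fixed compute budget.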

Read more »

1. The Problem: LLMs Are Too Big for Your Phone

Imagine you want to run ChatGPT-level AI on your phone, completely offline — no internet, no cloud server, total privacy. Sounds great, right? The problem is that modern large language models (LLMs) are enormous.

GPT-3, released in 2020, has 175 billion parameters. Each parameter is typically stored as a 16-bit floating-point number (FP16), meaning it occupies 2 bytes of memory. Total memory for GPT-3: 350 GB. The best consumer GPUs have 24 GB of memory. Your phone has maybe 8-12 GB total RAM. Clearly, something has to give.
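The arithmetic above is worth making explicit (the helper name is mine): weight memory is just parameter count times bits per parameter, which is also why lowering the bit width is the obvious lever.

```python
def model_memory_gb(n_params, bits_per_param):
    """Memory to store the weights alone (ignores activations and KV cache)."""
    return n_params * bits_per_param / 8 / 1e9

print(model_memory_gb(175e9, 16))  # GPT-3 in FP16 -> 350.0 GB
print(model_memory_gb(175e9, 4))   # same parameter count at 4 bits -> 87.5 GB
```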

Read more »

Technical Review: How Pipeline Parallelism Made Training Giant Neural Networks Possible

Author: Zhongzhu Zhou
Date: 2026-04-02
Paper Title: GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Original Authors: Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen
Affiliation: Google Brain
Published at: NeurIPS 2019
ArXiv ID: 1811.06965


Core Summary & Contributions

In 2019, the Google Brain team published GPipe, solving a critical engineering problem in deep learning: when a neural network grows too large to fit on a single GPU/TPU, how do you train it efficiently across multiple accelerator cards?

Before GPipe, training a large model across devices meant writing custom distributed code for each network architecture, which was both time-consuming and error-prone. GPipe proposed a general-purpose pipeline parallelism scheme that works for any neural network expressible as a sequence of layers, fundamentally lowering the engineering barrier to training giant models.

GPipe's core contributions include:

  1. Micro-batch pipeline parallelism: split each mini-batch into smaller micro-batches and stream them through the accelerator cards as a pipeline, achieving near-linear speedup.
Read more »

Technical Review: Pipeline Parallelism for Training Giant Neural Networks

Author: Zhongzhu Zhou
Date: 2026-04-02
Paper Title: GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Original Authors: Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen
Affiliation: Google Brain
Published at: NeurIPS 2019
ArXiv ID: 1811.06965


Executive Summary & Core Contributions

GPipe is a foundational library for pipeline parallelism that solved one of the hardest practical problems in deep learning in 2019: how do you train a neural network that is too large to fit on a single accelerator, without writing architecture-specific distributed code? Before GPipe, scaling a model across multiple GPUs or TPUs required custom engineering for each architecture. GPipe changed this by providing a general-purpose pipeline parallelism framework applicable to any network expressible as a sequence of layers.
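The cost of pipelining is the warm-up/drain "bubble" at the ends of each mini-batch. A tiny helper (name mine) computes the standard estimate, (K - 1) / (M + K - 1) for K pipeline stages and M micro-batches, and shows why splitting the mini-batch finer drives the idle fraction toward zero:

```python
def bubble_fraction(stages, micro_batches):
    """Idle fraction of a pipeline with K stages and M micro-batches.

    The pipeline needs M + K - 1 ticks to flush a mini-batch, of which
    K - 1 are warm-up/drain 'bubble' ticks where some stages sit idle.
    """
    K, M = stages, micro_batches
    return (K - 1) / (M + K - 1)

# Finer micro-batching shrinks the bubble on a 4-stage pipeline:
for M in (1, 4, 32):
    print(M, round(bubble_fraction(4, M), 3))
```

With M = 1 (no micro-batching) a 4-stage pipeline is idle 75% of the time; with 32 micro-batches the bubble drops below 9%, which is where the near-linear scaling comes from.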

Read more »

Technical Review: "The Unreasonable Ineffectiveness of the Deeper Layers"

Author: Zhongzhu Zhou
Date: 2026-04-01
Paper Title: The Unreasonable Ineffectiveness of the Deeper Layers
Original Authors: Andrey Gromov, Paolo Glorioso, Kushal Tirumala, Hassan Shapourian, Daniel A. Roberts
Published at: ICLR 2025
ArXiv ID: 2403.17887


Executive Summary & Core Contributions

This paper presents a striking empirical finding: up to 50% of the deeper layers in large language models like LLaMA-2-70B can be removed with minimal degradation in downstream task performance. Through systematic layer pruning experiments, the authors challenge the widely held assumption that all layers in a deep neural network contribute meaningfully to the model's decision-making process.
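The pruning operation itself is trivial; the surprise is that it works. The toy residual-stream sketch below is entirely my own construction, not the authors' code: because each Transformer layer only nudges the residual stream (x -> x + f(x)) rather than replacing it, deleting a contiguous block of layers still yields a well-defined forward pass.

```python
import numpy as np

def drop_block(layers, start, n):
    """Delete a contiguous block of n layers beginning at index `start`."""
    return layers[:start] + layers[start + n:]

def forward(layers, x):
    # toy residual stream: each layer adds a small update to x,
    # so removing some deep layers perturbs the output only mildly
    for W in layers:
        x = x + np.tanh(x @ W)
    return x

rng = np.random.default_rng(0)
layers = [rng.normal(scale=0.02, size=(8, 8)) for _ in range(16)]
x = rng.normal(size=(8,))

full = forward(layers, x)
pruned = forward(drop_block(layers, 8, 6), x)   # drop 6 of the deeper layers
print(np.linalg.norm(full - pruned) / np.linalg.norm(full))
```

The paper's actual procedure is more careful than this: it picks the block to drop by measuring representation similarity across layers, and applies a small amount of healing fine-tuning afterward.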

Read more »

1. What This Paper Does

Imagine you want to teach a child to behave well. The traditional approach is to have a human supervisor watch the child's every action and correct them—but this is incredibly expensive and doesn't scale. What if instead, you could write down a set of rules (a "constitution") and train the child to self-correct based on those rules, with another well-behaved child helping to evaluate behavior? That is essentially what Constitutional AI (CAI) does for language models.

Bai et al. from Anthropic introduce Constitutional AI, a method for training AI assistants to be helpful, harmless, and honest (HHH) without requiring any human feedback labels for harmlessness. Instead of relying on tens of thousands of human annotations to identify harmful content, CAI uses a small set of natural language principles—a "constitution"—to guide the model's self-improvement. The process has two stages: (1) a supervised learning (SL) stage where the model critiques and revises its own harmful responses, and (2) a reinforcement learning (RL) stage where the model generates its own preference labels for harmlessness (called "RL from AI Feedback" or RLAIF), which are then used to train a reward model and fine-tune the policy.

Read more »

1. What This Paper Does

Imagine you ask a very smart assistant to solve a math word problem. If you just say "answer this," the assistant might blurt out a number without thinking it through—and often get it wrong. But if you first show the assistant a few examples of how to think step by step, suddenly it can solve much harder problems. That is the core insight of this paper.

Wei et al. introduce chain-of-thought (CoT) prompting, a remarkably simple technique: instead of giving a language model plain input-output examples in a few-shot prompt, you include intermediate reasoning steps—a "chain of thought"—in each example. The model then learns to produce its own chain of thought before arriving at an answer. No fine-tuning, no new training data, no architectural changes—just a different way of writing your prompt.
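The paper's running example (Roger's tennis balls) shows how small the change really is. Below, the two prompt styles side by side as plain strings; the only difference is the reasoning inserted before the final answer.

```python
# One few-shot exemplar written both ways. Standard prompting supplies only
# the answer; chain-of-thought prompting also supplies the steps.
question = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)

standard = f"Q: {question}\nA: The answer is 11."

chain_of_thought = (
    f"Q: {question}\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)
```

A few such exemplars are prepended to the test question, and the model, having seen worked reasoning, emits its own chain of thought before the answer.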

Read more »

1. What This Paper Does

Ring Attention solves one of the most stubborn problems in modern deep learning: the memory wall that prevents Transformers from processing long sequences. Even with memory-efficient attention (FlashAttention) and blockwise computation, the output activations of each Transformer layer must still be stored, and their size grows linearly with the sequence length. For 100 million tokens with hidden size 1024, that alone exceeds 1,000 GB — far beyond any single GPU or TPU.

The key insight is elegant: if you compute self-attention in a blockwise fashion (block-by-block), the order in which you process key-value blocks does not matter, as long as you combine the statistics correctly. This permutation invariance means you can place devices in a ring, have each device hold one query block, and rotate key-value blocks around the ring. While a device computes attention against the current key-value block, it simultaneously sends that block to the next device and receives a new block from the previous device. If the computation takes longer than the communication, the communication is completely hidden — zero overhead.
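That permutation invariance is easy to verify numerically. The sketch below is my own, using the same streaming-softmax bookkeeping as FlashAttention (running max m, normalizer l, weighted accumulator acc): it processes KV blocks in two different orders and gets identical results, which is exactly the property that lets a ring of devices deliver blocks in whatever order they arrive.

```python
import numpy as np

def blockwise_attention(q, kv_blocks):
    """Accumulate softmax attention over KV blocks one at a time.

    Keeps a running max m, normalizer l, and weighted sum acc, rescaling
    whenever a new block raises the max. The final answer equals dense
    softmax(q @ K.T) @ V regardless of the order blocks arrive in.
    """
    m = np.full(q.shape[0], -np.inf)
    l = np.zeros(q.shape[0])
    acc = np.zeros_like(q)
    for k, v in kv_blocks:
        s = q @ k.T                            # scores against this block
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)              # rescale old statistics
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        acc = acc * scale[:, None] + p @ v
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
blocks = [(rng.normal(size=(5, 8)), rng.normal(size=(5, 8))) for _ in range(3)]
out_fwd = blockwise_attention(q, blocks)
out_rev = blockwise_attention(q, blocks[::-1])
assert np.allclose(out_fwd, out_rev)   # KV block order does not matter
```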

Read more »

1. What This Paper Does

Mamba introduces a selective state space model (selective SSM) that, for the first time, achieves Transformer-quality performance on language modeling while scaling linearly in sequence length — not quadratically like standard attention. The key insight is deceptively simple: make the parameters of a state space model depend on the input, so the model can choose what to remember and what to forget at each timestep. This seemingly small change has profound consequences: it breaks the mathematical equivalence between SSMs and convolutions that prior work relied on for efficiency, forcing the authors to invent a new hardware-aware parallel algorithm. The result is Mamba, a clean architecture with no attention and no MLP blocks, that runs at 5× the inference throughput of Transformers and matches or exceeds their quality across language, audio, and genomics.
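The difference between a classic SSM and a selective one can be shown in a few lines. This is a toy sketch, not Mamba's recurrence or kernel (the tanh gate is my stand-in for the learned input-dependent projections): a fixed-B scan is linear in its input, which is precisely the property the SSM-as-convolution shortcut needs, while an input-dependent B breaks that linearity.

```python
import numpy as np

def lti_scan(u, A, B, C):
    """Classic SSM: fixed B and C, so the output is linear in the input
    (equivalently, a convolution with a fixed kernel)."""
    h = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        h = A * h + B * u_t
        ys.append(C @ h)
    return np.array(ys)

def selective_scan(u, A, w_B, C):
    """Selective SSM: B depends on the current input, so the model chooses
    per timestep what gets written into the state. This breaks linearity
    in u, and with it the convolutional shortcut."""
    h = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        B_t = np.tanh(w_B * u_t)       # toy input-dependent gate (assumption)
        h = A * h + B_t * u_t
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
A = np.full(4, 0.9)
B = rng.normal(size=4)
C = rng.normal(size=4)
u = rng.normal(size=16)
# Linear-time-invariant scan is linear in the input; selective scan is not:
assert np.allclose(lti_scan(2 * u, A, B, C), 2 * lti_scan(u, A, B, C))
assert not np.allclose(selective_scan(2 * u, A, B, C), 2 * selective_scan(u, A, B, C))
```

Losing the convolutional form is what forces Mamba's hardware-aware parallel scan: the recurrence is still associative, so it can be evaluated in parallel, but only with a custom kernel.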

Read more »

Executive Summary

This paper addresses one of the most critical bottlenecks in modern large language model training: optimizer state memory consumption. While most practitioners focus on reducing parameter count through methods like LoRA, GaLore takes a different approach by attacking the actual memory-dominant term—the first and second moment estimates maintained by optimizers like AdamW.

The key innovation is elegant: instead of forcing weights into low-rank spaces (which constrains model expressivity), GaLore exploits the observation that gradient matrices naturally exhibit low-rank structure during training. By decomposing gradients, performing optimizer updates in the compressed rank-r space, and projecting updates back to full rank, the method achieves:

Read more »