
1. What This Paper Does

Imagine you ask a very smart assistant to solve a math word problem. If you just say "answer this," the assistant might blurt out a number without thinking it through—and often get it wrong. But if you first show the assistant a few examples of how to think step by step, suddenly it can solve much harder problems. That is the core insight of this paper.

Wei et al. introduce chain-of-thought (CoT) prompting, a remarkably simple technique: instead of giving a language model plain input-output examples in a few-shot prompt, you include intermediate reasoning steps—a "chain of thought"—in each example. The model then learns to produce its own chain of thought before arriving at an answer. No fine-tuning, no new training data, no architectural changes—just a different way of writing your prompt.
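The prompt format is easy to reproduce. Here is a minimal sketch in Python — the first exemplar is the paper's well-known tennis-ball example, but the helper function and its name are our own illustration, not anything from the paper:

```python
def build_cot_prompt(exemplars, question):
    """Assemble a few-shot chain-of-thought prompt: each exemplar pairs a
    question with worked reasoning that ends in 'The answer is ...'."""
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(parts)

EXEMPLARS = [(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.",
)]

prompt = build_cot_prompt(
    EXEMPLARS,
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
)
```

Because the exemplar's answer walks through its reasoning, the model tends to continue in the same style for the new question before committing to a final number.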

Read more »

1. What This Paper Does

Ring Attention solves one of the most stubborn problems in modern deep learning: the memory wall that prevents Transformers from processing long sequences. Even with memory-efficient attention (FlashAttention) and blockwise computation, the output activations of each Transformer layer must be stored and have size proportional to the sequence length. For 100 million tokens with hidden size 1024, that alone exceeds 1,000 GB — far beyond any single GPU or TPU.

The key insight is elegant: if you compute self-attention in a blockwise fashion (block-by-block), the order in which you process key-value blocks does not matter, as long as you combine the statistics correctly. This permutation invariance means you can place devices in a ring, have each device hold one query block, and rotate key-value blocks around the ring. While a device computes attention against the current key-value block, it simultaneously sends that block to the next device and receives a new block from the previous device. If the computation takes longer than the communication, the communication is completely hidden — zero overhead.
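The order-invariant combination relies on the same online-softmax statistics used by FlashAttention. The following single-process sketch (our own simplification — real Ring Attention overlaps each loop iteration with device-to-device sends and receives) shows why the result does not depend on the order of key-value blocks:

```python
import numpy as np

def ring_attention_sim(q, k_blocks, v_blocks):
    """Simulate the ring schedule for one query block: attend to KV blocks
    one at a time, maintaining a running row-max m, softmax normalizer l,
    and weighted output o, so any block order yields the same result."""
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)      # running row-max of logits
    l = np.zeros(q.shape[0])              # running softmax normalizer
    o = np.zeros_like(q)                  # running weighted output
    for k, v in zip(k_blocks, v_blocks):  # stand-in for KV rotating the ring
        s = q @ k.T / np.sqrt(d)          # attention logits for this block
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])
        scale = np.exp(m - m_new)         # rescale old stats to the new max
        l = l * scale + p.sum(axis=-1)
        o = o * scale[:, None] + p @ v
        m = m_new
    return o / l[:, None]
```

On real hardware, the loop body is where computation and communication overlap: while the matmuls for the current block run, the block itself is already in flight to the next device.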

Read more »

1. What This Paper Does

Mamba introduces a selective state space model (selective SSM) that, for the first time, achieves Transformer-quality performance on language modeling while scaling linearly in sequence length — not quadratically like standard attention. The key insight is deceptively simple: make the parameters of a state space model depend on the input, so the model can choose what to remember and what to forget at each timestep. This seemingly small change has profound consequences: it breaks the mathematical equivalence between SSMs and convolutions that prior work relied on for efficiency, forcing the authors to invent a new hardware-aware parallel algorithm. The result is Mamba, a clean architecture with no attention and no MLP blocks, that runs at 5× the inference throughput of Transformers and matches or exceeds their quality across language, audio, and genomics.
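The selectivity mechanism can be illustrated with a toy recurrence. This is our own simplified single-channel sketch, not Mamba's exact parameterization (which operates over many channels with a fused, hardware-aware scan), but it shows the key move: B, C, and the step size all depend on the current input.

```python
import numpy as np

def selective_scan(x, A, W_b, W_c, W_dt):
    """Sequential form of a toy selective SSM for one channel.
    x: (T,) input; A: (N,) diagonal state matrix (negative entries);
    W_b, W_c: (N,) weights; W_dt: scalar. Because B_t, C_t, dt_t are
    functions of x[t], the state can gate what it stores or forgets --
    this input dependence is what breaks the convolutional shortcut."""
    N, T = A.shape[0], x.shape[0]
    h = np.zeros(N)
    y = np.zeros(T)
    for t in range(T):
        dt = np.log1p(np.exp(W_dt * x[t]))  # softplus: positive step size
        B_t = W_b * x[t]                    # input-dependent input projection
        C_t = W_c * x[t]                    # input-dependent output projection
        A_bar = np.exp(dt * A)              # zero-order-hold discretization
        h = A_bar * h + dt * B_t * x[t]     # selective state update
        y[t] = C_t @ h
    return y
```

With fixed (non-input-dependent) B, C, and dt, this recurrence collapses to a linear time-invariant system computable as a convolution; making them input-dependent forfeits that shortcut, which is exactly why Mamba needs its custom parallel scan.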

Read more »

Executive Summary

This paper addresses one of the most critical bottlenecks in modern large language model training: optimizer state memory consumption. While most practitioners focus on reducing parameter count through methods like LoRA, GaLore takes a different approach by attacking the actual memory-dominant term—the first and second moment estimates maintained by optimizers like AdamW.

The key innovation is elegant: instead of forcing weights into low-rank spaces (which constrains model expressivity), GaLore exploits the observation that gradient matrices naturally exhibit low-rank structure during training. By decomposing gradients, performing optimizer updates in the compressed rank-r space, and projecting updates back to full rank, the method achieves the memory savings of low-rank methods while still allowing full-parameter learning.
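One GaLore-style update can be sketched as follows. This is a simplified rendering under our own naming — the function names and the bias-correction details are ours, and we omit scheduling details like the per-layer rank choice:

```python
import numpy as np

def refresh_projector(G, r):
    """Periodically recompute the rank-r projector from an SVD of the
    current gradient (GaLore refreshes this every few hundred steps)."""
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :r]

def galore_step(W, G, P, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style update with GaLore's projection. G: full gradient
    (d_out, d_in); P: (d_out, r) projector. The moments m, v live in the
    small (r, d_in) space, so optimizer memory shrinks from
    2*d_out*d_in to 2*r*d_in floats."""
    R = P.T @ G                       # project gradient into rank-r space
    m = b1 * m + (1 - b1) * R         # first moment, low-rank
    v = b2 * v + (1 - b2) * R**2      # second moment, low-rank
    m_hat = m / (1 - b1**t)           # standard Adam bias correction
    v_hat = v / (1 - b2**t)
    update = P @ (m_hat / (np.sqrt(v_hat) + eps))  # back to full rank
    return W - lr * update, m, v
```

Note that the weights W themselves remain full-rank throughout; only the optimizer's bookkeeping is compressed.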

Read more »

Abstract

Training extremely large deep learning models — with billions or even hundreds of billions of parameters — on distributed GPU clusters is one of the most challenging engineering problems in modern machine learning. Today, achieving high throughput requires manually combining multiple forms of parallelism (data, tensor/operator, pipeline) in ways that are specific to both the model architecture and the cluster topology. This manual process demands deep systems expertise and does not generalize across different models or hardware setups. Alpa is a compiler system that automates this entire process. It introduces a new hierarchical view of parallelism — distinguishing between intra-operator and inter-operator parallelism — and uses a combination of Integer Linear Programming (ILP) and Dynamic Programming (DP) to automatically generate near-optimal execution plans. Evaluated on GPT-3, GShard MoE, and Wide-ResNet at up to 64 GPUs, Alpa matches or outperforms hand-tuned systems like Megatron-LM and DeepSpeed, and generalizes to models that have no manually-designed parallelization strategies at all.

Read more »

Table of Contents

  1. Introduction: Why Model Quantization Matters
  2. Prerequisites: Background Knowledge You Need
  3. The GPTQ Method: Three Key Insights
  4. The Full Algorithm
  5. Experimental Results and Analysis
  6. Practical Speedups and Deployment
  7. Extreme Quantization and Grouping
  8. Limitations and Discussion
  9. Conclusion and Impact

1. Introduction: Why Model Quantization Matters

Consider the practical challenge of deploying a state-of-the-art large language model. GPT-3, with its 175 billion parameters, requires 326 GB of memory when stored in the compact FP16 (16-bit floating point) format. This exceeds the capacity of even the most powerful single GPU available (NVIDIA A100 with 80 GB), meaning you need at least 5 GPUs just for inference — not training, just running the model to generate text.
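The arithmetic behind these numbers is worth making explicit — a few lines of Python (our own illustration) reproduce the 326 GB figure and the GPU count:

```python
def model_memory_gib(n_params, bits_per_param):
    """Weight memory in GiB for a model stored at a given precision."""
    return n_params * bits_per_param / 8 / 1024**3

gpt3_fp16 = model_memory_gib(175e9, 16)  # ~326 GiB, matching the text
gpt3_int4 = model_memory_gib(175e9, 4)   # ~81 GiB after 4-bit quantization
gpus_needed = -(-gpt3_fp16 // 80)        # ceil-divide by 80 GiB per A100
```

The same arithmetic shows why quantization to 4 bits is so attractive: the 4-bit model fits comfortably on two 80 GB GPUs, and (with grouping overhead aside) approaches single-GPU territory.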

Read more »

Abstract

Reinforcement learning (RL) has become a cornerstone of modern AI — from training robots to walk, to aligning large language models with human preferences via RLHF. At the heart of many of these breakthroughs lies a deceptively simple algorithm called Proximal Policy Optimization (PPO), introduced by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov at OpenAI in 2017.

PPO proposes a new family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent. Unlike standard policy gradient methods that perform one gradient update per data sample, PPO introduces a novel clipped objective function that enables multiple epochs of minibatch updates without catastrophically large policy changes. The result is an algorithm that inherits the stability of Trust Region Policy Optimization (TRPO) while being dramatically simpler to implement — requiring only a few lines of code change to a vanilla policy gradient implementation.
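The clipped objective really is only a few lines. A minimal sketch (our own rendering of the paper's L^CLIP, written to be maximized):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective (maximized in training).
    ratio = pi_theta(a|s) / pi_theta_old(a|s). Taking the min with the
    clipped term removes any incentive to push the ratio outside
    [1 - eps, 1 + eps], which is what keeps repeated minibatch epochs
    from blowing up the policy."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()
```

With a positive advantage, the objective stops rewarding ratios above 1 + eps; with a negative advantage, it stops rewarding ratios below 1 - eps — the pessimistic min handles both cases with one expression.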

Read more »

Abstract

Reinforcement learning (RL) has become a cornerstone of modern AI, from training robots to walk to aligning large language models with human preferences via RLHF. At the core of these breakthroughs is a seemingly simple yet far-reaching algorithm: Proximal Policy Optimization (PPO), introduced by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov at OpenAI in 2017.

PPO proposes a new family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective with stochastic gradient ascent. Unlike standard policy gradient methods, which perform only one gradient update per data sample, PPO introduces a novel clipped objective that allows multiple epochs of minibatch updates on the same batch of data without catastrophically large policy changes. The result: PPO inherits the stability of Trust Region Policy Optimization (TRPO) while being dramatically simpler, requiring only a few lines of code change over vanilla policy gradients.

Experiments show that PPO performs strongly across a wide range of benchmarks: continuous control tasks (MuJoCo), complex 3D humanoid locomotion (Roboschool), and Atari games. PPO strikes an excellent balance between sample efficiency, simplicity, and wall-clock time.

Why does this paper still matter today? PPO is arguably the most influential RL algorithm of the deep learning era. It became the default algorithm for robot training and game AI, and, most critically, it is the core optimization engine of the RLHF pipeline: human-preference alignment for large language models such as ChatGPT, Claude, and Gemini depends on it. Understanding PPO is essential background for anyone working on AI alignment, LLM training, or modern RL systems.

Read more »

MiRA: A Subgoal-driven Framework for Improving Long-Horizon LLM Agents — Detailed Technical Review

Paper: A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
Authors: Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette
Affiliation: Google DeepMind
Published: March 20, 2026 (arXiv: 2603.19685)
Reviewer: Zhongzhu Zhou
Review Date: March 23, 2026


I. Prerequisites: What You Need to Know

Before diving into the paper's contributions, let us establish the foundational concepts necessary for understanding the MiRA framework. This section is designed to make the paper accessible even if you are new to the intersection of LLM agents and reinforcement learning.

Read more »

Author: Zhongzhu Zhou
Paper: Attention Is All You Need (NeurIPS 2017)
ArXiv: https://arxiv.org/abs/1706.03762


Abstract

In June 2017, a team of eight researchers at Google Brain and Google Research published what would become arguably the most influential paper in modern artificial intelligence. "Attention Is All You Need" introduced the Transformer, a neural network architecture that completely discards the recurrent and convolutional building blocks that had dominated sequence modeling for decades, relying instead entirely on attention mechanisms. The result was a model that was not only more powerful but dramatically more parallelizable, training to state-of-the-art quality on machine translation in just 3.5 days on 8 GPUs — a fraction of the cost of competing approaches.
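The mechanism at the heart of the architecture fits in a few lines. A minimal NumPy sketch of scaled dot-product attention, the paper's softmax(QK^T / sqrt(d_k)) V (single head, no masking or batching):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # similarity of queries to keys
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # convex combination of values
```

Every query attends to every key in a single matrix multiplication — that all-pairs structure is what makes the computation trivially parallelizable, and also what makes its cost quadratic in sequence length.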

Read more »

Author: Zhongzhu Zhou
Paper: BitNet: Scaling 1-bit Transformers for Large Language Models (arXiv 2023)
ArXiv: https://arxiv.org/abs/2310.11453


Abstract

Large language models have grown so large that their deployment costs — in terms of memory, compute, and energy — now rival or exceed the cost of training itself. BitNet, introduced by researchers at Microsoft Research, the University of Chinese Academy of Sciences, and Tsinghua University, proposes a radical solution: train Transformer-based language models with 1-bit weights from scratch. By replacing the standard nn.Linear layer with a custom BitLinear layer that binarizes weights to +1 or −1 and quantizes activations to 8-bit integers, BitNet achieves competitive performance to full-precision (FP16) Transformers while dramatically reducing memory footprint and energy consumption. More provocatively, the authors demonstrate that BitNet follows a scaling law similar to full-precision models, suggesting that 1-bit architectures could scale to hundreds of billions of parameters without sacrificing the predictable performance improvements that make scaling worthwhile.
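The forward pass of the quantized layer can be sketched as follows. This is our own simplified rendering — it omits the LayerNorm BitNet folds into BitLinear and the straight-through estimator used for training — but it shows the two quantizers and the rescaling:

```python
import numpy as np

def bitlinear_forward(x, W, eps=1e-5):
    """Simplified BitLinear-style forward pass.
    Weights are binarized to +1/-1 around their mean and rescaled by
    beta (mean absolute weight); activations are quantized to 8-bit
    integers via absmax scaling, then the product is dequantized."""
    W_b = np.sign(W - W.mean())              # 1-bit weights: {-1, +1}
    beta = np.abs(W).mean()                  # scalar scale for the weights
    gamma = np.abs(x).max() + eps            # absmax scale for activations
    x_q = np.clip(np.round(x / gamma * 127), -127, 127)  # int8 range
    return (x_q @ W_b.T) * (beta * gamma / 127)  # integer matmul, dequantize
```

The payoff is that the matmul itself involves only {-1, +1} weights and 8-bit activations, which is what drives the memory and energy savings the paper reports.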

Read more »

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (SC 2020 / arXiv 2019)
ArXiv: https://arxiv.org/abs/1910.02054
Authors: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He (Microsoft)


Abstract

ZeRO (Zero Redundancy Optimizer) is one of the most important systems papers in the large-model training era. It fundamentally rethinks how memory is consumed during distributed deep learning training and proposes a family of optimizations that eliminate redundant memory storage across data-parallel processes — without sacrificing computational efficiency. The result is staggering: ZeRO enables training of models with over 100 billion parameters on 400 GPUs with super-linear speedup, achieving 15 PetaFlops of throughput. This represents an 8× increase in trainable model size and 10× improvement in throughput over the state-of-the-art at the time. Perhaps most importantly, ZeRO democratizes large model training: it allows data scientists to train models with up to 13 billion parameters using nothing more than standard data parallelism — no model parallelism, no pipeline parallelism, no model refactoring required. ZeRO is the backbone of Microsoft's DeepSpeed library and powered Turing-NLG, which at the time was the world's largest language model (17B parameters).
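The memory accounting behind ZeRO's three stages is simple enough to compute directly. A sketch using the paper's mixed-precision Adam accounting (2 bytes/param for fp16 weights, 2 for fp16 gradients, and K = 12 for fp32 master weights, momentum, and variance); the function and its name are ours:

```python
def zero_memory_gb(n_params, n_gpus, stage):
    """Per-GPU model-state memory (GB) under each ZeRO stage."""
    p, g, opt = 2 * n_params, 2 * n_params, 12 * n_params
    if stage == 0:    # plain data parallelism: everything replicated
        total = p + g + opt
    elif stage == 1:  # ZeRO-1: partition optimizer states across GPUs
        total = p + g + opt / n_gpus
    elif stage == 2:  # ZeRO-2: also partition gradients
        total = p + (g + opt) / n_gpus
    else:             # ZeRO-3: also partition the parameters themselves
        total = (p + g + opt) / n_gpus
    return total / 1e9

# The paper's running example: a 7.5B-parameter model on 64 GPUs goes
# from 120 GB per GPU (replicated) to under 2 GB per GPU with ZeRO-3.
```

The 16 bytes/param baseline is also why a 13B model fits on a single 32 GB GPU only once the optimizer states are partitioned away — the states, not the weights, dominate.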

Read more »