1. What This Paper Does
Ring Attention tackles one of the most stubborn problems in modern deep learning: the memory wall that prevents Transformers from processing long sequences. Even with memory-efficient attention (FlashAttention) and blockwise computation, each Transformer layer's output activations must still be stored, and their size grows linearly with sequence length. For 100 million tokens with a hidden size of 1024, that storage exceeds 1,000 GB, far beyond the memory of any single GPU or TPU.
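As a rough sanity check on that number, here is a back-of-envelope calculation. It assumes fp32 activations and batch size 1; the paper's exact accounting may differ, but the order of magnitude is the point:

```python
# Illustrative arithmetic only; assumes fp32 activations, batch size 1.
seq_len, hidden, bytes_per_val = 100_000_000, 1024, 4

per_layer_gb = seq_len * hidden * bytes_per_val / 1e9
print(f"one layer's output: {per_layer_gb:.0f} GB")  # ~410 GB

# Even a handful of layers' worth of stored activations crosses 1,000 GB.
print(f"three layers: {3 * per_layer_gb:.0f} GB")    # ~1229 GB
```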
The key insight is elegant: if you compute self-attention blockwise, the order in which key-value blocks are processed does not matter, as long as the softmax statistics are combined correctly. This permutation invariance means you can arrange devices in a ring, give each device its own query block, and rotate key-value blocks around the ring. While a device computes attention against its current key-value block, it simultaneously sends that block to the next device and receives a new one from the previous device. As long as the computation takes at least as long as the communication, the communication is fully hidden and adds zero overhead.
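A minimal JAX sketch of this schedule is below, assuming one (tokens, head_dim) query/key/value block per device on a `pmap` axis named `"ring"`; the names are illustrative, not the paper's reference code. The online-softmax accumulation (running max `m`, normalizer `l`, numerator `acc`) is the standard blockwise recurrence that makes the processing order irrelevant. For clarity the `ppermute` runs after each block's compute; a real implementation issues the transfer concurrently so it hides behind the matmuls.

```python
import jax
import jax.numpy as jnp
from functools import partial

@partial(jax.pmap, axis_name="ring")
def ring_attention(q, k, v):
    n = jax.device_count()
    scale = 1.0 / jnp.sqrt(q.shape[-1])
    m = jnp.full(q.shape[:-1], -jnp.inf, dtype=q.dtype)  # running row max
    l = jnp.zeros(q.shape[:-1], dtype=q.dtype)           # running softmax denominator
    acc = jnp.zeros_like(q)                              # running weighted-value sum

    def step(carry, _):
        m, l, acc, k, v = carry
        s = (q @ k.T) * scale                    # scores against current kv block
        m_new = jnp.maximum(m, s.max(-1))
        p = jnp.exp(s - m_new[..., None])
        correction = jnp.exp(m - m_new)          # rescale previously accumulated stats
        l = l * correction + p.sum(-1)
        acc = acc * correction[..., None] + p @ v
        # Rotate the kv block one step around the ring; q stays put.
        perm = [(i, (i + 1) % n) for i in range(n)]
        k = jax.lax.ppermute(k, "ring", perm)
        v = jax.lax.ppermute(v, "ring", perm)
        return (m_new, l, acc, k, v), None

    (m, l, acc, k, v), _ = jax.lax.scan(step, (m, l, acc, k, v), None, length=n)
    return acc / l[..., None]
```

With 8 devices, calling `ring_attention` on arrays of shape `(8, block_len, head_dim)` gives each device one block, and after `n` rotations every query block has seen every key-value block, so the result is exact attention over the full concatenated sequence.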