
FlashAttention — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (arXiv 2022 / NeurIPS 2022)
ArXiv: https://arxiv.org/abs/2205.14135


Abstract

FlashAttention is one of the papers that changed how the community thinks about efficient Transformer attention. The core point is subtle but extremely important: many previous attempts to accelerate attention focused on reducing FLOPs, while real GPU runtime was often dominated by memory traffic rather than arithmetic. This paper argues that exact attention can be made much faster without changing model semantics if we redesign the algorithm around the GPU memory hierarchy instead of around the usual matrix formula alone. The result is an exact attention kernel that avoids materializing the full attention matrix in high-bandwidth memory, cuts extra memory from quadratic to linear in sequence length, and delivers large end-to-end speedups on BERT, GPT-2, and long-context tasks. In my view, the paper matters because it turned “attention optimization” from a mostly mathematical approximation game into a systems problem with a rigorous IO model.
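To make the "never materialize the full attention matrix" idea concrete, here is a minimal NumPy sketch of tiled exact attention with an online softmax: K/V are processed block by block while only a running row-max, normalizer, and output accumulator are kept. This is the algorithmic core only (single head, no masking, no on-chip tiling of Q); the real kernel is a fused CUDA implementation.

```python
import numpy as np

def flash_attention(Q, K, V, block_size=64):
    """Tiled exact attention with online softmax: stream over K/V blocks,
    keeping only a running row-wise max, softmax normalizer, and output
    accumulator, so the full N x N score matrix is never materialized."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q, dtype=np.float64)   # output accumulator
    m = np.full(N, -np.inf)                  # running row-wise score max
    l = np.zeros(N)                          # running softmax normalizer
    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale               # (N, block) score tile
        m_new = np.maximum(m, S.max(axis=1))
        # rescale previous accumulators to the new max, then fold in tile
        alpha = np.exp(m - m_new)
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ Vb
        m = m_new
    return O / l[:, None]
```

The output matches naive softmax attention exactly (up to floating-point rounding), which is the whole point: same semantics, different memory-access pattern.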

Read more »

LoRA: Low-Rank Adaptation of Large Language Models — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: LoRA: Low-Rank Adaptation of Large Language Models (ICLR 2022)
ArXiv: https://arxiv.org/abs/2106.09685


TL;DR

LoRA freezes pretrained model weights and injects trainable low-rank matrices into selected linear layers. This simple reparameterization preserves downstream quality while reducing trainable parameters by orders of magnitude and cutting optimizer-state memory sharply. The practical consequence is that many task-specific adapters can be trained and deployed cheaply on top of one frozen base model. The design also opened a long line of adapter-style methods used in modern LLM fine-tuning pipelines.
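As a minimal sketch of the reparameterization (class name and initialization constants are illustrative, not a real library API; practical implementations such as `peft` wrap framework layers, but the math is the same): the frozen weight W is augmented with a trainable low-rank product B·A, scaled by alpha/r, so only r·(d_in + d_out) parameters per layer receive gradients.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update:
        y = x @ (W + (alpha / r) * B @ A).T
    B is zero-initialized, so at the start of fine-tuning the layer
    behaves exactly like the pretrained model."""
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                  # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (r, d_in))     # trainable, small init
        self.B = np.zeros((d_out, r))               # trainable, zero init
        self.scale = alpha / r

    def forward(self, x):
        # base path + low-rank path; only A and B would be trained
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B starts at zero, the adapter is a no-op until training moves it, and at deployment B·A can be merged into W so inference latency is unchanged.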

Read more »

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM — In-Depth Technical Review (English)

Author: Steve
TL;DR: I think this paper is one of the clearest “systems bridge” papers in large-model training. Its core contribution is not a brand-new model architecture, but a training recipe for making very large Transformers practical: use tensor parallelism inside a node, pipeline parallelism across nodes, and data parallelism across replicas, then make the composition efficient with a better pipeline schedule and communication-aware engineering.
Estimated reading time: 28–35 minutes
Paper: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021)
ArXiv: https://arxiv.org/abs/2104.04473
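A tiny sketch of the 3D composition described above: every global GPU rank maps into a (tensor, pipeline, data) grid, with tensor parallelism fastest-varying so that its heavy all-reduces stay inside a node. The grid sizes here are illustrative, not the paper's exact configuration.

```python
def rank_to_grid(rank, tp=8, pp=4, dp=2):
    """Map a global rank into (tensor, pipeline, data) coordinates.
    Tensor-parallel ranks vary fastest (kept within one node), then
    pipeline stages across nodes, then data-parallel replicas."""
    t = rank % tp
    p = (rank // tp) % pp
    d = rank // (tp * pp)
    return t, p, d
```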

Read more »

Rethinking Memory and Communication Costs for Efficient Large Language Model Training — In-Depth Technical Review (English)

Author: Steve
Paper: Rethinking Memory and Communication Costs for Efficient Large Language Model Training (arXiv 2310.06003, 2023)
ArXiv: https://arxiv.org/abs/2310.06003
TL;DR: PaRO is a practical systems paper that rebalances ZeRO-style sharding by allowing partial redundancy to cut expensive cross-group communication, then adds HO-Ring to improve inter-node collective efficiency. The key results are a 1.19×–2.50× throughput gain over prior baselines and a 36.5% communication-efficiency gain over the standard Ring algorithm in their setting.


Read more »

Speculative Decoding — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: Fast Inference from Transformers via Speculative Decoding (ICML 2023 Oral)
ArXiv: https://arxiv.org/abs/2211.17192


TL;DR (1-minute version)

Speculative decoding is a way to make large language model (LLM) inference faster without changing the final output distribution of the target model. The key trick is simple but powerful: use a much smaller, faster draft model to “guess” several future tokens, then ask the large target model to verify those guesses in one parallel pass. If a guessed token is valid under a mathematically correct acceptance rule, keep it; if not, correct at the first mismatch and continue.
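The "mathematically correct acceptance rule" mentioned above is worth seeing in code. A sketch over explicit distribution vectors (real implementations work on the model's logits): accept a draft token x ~ q with probability min(1, p(x)/q(x)); on rejection, resample from the renormalized residual max(p − q, 0). The composite sample is distributed exactly according to the target p, which is why the method is lossless.

```python
import numpy as np

def accept_or_resample(p_target, q_draft, token, rng):
    """Lossless speculative-decoding acceptance for one position.
    `p_target` / `q_draft` are the target and draft next-token
    distributions; `token` was sampled from q_draft."""
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token                          # accept the draft guess
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()                # renormalized correction dist.
    return rng.choice(len(p_target), p=residual)
```

In the full algorithm this rule is applied position by position over a block of drafted tokens: everything before the first rejection is kept, the rejected position is replaced by the residual sample, and drafting resumes from there.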

Read more »

InstructGPT (2203.02155) — Technical Review

TL;DR (1 minute): InstructGPT is the paper that turned “next-token prediction models” into “helpful assistant models” at scale. The core pipeline is simple but powerful: (1) supervised fine-tuning on human-written demonstrations, (2) reward-model training from pairwise preference data, and (3) PPO optimization against that reward while constraining drift from the base model via a KL penalty. I think this paper’s long-term contribution is not just better outputs, but a production training recipe that changed how almost all modern assistant models are built.
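Step (3) is the part people most often get wrong, so a small sketch of the reward shaping helps (the coefficient and shaping details here are illustrative, not the paper's exact hyperparameters): during PPO, every generated token receives a per-token KL penalty against the SFT reference model, and the scalar reward-model score is added at the final token.

```python
import numpy as np

def rlhf_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards for an InstructGPT-style PPO stage:
    -beta * (log pi - log pi_ref) at every token constrains drift from
    the SFT model; the reward-model score lands on the last token."""
    kl_penalty = -beta * (np.asarray(logp_policy) - np.asarray(logp_ref))
    rewards = kl_penalty.copy()
    rewards[-1] += rm_score
    return rewards
```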

Estimated reading time: 45–60 minutes

Read more »

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (COLM 2024)
ArXiv: https://arxiv.org/abs/2308.08155


TL;DR: AutoGen is a practical framework for building multi-agent LLM systems by modeling each role (planner, coder, reviewer, tool user, human proxy) as a conversational agent. The key idea is not only “more agents,” but programmable interaction patterns that allow humans and tools to be inserted where reliability matters. It improves developer productivity and flexibility, but it does not magically solve verification, cost control, or loop safety.
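A toy illustration of the "conversable agent" pattern (names and the reply-function interface are my simplification, not AutoGen's actual API): each role is an agent with a reply function over the shared message history, and a conversation is just agents alternating replies until one terminates. Humans and tools slot in as agents with the same interface.

```python
class Agent:
    """Minimal conversable agent: a name plus a reply function that
    maps the message history to the next message (or None to stop)."""
    def __init__(self, name, reply_fn):
        self.name, self.reply_fn = name, reply_fn

def chat(a, b, opening, max_turns=4):
    """Two agents alternate replies until one returns None or the
    turn budget runs out; returns the transcript."""
    history = [(a.name, opening)]
    speaker, other = b, a
    for _ in range(max_turns):
        msg = speaker.reply_fn(history)
        if msg is None:
            break                        # this agent chose to terminate
        history.append((speaker.name, msg))
        speaker, other = other, speaker
    return history
```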
Estimated reading time: 30–40 minutes

Read more »

Generative Agents: Interactive Simulacra of Human Behavior — Technical Review (EN)

TL;DR: This paper introduces LLM-powered “generative agents” with a memory-retrieval-reflection-action loop, and demonstrates believable long-horizon social behavior in a sandbox town.
Estimated reading time: 18–22 minutes

1) What problem is being solved?

Large language models can generate fluent text, but realistic persistent behavior (remembering prior events, planning routines, and coordinating socially) is hard. The paper asks: can we build autonomous agents that behave coherently over days, not just one-shot prompts?

2) Core method (high level)

The architecture has three main components:

  • Memory stream: every observed event and self-generated action is stored as natural-language memory.
  • Retrieval with relevance/recency/importance: when deciding what to do, agents retrieve salient memories rather than the full history.
  • Reflection: agents periodically summarize experience into higher-level inferences (e.g., social beliefs, goals), which improves consistency.
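The retrieval component above combines three signals into one score. A sketch with illustrative weights and decay rate (the paper's exact constants differ): recency decays exponentially with time since last access, importance is an LLM-rated 1–10 value, and relevance is cosine similarity between the memory and the query.

```python
import math

def retrieval_score(memory, query_embedding, now,
                    w_recency=1.0, w_importance=1.0, w_relevance=1.0,
                    decay=0.995):
    """Weighted sum of recency, importance, and relevance for one memory.
    `memory` is a dict with `last_access` (seconds), `importance` (1-10),
    and `embedding` (list of floats)."""
    hours = (now - memory["last_access"]) / 3600.0
    recency = decay ** hours                       # exponential time decay
    importance = memory["importance"] / 10.0       # normalize 1-10 rating
    dot = sum(a * b for a, b in zip(memory["embedding"], query_embedding))
    na = math.sqrt(sum(a * a for a in memory["embedding"]))
    nb = math.sqrt(sum(b * b for b in query_embedding))
    relevance = dot / (na * nb)                    # cosine similarity
    return (w_recency * recency + w_importance * importance
            + w_relevance * relevance)
```

At each decision point the agent scores all memories this way and puts only the top-ranked ones into the prompt, which is what keeps long-horizon behavior coherent without an ever-growing context.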
Read more »

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering — Deep Technical Review (English)

Author: Zhongzhu Zhou
Paper: SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (arXiv 2024)
ArXiv: https://arxiv.org/abs/2405.15793


TL;DR

If you only remember one thing from this paper, remember this: for coding agents, interface quality is model quality. SWE-agent is not mainly a new foundation model; it is a better agent-computer interface (ACI) for software engineering workflows. That single design choice materially improves autonomous bug-fixing performance.

The paper reports strong gains, including 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix. Those numbers are important, but the deeper contribution is architectural: an LLM agent should interact with repositories via a constrained, software-native action interface, not a fragile free-form command stream.
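To make "constrained, software-native action interface" concrete, here is a toy sketch (action names and the windowing are my illustration, not SWE-agent's actual command set): the agent picks from a small set of repository actions, each returning short, structured feedback, instead of emitting free-form shell commands whose output may flood the context.

```python
class RepoACI:
    """Toy agent-computer interface over an in-memory repository:
    a bounded action set with concise, windowed feedback."""
    def __init__(self, files, window=5):
        self.files = files              # {path: list of source lines}
        self.window = window            # cap on feedback size

    def open(self, path, line=0):
        """Show a small numbered window of a file, not the whole file."""
        lines = self.files[path][line:line + self.window]
        return "\n".join(f"{line + i}: {s}" for i, s in enumerate(lines))

    def edit(self, path, start, end, new_lines):
        """Replace a line range, then echo the edited region back."""
        self.files[path][start:end] = new_lines
        return self.open(path, max(0, start - 1))

    def search(self, term):
        """Return at most `window` (path, line) hits for a term."""
        hits = [(p, i) for p, ls in self.files.items()
                for i, s in enumerate(ls) if term in s]
        return hits[:self.window]
```

The design point is the one the paper makes: every action is cheap for the model to emit correctly and every observation is small enough to reason over, which is where much of the pass@1 gain comes from.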

Read more »

Self-Refine (arXiv:2303.17651) — Technical Review

TL;DR: Self-Refine turns one-shot prompting into a simple generate → critique → revise loop that runs with a single LLM and no extra training. Across seven tasks (sentiment reversal, review rewriting, dialogue response, code optimization, etc.), iterative self-feedback substantially improves quality while staying easy to deploy.
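The loop is simple enough to sketch in full (the three callables stand in for three prompts to the same LLM; they are hypothetical stand-ins, not the paper's exact prompt text): generate a draft, ask the model to critique it, revise with the feedback, and stop when the critic finds nothing to fix or the iteration budget runs out.

```python
def self_refine(task, generate, critique, revise, max_iters=3):
    """Generate -> critique -> revise with a single model, no training.
    `critique` returns falsy feedback when the draft is good enough."""
    draft = generate(task)
    for _ in range(max_iters):
        feedback = critique(task, draft)
        if not feedback:                  # critic signals "good enough"
            break
        draft = revise(task, draft, feedback)
    return draft
```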

Estimated reading time: 30–40 minutes


0. Why this paper matters

Imagine writing an essay in one pass and submitting immediately. Most people do better if they:

  1. write a draft,
  2. review their own draft with a checklist,
  3. revise,
  4. repeat once or twice.

Self-Refine applies this exact human workflow to LLM inference. Instead of forcing the model to be perfect in one shot, we let it “think in rounds.” The same model plays three roles:

Read more »

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — In-Depth Technical Review (English)

TL;DR: DeepSeekMath combines large-scale math-centric continued pretraining with a reinforcement-learning stage built around GRPO (Group Relative Policy Optimization), and shows that an open 7B model can become highly competitive on difficult math benchmarks when data curation and RL objective design are tightly coupled.

Estimated reading time: 20–25 minutes

Author: Zhongzhu Zhou
Paper: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (arXiv 2024)
ArXiv: https://arxiv.org/abs/2402.03300
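The distinctive part of the RL stage is how GRPO computes advantages: it drops the learned value baseline and instead standardizes rewards within a group of sampled responses to the same prompt. A sketch of that computation only (the full objective adds PPO-style clipped ratios and a KL term):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each response's advantage is its
    reward standardized against the group mean and std, so no value
    network is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)
```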


Read more »

Reflexion: Language Agents with Verbal Reinforcement Learning — Long-Form Technical Review (English)

Author: Zhongzhu Zhou
Paper: Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS 2023)
ArXiv: https://arxiv.org/abs/2303.11366


TL;DR

Reflexion replaces expensive parameter updates with a lightweight language-space policy update loop: after each episode, the agent writes a compact reflection (what failed, why, and what to do differently), and this memory conditions the next attempt. The result is practical online adaptation without fine-tuning. In tasks where errors are diagnosable in language (tool misuse, missing constraints, wrong decomposition), Reflexion gives a strong retry efficiency boost over plain ReAct and vanilla prompting.
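The "language-space policy update" reduces to a small retry loop. A sketch with hypothetical callables (`act` runs one episode conditioned on the reflection memory, `evaluate` checks success, `reflect` writes the post-mortem; none of these are the paper's exact prompts), and note that nothing here updates parameters:

```python
def reflexion_loop(task, act, evaluate, reflect, max_trials=3):
    """Retry with verbal reflections: after each failed episode, append
    a self-written reflection to memory and condition the next attempt
    on it. Returns the final trajectory and the reflection memory."""
    memory = []                            # persistent verbal reflections
    trajectory = None
    for _ in range(max_trials):
        trajectory = act(task, memory)
        if evaluate(task, trajectory):
            return trajectory, memory      # success, stop early
        memory.append(reflect(task, trajectory))
    return trajectory, memory              # best effort after max_trials
```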

Read more »