FlashAttention — In-Depth Technical Review (English)
Author: Zhongzhu Zhou
Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (arXiv 2022 / NeurIPS 2022)
ArXiv: https://arxiv.org/abs/2205.14135
Abstract
FlashAttention is one of the papers that changed how the community thinks about efficient Transformer attention. The core point is subtle but important: many previous attempts to accelerate attention focused on reducing FLOPs, while real GPU runtime is often dominated by memory traffic rather than arithmetic. The paper argues that exact attention can be made much faster, without changing model semantics, if the algorithm is redesigned around the GPU memory hierarchy rather than around the usual matrix formula alone. The result is an exact attention kernel that avoids materializing the full attention matrix in high-bandwidth memory (HBM), reduces the extra memory required from quadratic to linear in sequence length, and delivers large end-to-end speedups on BERT, GPT-2, and long-context tasks. In my view, the paper matters because it turned attention optimization from a mostly mathematical approximation game into a systems problem with a rigorous IO model.
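To make the "no full attention matrix in HBM" claim concrete, here is a minimal NumPy sketch of exact attention computed block-by-block over the keys with an online softmax, so only a block-sized slab of scores exists at any time. This is an illustration of the underlying idea, not the paper's CUDA kernel: the function names and the block size are my own choices, and the real kernel also tiles the queries and keeps the working tiles in on-chip SRAM.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])                 # (N, N) scores in memory
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=64):
    """Exact attention via key-block tiling and an online softmax,
    so extra memory is linear in N (illustrative sketch only)."""
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)      # running row-wise max of scores
    l = np.zeros(N)              # running row-wise softmax denominator
    scale = 1.0 / np.sqrt(d)
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                           # only an (N, block) tile
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)                      # rescale old statistics
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
    assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V), atol=1e-6)
    print("tiled output matches naive attention")
```

The check at the end is the point of the sketch: the tiled version produces the same output as the naive one, which is what "exact" means here; the savings come entirely from never holding the N x N matrix at once.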