
FlashAttention — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (arXiv 2022 / NeurIPS 2022)
ArXiv: https://arxiv.org/abs/2205.14135


Abstract

FlashAttention is one of the papers that changed how the community thinks about efficient Transformer attention. The core point is subtle but extremely important: many previous attempts to accelerate attention focused on reducing FLOPs, while real GPU runtime was often dominated by memory traffic rather than arithmetic. This paper argues that exact attention can be made much faster without changing model semantics if we redesign the algorithm around the GPU memory hierarchy instead of around the usual matrix formula alone. The result is an exact attention kernel that avoids materializing the full attention matrix in high-bandwidth memory, cuts extra memory from quadratic to linear in sequence length, and delivers large end-to-end speedups on BERT, GPT-2, and long-context tasks. In my view, the paper matters because it turned “attention optimization” from a mostly mathematical approximation game into a systems problem with a rigorous IO model.
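To make the "never materialize the full attention matrix" idea concrete, here is a minimal NumPy sketch of tiled exact attention with an online softmax: K/V are processed block by block while only a running row-max, normalizer, and output accumulator are kept. This is the algorithmic core only (single head, no masking, no on-chip tiling of Q); the real kernel is a fused CUDA implementation.

```python
import numpy as np

def flash_attention(Q, K, V, block_size=64):
    """Tiled exact attention with online softmax: stream over K/V blocks,
    keeping only a running row-wise max, softmax normalizer, and output
    accumulator, so the full N x N score matrix is never materialized."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q, dtype=np.float64)   # output accumulator
    m = np.full(N, -np.inf)                  # running row-wise score max
    l = np.zeros(N)                          # running softmax normalizer
    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale               # (N, block) score tile
        m_new = np.maximum(m, S.max(axis=1))
        # rescale previous accumulators to the new max, then fold in tile
        alpha = np.exp(m - m_new)
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ Vb
        m = m_new
    return O / l[:, None]
```

The output matches naive softmax attention exactly (up to floating-point rounding), which is the whole point: same semantics, different memory-access pattern.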

Read more »

LoRA: Low-Rank Adaptation of Large Language Models — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: LoRA: Low-Rank Adaptation of Large Language Models (ICLR 2022)
ArXiv: https://arxiv.org/abs/2106.09685


TL;DR

LoRA freezes pretrained model weights and injects trainable low-rank matrices into selected linear layers. This simple reparameterization preserves downstream quality while reducing trainable parameters by orders of magnitude and cutting optimizer-state memory sharply. The practical consequence is that many task-specific adapters can be trained and deployed cheaply on top of one frozen base model. The design also opened a long line of adapter-style methods used in modern LLM fine-tuning pipelines.
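As a minimal sketch of the reparameterization (class name and initialization constants are illustrative, not a real library API; practical implementations such as `peft` wrap framework layers, but the math is the same): the frozen weight W is augmented with a trainable low-rank product B·A, scaled by alpha/r, so only r·(d_in + d_out) parameters per layer receive gradients.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update:
        y = x @ (W + (alpha / r) * B @ A).T
    B is zero-initialized, so at the start of fine-tuning the layer
    behaves exactly like the pretrained model."""
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                  # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (r, d_in))     # trainable, small init
        self.B = np.zeros((d_out, r))               # trainable, zero init
        self.scale = alpha / r

    def forward(self, x):
        # base path + low-rank path; only A and B would be trained
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B starts at zero, the adapter is a no-op until training moves it, and at deployment B·A can be merged into W so inference latency is unchanged.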

Read more »

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM — In-Depth Technical Review (English)

Author: Steve
TL;DR: I think this paper is one of the clearest “systems bridge” papers in large-model training. Its core contribution is not a brand-new model architecture, but a training recipe for making very large Transformers practical: use tensor parallelism inside a node, pipeline parallelism across nodes, and data parallelism across replicas, then make the composition efficient with a better pipeline schedule and communication-aware engineering.
Estimated reading time: 28–35 minutes
Paper: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021)
ArXiv: https://arxiv.org/abs/2104.04473
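A tiny sketch of the 3D composition described above: every global GPU rank maps into a (tensor, pipeline, data) grid, with tensor parallelism fastest-varying so that its heavy all-reduces stay inside a node. The grid sizes here are illustrative, not the paper's exact configuration.

```python
def rank_to_grid(rank, tp=8, pp=4, dp=2):
    """Map a global rank into (tensor, pipeline, data) coordinates.
    Tensor-parallel ranks vary fastest (kept within one node), then
    pipeline stages across nodes, then data-parallel replicas."""
    t = rank % tp
    p = (rank // tp) % pp
    d = rank // (tp * pp)
    return t, p, d
```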

Read more »

Rethinking Memory and Communication Costs for Efficient Large Language Model Training — In-Depth Technical Review (English)

Author: Steve
Paper: Rethinking Memory and Communication Costs for Efficient Large Language Model Training (arXiv 2310.06003, 2023)
ArXiv: https://arxiv.org/abs/2310.06003
TL;DR: PaRO is a practical systems paper that rebalances ZeRO-style sharding by allowing partial redundancy to cut expensive cross-group communication, then adds HO-Ring to improve inter-node collective efficiency. The key results are a 1.19×–2.50× throughput gain over prior baselines and a 36.5% communication-efficiency gain over the standard Ring algorithm in their setting.


Read more »

Speculative Decoding — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: Fast Inference from Transformers via Speculative Decoding (ICML 2023 Oral)
ArXiv: https://arxiv.org/abs/2211.17192


TL;DR (1-minute version)

Speculative decoding is a way to make large language model (LLM) inference faster without changing the final output distribution of the target model. The key trick is simple but powerful: use a much smaller, faster draft model to “guess” several future tokens, then ask the large target model to verify those guesses in one parallel pass. If a guessed token is valid under a mathematically correct acceptance rule, keep it; if not, correct at the first mismatch and continue.
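The "mathematically correct acceptance rule" mentioned above is worth seeing in code. A sketch over explicit distribution vectors (real implementations work on the model's logits): accept a draft token x ~ q with probability min(1, p(x)/q(x)); on rejection, resample from the renormalized residual max(p − q, 0). The composite sample is distributed exactly according to the target p, which is why the method is lossless.

```python
import numpy as np

def accept_or_resample(p_target, q_draft, token, rng):
    """Lossless speculative-decoding acceptance for one position.
    `p_target` / `q_draft` are the target and draft next-token
    distributions; `token` was sampled from q_draft."""
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token                          # accept the draft guess
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()                # renormalized correction dist.
    return rng.choice(len(p_target), p=residual)
```

In the full algorithm this rule is applied position by position over a block of drafted tokens: everything before the first rejection is kept, the rejected position is replaced by the residual sample, and drafting resumes from there.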

Read more »

InstructGPT (2203.02155) — Technical Review

TL;DR (1 minute): InstructGPT is the paper that turned “next-token prediction models” into “helpful assistant models” at scale. The core pipeline is simple but powerful: (1) supervised fine-tuning on human-written demonstrations, (2) reward-model training from pairwise preference data, and (3) PPO optimization against that reward while constraining drift from the base model via a KL penalty. I think this paper’s long-term contribution is not just better outputs, but a production training recipe that changed how almost all modern assistant models are built.
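Step (3) is the part people most often get wrong, so a small sketch of the reward shaping helps (the coefficient and shaping details here are illustrative, not the paper's exact hyperparameters): during PPO, every generated token receives a per-token KL penalty against the SFT reference model, and the scalar reward-model score is added at the final token.

```python
import numpy as np

def rlhf_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards for an InstructGPT-style PPO stage:
    -beta * (log pi - log pi_ref) at every token constrains drift from
    the SFT model; the reward-model score lands on the last token."""
    kl_penalty = -beta * (np.asarray(logp_policy) - np.asarray(logp_ref))
    rewards = kl_penalty.copy()
    rewards[-1] += rm_score
    return rewards
```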

Estimated reading time: 45–60 minutes

Read more »

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (COLM 2024)
ArXiv: https://arxiv.org/abs/2308.08155


TL;DR: AutoGen is a practical framework for building multi-agent LLM systems by modeling each role (planner, coder, reviewer, tool user, human proxy) as a conversational agent. The key idea is not only “more agents,” but programmable interaction patterns that allow humans and tools to be inserted where reliability matters. It improves developer productivity and flexibility, but it does not magically solve verification, cost control, or loop safety.
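A toy illustration of the "conversable agent" pattern (names and the reply-function interface are my simplification, not AutoGen's actual API): each role is an agent with a reply function over the shared message history, and a conversation is just agents alternating replies until one terminates. Humans and tools slot in as agents with the same interface.

```python
class Agent:
    """Minimal conversable agent: a name plus a reply function that
    maps the message history to the next message (or None to stop)."""
    def __init__(self, name, reply_fn):
        self.name, self.reply_fn = name, reply_fn

def chat(a, b, opening, max_turns=4):
    """Two agents alternate replies until one returns None or the
    turn budget runs out; returns the transcript."""
    history = [(a.name, opening)]
    speaker, other = b, a
    for _ in range(max_turns):
        msg = speaker.reply_fn(history)
        if msg is None:
            break                        # this agent chose to terminate
        history.append((speaker.name, msg))
        speaker, other = other, speaker
    return history
```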
Estimated reading time: 30–40 minutes

Read more »

Generative Agents: Interactive Simulacra of Human Behavior — Technical Review (EN)

TL;DR: This paper introduces LLM-powered “generative agents” with a memory-retrieval-reflection-action loop, and demonstrates believable long-horizon social behavior in a sandbox town.
Estimated reading time: 18–22 minutes

1) What problem is being solved?

Large language models can generate fluent text, but realistic persistent behavior (remembering prior events, planning routines, and coordinating socially) is hard. The paper asks: can we build autonomous agents that behave coherently over days, not just one-shot prompts?

2) Core method (high level)

The architecture has three main components:

  • Memory stream: every observed event and self-generated action is stored as natural-language memory.
  • Retrieval with relevance/recency/importance: when deciding what to do, agents retrieve salient memories rather than the full history.
  • Reflection: agents periodically summarize experience into higher-level inferences (e.g., social beliefs, goals), which improves consistency.
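The retrieval component above combines three signals into one score. A sketch with illustrative weights and decay rate (the paper's exact constants differ): recency decays exponentially with time since last access, importance is an LLM-rated 1–10 value, and relevance is cosine similarity between the memory and the query.

```python
import math

def retrieval_score(memory, query_embedding, now,
                    w_recency=1.0, w_importance=1.0, w_relevance=1.0,
                    decay=0.995):
    """Weighted sum of recency, importance, and relevance for one memory.
    `memory` is a dict with `last_access` (seconds), `importance` (1-10),
    and `embedding` (list of floats)."""
    hours = (now - memory["last_access"]) / 3600.0
    recency = decay ** hours                       # exponential time decay
    importance = memory["importance"] / 10.0       # normalize 1-10 rating
    dot = sum(a * b for a, b in zip(memory["embedding"], query_embedding))
    na = math.sqrt(sum(a * a for a in memory["embedding"]))
    nb = math.sqrt(sum(b * b for b in query_embedding))
    relevance = dot / (na * nb)                    # cosine similarity
    return (w_recency * recency + w_importance * importance
            + w_relevance * relevance)
```

At each decision point the agent scores all memories this way and puts only the top-ranked ones into the prompt, which is what keeps long-horizon behavior coherent without an ever-growing context.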
Read more »

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering — Deep Technical Review (English)

Author: Zhongzhu Zhou
Paper: SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (arXiv 2024)
ArXiv: https://arxiv.org/abs/2405.15793


TL;DR

If you only remember one thing from this paper, remember this: for coding agents, interface quality is model quality. SWE-agent is not mainly a new foundation model; it is a better agent-computer interface (ACI) for software engineering workflows. That single design choice materially improves autonomous bug-fixing performance.

The paper reports strong gains, including 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix. Those numbers are important, but the deeper contribution is architectural: an LLM agent should interact with repositories via a constrained, software-native action interface, not a fragile free-form command stream.
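To make "constrained, software-native action interface" concrete, here is a toy sketch (action names and the windowing are my illustration, not SWE-agent's actual command set): the agent picks from a small set of repository actions, each returning short, structured feedback, instead of emitting free-form shell commands whose output may flood the context.

```python
class RepoACI:
    """Toy agent-computer interface over an in-memory repository:
    a bounded action set with concise, windowed feedback."""
    def __init__(self, files, window=5):
        self.files = files              # {path: list of source lines}
        self.window = window            # cap on feedback size

    def open(self, path, line=0):
        """Show a small numbered window of a file, not the whole file."""
        lines = self.files[path][line:line + self.window]
        return "\n".join(f"{line + i}: {s}" for i, s in enumerate(lines))

    def edit(self, path, start, end, new_lines):
        """Replace a line range, then echo the edited region back."""
        self.files[path][start:end] = new_lines
        return self.open(path, max(0, start - 1))

    def search(self, term):
        """Return at most `window` (path, line) hits for a term."""
        hits = [(p, i) for p, ls in self.files.items()
                for i, s in enumerate(ls) if term in s]
        return hits[:self.window]
```

The design point is the one the paper makes: every action is cheap for the model to emit correctly and every observation is small enough to reason over, which is where much of the pass@1 gain comes from.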

Read more »

Self-Refine (arXiv:2303.17651) — Technical Review

TL;DR: Self-Refine turns one-shot prompting into a simple generate → critique → revise loop that runs with a single LLM and no extra training. Across seven tasks (sentiment reversal, review rewriting, dialogue response, code optimization, etc.), iterative self-feedback substantially improves quality while staying easy to deploy.
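The loop is simple enough to sketch in full (the three callables stand in for three prompts to the same LLM; they are hypothetical stand-ins, not the paper's exact prompt text): generate a draft, ask the model to critique it, revise with the feedback, and stop when the critic finds nothing to fix or the iteration budget runs out.

```python
def self_refine(task, generate, critique, revise, max_iters=3):
    """Generate -> critique -> revise with a single model, no training.
    `critique` returns falsy feedback when the draft is good enough."""
    draft = generate(task)
    for _ in range(max_iters):
        feedback = critique(task, draft)
        if not feedback:                  # critic signals "good enough"
            break
        draft = revise(task, draft, feedback)
    return draft
```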

Estimated reading time: 30–40 minutes


0. Why this paper matters

Imagine writing an essay in one pass and submitting immediately. Most people do better if they:

  1. write a draft,
  2. review their own draft with a checklist,
  3. revise,
  4. repeat once or twice.

Self-Refine applies this exact human workflow to LLM inference. Instead of forcing the model to be perfect in one shot, we let it “think in rounds.” The same model plays three roles:

Read more »

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — In-Depth Technical Review (English)

TL;DR: DeepSeekMath combines large-scale math-centric continued pretraining with a reinforcement-learning stage built around GRPO (Group Relative Policy Optimization), and shows that an open 7B model can become highly competitive on difficult math benchmarks when data curation and RL objective design are tightly coupled.

Estimated reading time: 20–25 minutes

Author: Zhongzhu Zhou
Paper: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (arXiv 2024)
ArXiv: https://arxiv.org/abs/2402.03300
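The distinctive part of the RL stage is how GRPO computes advantages: it drops the learned value baseline and instead standardizes rewards within a group of sampled responses to the same prompt. A sketch of that computation only (the full objective adds PPO-style clipped ratios and a KL term):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each response's advantage is its
    reward standardized against the group mean and std, so no value
    network is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)
```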


Read more »

Reflexion: Language Agents with Verbal Reinforcement Learning — Long-Form Technical Review (English)

Author: Zhongzhu Zhou
Paper: Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS 2023)
ArXiv: https://arxiv.org/abs/2303.11366


TL;DR

Reflexion replaces expensive parameter updates with a lightweight language-space policy update loop: after each episode, the agent writes a compact reflection (what failed, why, and what to do differently), and this memory conditions the next attempt. The result is practical online adaptation without fine-tuning. In tasks where errors are diagnosable in language (tool misuse, missing constraints, wrong decomposition), Reflexion gives a strong retry efficiency boost over plain ReAct and vanilla prompting.
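The "language-space policy update" reduces to a small retry loop. A sketch with hypothetical callables (`act` runs one episode conditioned on the reflection memory, `evaluate` checks success, `reflect` writes the post-mortem; none of these are the paper's exact prompts), and note that nothing here updates parameters:

```python
def reflexion_loop(task, act, evaluate, reflect, max_trials=3):
    """Retry with verbal reflections: after each failed episode, append
    a self-written reflection to memory and condition the next attempt
    on it. Returns the final trajectory and the reflection memory."""
    memory = []                            # persistent verbal reflections
    trajectory = None
    for _ in range(max_trials):
        trajectory = act(task, memory)
        if evaluate(task, trajectory):
            return trajectory, memory      # success, stop early
        memory.append(reflect(task, trajectory))
    return trajectory, memory              # best effort after max_trials
```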

Read more »