
MiRA: A Subgoal-driven Framework for Improving Long-Horizon LLM Agents — Detailed Technical Review

Paper: A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
Authors: Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette
Affiliation: Google DeepMind
Published: March 20, 2026 (arXiv: 2603.19685)
Reviewer: Zhongzhu Zhou
Review Date: March 23, 2026


I. Prerequisites: What You Need to Know

Before diving into the paper's contributions, let us establish the foundational concepts necessary for understanding the MiRA framework. This section is designed to make the paper accessible even if you are new to the intersection of LLM agents and reinforcement learning.

Read more »

Attention Is All You Need — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: Attention Is All You Need (NeurIPS 2017)
ArXiv: https://arxiv.org/abs/1706.03762


Abstract

In June 2017, a team of eight researchers at Google Brain and Google Research published what would become arguably the most influential paper in modern artificial intelligence. "Attention Is All You Need" introduced the Transformer, a neural network architecture that completely discards the recurrent and convolutional building blocks that had dominated sequence modeling for years, relying instead entirely on attention mechanisms. The result was a model that was not only more powerful but dramatically more parallelizable, training to state-of-the-art quality on machine translation in just 3.5 days on 8 GPUs — a fraction of the cost of competing approaches.
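As a refresher, the mechanism the whole paper rests on fits in a few lines. Here is a minimal NumPy sketch of scaled dot-product attention (single head, no masking) — illustrative only, not the paper's multi-head implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # convex combination of values
```

Each output row is a weighted average of value vectors, with weights set by query–key similarity; the 1/sqrt(d_k) scaling keeps the logits from saturating the softmax as dimensionality grows.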

Read more »

BitNet: Scaling 1-bit Transformers for Large Language Models — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: BitNet: Scaling 1-bit Transformers for Large Language Models (arXiv 2023)
ArXiv: https://arxiv.org/abs/2310.11453


Abstract

Large language models have grown so large that their deployment costs — in terms of memory, compute, and energy — now rival or exceed the cost of training itself. BitNet, introduced by researchers at Microsoft Research, the University of Chinese Academy of Sciences, and Tsinghua University, proposes a radical solution: train Transformer-based language models with 1-bit weights from scratch. By replacing the standard nn.Linear layer with a custom BitLinear layer that binarizes weights to +1 or −1 and quantizes activations to 8-bit integers, BitNet achieves performance competitive with full-precision (FP16) Transformers while dramatically reducing memory footprint and energy consumption. More provocatively, the authors demonstrate that BitNet follows a scaling law similar to full-precision models, suggesting that 1-bit architectures could scale to hundreds of billions of parameters without sacrificing the predictable performance improvements that make scaling worthwhile.
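To make the BitLinear idea concrete, here is a minimal NumPy sketch of a forward pass with binarized weights and absmax-quantized activations. This is my simplification, not BitNet's actual code: it omits the normalization the paper applies before quantization and the straight-through estimator used in training, and the function name and scaling details are illustrative.

```python
import numpy as np

def bit_linear(x, W, b=8):
    """Sketch of a BitLinear-style forward pass:
    1-bit weights, b-bit absmax-quantized activations."""
    # Binarize weights to +1/-1 around their mean.
    alpha = W.mean()
    W_bin = np.where(W - alpha >= 0, 1.0, -1.0)
    beta = np.abs(W).mean()          # per-tensor scale to recover weight magnitude
    # Absmax quantization of activations to signed b-bit integers.
    Q = 2 ** (b - 1) - 1             # e.g. 127 for b = 8
    gamma = np.abs(x).max()
    x_q = np.clip(np.round(x / gamma * Q), -Q, Q)
    # Low-precision matmul, then dequantize with the two scales.
    return (x_q @ W_bin.T) * (beta * gamma / Q)
```

The point of the structure is that the inner matmul touches only ±1 weights and small integers, which is what drives the memory and energy savings; the two scalar scales are all that remains of full precision.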

Read more »

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (SC 2020 / arXiv 2019)
ArXiv: https://arxiv.org/abs/1910.02054
Paper authors: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He (Microsoft)


Abstract

ZeRO (Zero Redundancy Optimizer) is one of the most important systems papers in the large-model training era. It fundamentally rethinks how memory is consumed during distributed deep learning training and proposes a family of optimizations that eliminate redundant memory storage across data-parallel processes — without sacrificing computational efficiency. The result is staggering: ZeRO enables training of models with over 100 billion parameters on 400 GPUs with super-linear speedup, achieving 15 PetaFlops of throughput. This represents an 8× increase in trainable model size and 10× improvement in throughput over the state-of-the-art at the time. Perhaps most importantly, ZeRO democratizes large model training: it allows data scientists to train models with up to 13 billion parameters using nothing more than standard data parallelism — no model parallelism, no pipeline parallelism, no model refactoring required. ZeRO is the backbone of Microsoft's DeepSpeed library and powered Turing-NLG, which at the time was the world's largest language model (17B parameters).
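The memory accounting behind these claims is easy to reproduce. A sketch using the paper's 16Ψ-byte bookkeeping for mixed-precision Adam (2 bytes fp16 weights + 2 bytes fp16 gradients + 12 bytes fp32 optimizer states per parameter); the stage formulas follow the paper, while the function itself is my illustration:

```python
def zero_memory_gb(params_billions, n_gpus, stage):
    """Per-GPU model-state memory (GB) under the ZeRO stages."""
    psi = params_billions * 1e9
    if stage == 0:                                    # plain data parallelism
        per_gpu = 16 * psi
    elif stage == 1:                                  # shard optimizer states
        per_gpu = 4 * psi + 12 * psi / n_gpus
    elif stage == 2:                                  # + shard gradients
        per_gpu = 2 * psi + 14 * psi / n_gpus
    elif stage == 3:                                  # + shard parameters
        per_gpu = 16 * psi / n_gpus
    else:
        raise ValueError("stage must be 0-3")
    return per_gpu / 1e9
```

For a 7.5B-parameter model on 64 GPUs this reproduces the paper's headline numbers: 120 GB per GPU for plain data parallelism shrinks to roughly 31 GB, 17 GB, and 1.9 GB under stages 1–3, which is why 13B parameters suddenly fit on a single commodity GPU's data-parallel replica.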

Read more »

MetaGPT — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework (arXiv 2023/2024)
ArXiv: https://arxiv.org/abs/2308.00352


Abstract

MetaGPT is an early but still very influential paper in the modern “LLM agents building software” story. Its central claim is that simply letting several language-model agents chat with each other is not enough for reliable complex work, because unconstrained dialogue amplifies ambiguity, hallucination, and wasted tokens. The paper proposes a more disciplined alternative: organize LLM agents like a software company, give them specialized roles, force them to exchange structured artifacts instead of casual conversation, and run the workflow through Standard Operating Procedures (SOPs). On code-generation benchmarks such as HumanEval and MBPP, and on a custom SoftwareDev benchmark, MetaGPT reports stronger results than prior multi-agent systems and argues that the gain comes from workflow engineering as much as from model capability. I think the paper matters because it reframed multi-agent prompting from “more agents means more intelligence” into “the right interfaces, handoff documents, and execution discipline matter at least as much as the number of agents.”
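A toy sketch of the SOP idea: roles exchange typed artifacts in a fixed order instead of free-form chat, and each role refuses anything but the expected input. All role names, artifact fields, and functions here are illustrative placeholders of mine, not MetaGPT's actual API:

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    kind: str            # e.g. "PRD", "Design", "Code"
    author_role: str
    content: dict

def product_manager(requirement: str) -> Artifact:
    # Turns a raw requirement into a structured product document.
    return Artifact("PRD", "ProductManager",
                    {"goal": requirement, "user_stories": ["..."]})

def architect(prd: Artifact) -> Artifact:
    assert prd.kind == "PRD"        # SOP: only accept the expected artifact
    return Artifact("Design", "Architect",
                    {"api": ["run()"], "based_on": prd.content["goal"]})

def engineer(design: Artifact) -> Artifact:
    assert design.kind == "Design"
    return Artifact("Code", "Engineer",
                    {"files": {"main.py": "def run(): pass"}})

def sop_pipeline(requirement: str) -> Artifact:
    # The SOP is the fixed sequence of handoffs, not a free dialogue.
    return engineer(architect(product_manager(requirement)))
```

The interesting design choice is that the interface between agents is a document schema, so ambiguity is caught at handoff time rather than compounding through conversation.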

Read more »

FlashAttention — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (arXiv 2022 / NeurIPS 2022)
ArXiv: https://arxiv.org/abs/2205.14135


Abstract

FlashAttention is one of the papers that changed how the community thinks about efficient Transformer attention. The core point is subtle but extremely important: many previous attempts to accelerate attention focused on reducing FLOPs, while real GPU runtime was often dominated by memory traffic rather than arithmetic. This paper argues that exact attention can be made much faster without changing model semantics if we redesign the algorithm around the GPU memory hierarchy instead of around the usual matrix formula alone. The result is an exact attention kernel that avoids materializing the full attention matrix in high-bandwidth memory, cuts the extra memory required from quadratic to linear in sequence length, and delivers large end-to-end speedups in BERT, GPT-2, and long-context tasks. In my view, the paper matters because it turned “attention optimization” from a mostly mathematical approximation game into a systems problem with a rigorous IO model.
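The enabling trick is the online (streaming) softmax, which lets attention be computed tile by tile without ever storing the full score row. A single-query NumPy sketch — the real kernel also tiles over queries and fuses everything on-chip, so this only illustrates the rescaling algebra:

```python
import numpy as np

def attention_row_online(q, K, V, block=4):
    """softmax(q K^T / sqrt(d)) V for one query, streaming key/value
    blocks and rescaling the running softmax statistics on the fly."""
    d = q.shape[0]
    m = -np.inf                   # running max of scores (stability)
    l = 0.0                       # running softmax denominator
    acc = np.zeros(V.shape[1])    # running unnormalized output
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d)   # this tile's scores
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                     # rescale old statistics
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l
```

Because every tile only updates three small running quantities, the quadratic score matrix never needs to exist in high-bandwidth memory — that is the whole IO-awareness argument in miniature.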

Read more »

LoRA: Low-Rank Adaptation of Large Language Models — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: LoRA: Low-Rank Adaptation of Large Language Models (ICLR 2022)
ArXiv: https://arxiv.org/abs/2106.09685


TL;DR

LoRA freezes pretrained model weights and injects trainable low-rank matrices into selected linear layers. This simple reparameterization preserves downstream quality while reducing trainable parameters by orders of magnitude and cutting optimizer-state memory sharply. The practical consequence is that many task-specific adapters can be trained and deployed cheaply on top of one frozen base model. The design also opened a long line of adapter-style methods used in modern LLM fine-tuning pipelines.
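The reparameterization itself is tiny. A NumPy sketch in which `W0` stays frozen and only the low-rank pair `(A, B)` would receive gradients; `alpha` is LoRA's scaling hyperparameter, and `B` starts at zero so training begins exactly at the frozen model:

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha=16):
    """y = x W0^T + (alpha / r) * x (B A)^T
    W0: (d_out, d_in) frozen; A: (r, d_in), B: (d_out, r) trainable."""
    r = A.shape[0]
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T
```

The parameter saving is just r * (d_in + d_out) trainable values versus d_out * d_in, which for small r is orders of magnitude less — and since the adapter is additive, B A can be merged into W0 at deployment with zero inference overhead.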

Read more »

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM — In-Depth Technical Review (English)

Author: Steve
TL;DR: I think this paper is one of the clearest “systems bridge” papers in large-model training. Its core contribution is not a brand-new model architecture, but a training recipe for making very large Transformers practical: use tensor parallelism inside a node, pipeline parallelism across nodes, and data parallelism across replicas, then make the composition efficient with a better pipeline schedule and communication-aware engineering.
Estimated reading time: 28–35 minutes
Paper: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021)
ArXiv: https://arxiv.org/abs/2104.04473
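The composition of the three parallelism modes can be pictured as a 3D grid of ranks. A sketch of one common layout convention — tensor parallelism innermost, so TP groups stay within a node where interconnect is fastest; the exact ordering here is my assumption of a typical mapping, not necessarily Megatron-LM's internal one:

```python
def parallel_coords(rank, tp, pp, dp):
    """Map a global rank onto (data, pipeline, tensor) grid coordinates,
    with tensor parallelism varying fastest across consecutive ranks."""
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp                 # consecutive ranks share a TP group
    pp_rank = (rank // tp) % pp         # pipeline stage index
    dp_rank = rank // (tp * pp)         # data-parallel replica index
    return dp_rank, pp_rank, tp_rank
```

With tp=8, pp=4, dp=2 this places ranks 0–7 in one tensor-parallel group on one node, stacks four pipeline stages across nodes, and replicates the whole model twice for data parallelism — exactly the intra-node/inter-node split the recipe argues for.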

Read more »

Rethinking Memory and Communication Costs for Efficient Large Language Model Training — In-Depth Technical Review (English)

Author: Steve
Paper: Rethinking Memory and Communication Costs for Efficient Large Language Model Training (arXiv 2310.06003, 2023)
ArXiv: https://arxiv.org/abs/2310.06003
TL;DR: PaRO is a practical systems paper that rebalances ZeRO-style sharding by allowing partial redundancy to cut expensive cross-group communication, then adds HO-Ring to improve inter-node collective efficiency. The key result is 1.19×–2.50× throughput gain over prior baselines plus 36.5% communication efficiency gain over standard Ring in their setting.


Read more »

Speculative Decoding — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: Fast Inference from Transformers via Speculative Decoding (ICML 2023 Oral)
ArXiv: https://arxiv.org/abs/2211.17192


TL;DR (1-minute version)

Speculative decoding is a way to make large language model (LLM) inference faster without changing the final output distribution of the target model. The key trick is simple but powerful: use a much smaller, faster draft model to “guess” several future tokens, then ask the large target model to verify those guesses in one parallel pass. If a guessed token is valid under a mathematically correct acceptance rule, keep it; if not, correct at the first mismatch and continue.
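The acceptance rule is compact enough to sketch in full. This follows the paper's speculative sampling scheme: accept the drafted token with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p − q), which provably yields samples distributed exactly as the target distribution p. The function name and list-based distributions are my illustration:

```python
import random

def verify_token(p, q, drafted, rng=random):
    """One verification step of speculative sampling.
    p, q: target/draft distributions over the vocabulary (sum to 1).
    drafted: token index sampled from q. Returns (token, accepted)."""
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted, True
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    r, acc = rng.random() * z, 0.0
    for tok, w in enumerate(residual):
        acc += w
        if r < acc:
            return tok, False
    return len(p) - 1, False   # numerical-edge fallback
```

Run many draft-then-verify steps and the empirical token frequencies converge to p, not q — that is the "no change to the output distribution" guarantee the abstract refers to.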

Read more »

InstructGPT (2203.02155) — Technical Review

TL;DR (1 minute): InstructGPT is the paper that turned “next-token prediction models” into “helpful assistant models” at scale. The core pipeline is simple but powerful: (1) supervised fine-tuning on human-written demonstrations, (2) reward-model training from pairwise preference data, and (3) PPO optimization against that reward while constraining drift from the base model via a KL penalty. I think this paper’s long-term contribution is not just better outputs, but a production training recipe that changed how almost all modern assistant models are built.

Estimated reading time: 45–60 minutes
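Step (3)'s reward shaping is simple enough to write down. A sketch of the KL-penalized reward fed to PPO — the score from the reward model minus a penalty for drifting from the SFT reference policy. The coefficient value is illustrative, and real implementations apply this per token over a sampled response:

```python
def ppo_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    """KL-shaped RL reward from the InstructGPT-style recipe:
    reward-model score minus beta * (log pi(y|x) - log pi_ref(y|x)).
    The penalty keeps the tuned policy close to the SFT base model."""
    return rm_score - beta * (logp_policy - logp_ref)
```

When the policy assigns the same log-probability as the reference, the penalty vanishes and the reward model fully drives the update; the larger the drift, the more of the reward it eats, which is what prevents reward hacking from collapsing the model's language distribution.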

Read more »

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation — In-Depth Technical Review (English)

Author: Zhongzhu Zhou
Paper: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (COLM 2024)
ArXiv: https://arxiv.org/abs/2308.08155


TL;DR: AutoGen is a practical framework for building multi-agent LLM systems by modeling each role (planner, coder, reviewer, tool user, human proxy) as a conversational agent. The key idea is not only “more agents,” but programmable interaction patterns that allow humans and tools to be inserted where reliability matters. It improves developer productivity and flexibility, but it does not magically solve verification, cost control, or loop safety.
Estimated reading time: 30–40 minutes

Read more »