
MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

Review date: 2026-05-11
Review author: Zhongzhu Zhou
Paper reviewed: MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems
Paper authors: Zhexuan Wang, Xuebo Liu, Li Wang, Zifei Shan, Yutong Wang, Zhenxi Song, Min Zhang
arXiv: 2605.06623v1, 2026-05-07
Venue/status: ICML 2026
Source used for this review: src/related-documents/papers/2605.06623-MASPO.pdf


Short answer

MASPO studies a practical problem that appears once we stop treating a multi-agent LLM system as a hand-written prompt demo and start treating it as an object that should be optimized: how do we improve the prompts of several interacting agents at the same time? In a single-agent setting, prompt optimization can score one prompt against one final output. In a multi-agent system, the situation is messier. An upstream agent may produce an answer that looks locally reasonable, but that answer may mislead a downstream reflector, router, verifier, or code writer. The system fails even though one component appears to have done its own job.

The paper calls this failure mode local-global misalignment. MASPO is built around the idea that prompt optimization for multi-agent systems needs to judge an agent by three signals at once:

  1. whether the target agent's own output improves;
  2. whether immediate successor agents become more successful when they consume that output;
  3. whether the final system-level answer improves.

The method combines these signals into a joint reward model, mines the cases where local success still causes downstream failure, and uses those cases to guide an evolutionary beam search over role-specific prompts. It also refreshes beam scores when other agents change, because the context distribution seen by each agent shifts during optimization.

The main empirical claim is clear. On six math, reasoning, and code benchmarks, MASPO improves both sequential and hierarchical multi-agent systems over vanilla prompting, CoT, self-consistency, Self-Refine, AgentDropout, TPE-style search, and single-agent prompt optimization adapted to the multi-agent setting. The headline average accuracy is 70.39 for Sequential MAS + MASPO and 71.05 for Hierarchical MAS + MASPO. The paper reports an average accuracy improvement of 2.90 points over the best prompt-optimization baselines, and larger gains on difficult coordination-sensitive tasks such as GPQA.

My main takeaway is that MASPO is less about inventing a magical prompt and more about defining the right optimization unit. In a multi-agent workflow, a prompt is not just a string that makes one model answer better. It is part of an information pipeline. The useful question is therefore not only "did this agent answer well?" but also "did this agent produce the kind of intermediate state that helps the next agents succeed?"


1. Prerequisites

1.1 What an LLM-based multi-agent system is

An LLM-based multi-agent system is a workflow where multiple model calls are assigned different roles and connected through a communication structure. One agent may propose an answer, another may critique it, another may generate code, and another may verify the final output. The key point is that the agents are not independent. The output of one agent becomes part of the input context of another.

A simple two-agent example is:

    Question
       |
       v
    Solver agent: produces a candidate solution
       |
       v
    Reflector agent: checks the candidate and returns the final answer

A larger system may look like this:

            Planner
            /     \
           v       v
    Specialist A   Specialist B
            \       /
             v     v
            Verifier
               |
               v
         Final answer

This structure can help because different prompts can encourage different behaviors: planning, derivation, criticism, code synthesis, verification, or summarization. But it also creates a coordination problem. The final answer depends not only on each node's local skill, but also on whether the intermediate outputs are useful to the next nodes.

The MASPO paper formalizes a multi-agent system as a directed communication graph. Each node is an agent, each edge says whose output is passed to whom, and each agent has a role-specific prompt. The prompts are the parameters MASPO tries to optimize.

1.2 Why prompt optimization matters

Manual prompt writing can work for prototypes, but it does not scale gracefully. A prompt that sounds reasonable to a human may fail on edge cases, and a prompt that works for one task may be brittle on another. Prompt optimization tries to automate this process: generate candidate prompts, test them, keep the better ones, and iterate.

For a single model call, the objective is relatively easy to state:

    candidate prompt -> model answer -> score against target

The score can come from a ground-truth label, a unit test, a rubric, a human preference model, or an LLM judge. Methods such as TPE search, evolutionary prompt search, TextGrad-style textual feedback, and self-supervised prompt optimization all fit this broad pattern.
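To make the pattern concrete, here is a minimal sketch of that single-prompt loop. Everything named here (optimize_single_prompt, mutate, call_model, score) is a hypothetical stand-in for illustration, not an API from any of the cited methods:

    def optimize_single_prompt(seed_prompt, mutate, call_model, score, tasks, steps=10):
        """Hill-climb over prompt strings: propose a variation, score it on a
        task set, keep it if it beats the incumbent. Purely illustrative."""
        def avg_score(prompt):
            return sum(score(call_model(prompt, t), t) for t in tasks) / len(tasks)

        best, best_score = seed_prompt, avg_score(seed_prompt)
        for _ in range(steps):
            candidate = mutate(best)          # e.g., ask an LLM to rewrite the prompt
            s = avg_score(candidate)
            if s > best_score:                # greedy acceptance
                best, best_score = candidate, s
        return best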

Multi-agent prompt optimization is harder because a prompt may affect the final answer indirectly. Suppose a first agent writes a long explanation. Locally, the explanation may be correct. But if it is too verbose, ambiguous, or badly formatted, a downstream verifier may fail to extract the final answer. In that case, local scoring says "good", but system scoring says "bad". MASPO is motivated by precisely this gap.

1.3 Credit assignment in a multi-step reasoning chain

Credit assignment means deciding which component deserves credit or blame for a final outcome. In reinforcement learning, it appears when a reward arrives at the end of a long trajectory. In a multi-agent LLM workflow, the same issue appears when the final answer is wrong. Was the planner wrong? Did the solver compute incorrectly? Did the verifier over-correct a correct answer? Did a router send the problem to the wrong specialist?

A final-answer score alone is sparse. It says whether the system succeeded, but not why. A local score alone is narrow. It says whether one agent obeyed its role, but not whether that role output helped the collaboration. MASPO's joint evaluation tries to fill the middle by evaluating immediate downstream effects.

The paper's three-level reward can be read as:

    local validity      -> did this agent improve its own output?
    lookahead potential -> did the next agents do better with this output?
    global alignment    -> did the final system answer improve?

This decomposition is the conceptual core of the paper. It converts prompt optimization from a local string-search problem into a causal-chain optimization problem.

1.4 Topology and non-stationarity

The topology of a multi-agent system is the communication graph: which agent talks to which other agent. A sequential topology is a chain. A hierarchical topology may have multiple branches feeding into a final aggregator. Some systems also prune, route, or dynamically select agents.

When prompts are optimized one by one, the environment for each agent changes. If agent 1 is updated, the context seen by agent 2 changes. If agent 2 is updated, the value of agent 1's old output may also change. This is a form of non-stationarity: the scoring landscape for a prompt is not fixed while other prompts evolve.

That is why MASPO uses two scheduling ideas:

  • topological optimization, so agents are optimized in an order that respects causal dependencies;
  • beam refresh, so stored prompt candidates are re-scored when the surrounding agents have changed.

Without refresh, a candidate prompt may look good only because it was evaluated with old upstream outputs. After the rest of the system shifts, its old score can become stale.

1.5 Evolutionary beam search in plain terms

Beam search keeps several promising candidates instead of committing to one candidate at every step. Evolutionary search mutates or rewrites candidates to generate offspring. MASPO combines these ideas: for each target agent, it keeps a small beam of prompt candidates, generates variations using execution traces, scores the variations, and keeps the top candidates.

The search is not blind. Candidate prompts are generated from actual traces: the query, incoming context, and agent output. MASPO also injects hard cases from a misalignment buffer. These are examples where an agent appears locally successful but the system fails downstream. That makes the optimizer focus on the most informative failure cases instead of only reinforcing easy successes.


2. What this paper does

MASPO proposes a framework for joint prompt optimization in LLM-based multi-agent systems. The paper's starting point is that role-specific prompts determine not only what each agent says, but also how agents coordinate. Optimizing these prompts independently misses cross-agent dependencies.

The paper contrasts three approaches:

    Manual MAS design
      Human writes role prompts and topology.
      Good for prototypes, brittle across tasks.

    Single-agent prompt optimization adapted to MAS
      Optimize one prompt using local or final-output feedback.
      Easier, but weak at credit assignment.

    MASPO
      Optimize prompts with local, lookahead, and global signals.
      Mine local-global misalignment cases.
      Use adaptive beam search under multi-agent non-stationarity.

The main workflow, shown in Figure 1 of the paper, is:

    sample execution traces
      |
      v
    prompt optimizer proposes candidate prompts
      |
      v
    LLM evaluator compares candidate vs reference behavior
      |
      v
    joint reward = local + lookahead + global signals
      |
      v
    misalignment cases are stored and reused
      |
      v
    evolutionary beam search keeps stronger prompt candidates
      |
      v
    beam refresh re-scores candidates after other agents evolve

The important design choice is that MASPO evaluates an intermediate agent by its causal contribution to the collaboration. This is different from merely checking whether the agent followed its local instruction. For example, a reflector prompt may be locally good if it finds many possible mistakes, but globally bad if it over-corrects and destroys correct solutions. A code-review prompt may be locally detailed but globally bad if it returns unparseable code. MASPO tries to detect those gaps and optimize against them.

The paper's contribution can be summarized in three pieces:

  1. Multi-granularity joint evaluation. A prompt candidate is scored by local validity, lookahead potential, and global alignment.
  2. Misalignment-driven generative search. The optimizer is fed examples where local correctness does not translate into downstream success.
  3. Adaptive optimization dynamics. A coordinate-ascent-like topological schedule and beam refresh mechanism handle the fact that other agents keep changing.

I like the framing because it matches how real agent workflows fail. In production, the painful failures are often not "agent A was obviously wrong". They are "agent A produced a plausible intermediate artifact that caused agent B to go in the wrong direction". MASPO makes that failure mode first-class.


3. Method details

3.1 Formalizing the multi-agent system

The paper models the multi-agent system as a directed graph $G=(V,E)$. Each node $v_i$ is an agent. Each directed edge $(v_j, v_i)$ means that the output of agent $v_j$ is used as input context for agent $v_i$.

Each agent has:

  • an LLM-based inference function $f_i$;
  • a role-specific prompt $p_i$;
  • incoming context $C_i$ from predecessor agents;
  • an output $o_i$.

The generation process is written as:

$$o_i = f_i(p_i, q, C_i), \qquad C_i = \bigoplus_{v_j \in N_{in}(v_i)} o_j.$$

In words: the agent receives the original query plus the concatenated outputs of its predecessors, then produces its own output. The full prompt configuration is $P = \{p_i\}_{i=1}^{N}$. The final system output is a composite function of the graph, prompts, and query:

$$o_{glob} = \Phi(G, P, q).$$

The objective is to find prompts that maximize expected final performance. But directly optimizing over all natural-language prompt strings for all agents is combinatorial, non-differentiable, and non-stationary. MASPO addresses this by optimizing agents in a topological schedule and using LLM-based comparative evaluations as a proxy reward.
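A minimal sketch may help fix the formalism. Assuming a hypothetical call_llm(prompt, query, context) stand-in for each agent's inference function $f_i$, executing the graph is just a loop over agents in topological order:

    from dataclasses import dataclass

    @dataclass
    class Agent:
        name: str
        prompt: str  # role-specific prompt p_i: the parameter MASPO optimizes

    def run_mas(agents, predecessors, query, call_llm):
        """Run a DAG of agents; `agents` must be listed in topological order.
        `predecessors` maps agent name -> list of upstream agent names.
        `call_llm(prompt, query, context)` stands in for f_i. Illustrative only."""
        outputs = {}
        for agent in agents:
            # C_i: concatenation of predecessor outputs, per the paper's formalism
            context = "\n".join(outputs[p] for p in predecessors.get(agent.name, []))
            outputs[agent.name] = call_llm(agent.prompt, query, context)
        return outputs[agents[-1].name]  # o_glob from the sink node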

3.2 Topological context and trace-guided proposal

MASPO does not try to fully optimize all prompts simultaneously. Instead, it uses a coordinate-ascent-style schedule. It visits agents in topological order, optimizes one target agent for a limited number of generations, freezes it, moves to successor agents, and repeats this for several rounds.

The paper emphasizes that this is not ordinary one-pass sequential tuning. A single pass would overfit upstream agents to the initial behavior of downstream agents. MASPO revisits agents for multiple rounds so the prompts can co-adapt.

For a target agent, the optimizer collects execution traces:

    trace = (query q, incoming context C, target-agent output o)

These traces are used as few-shot evidence for the optimizer model. The optimizer model is asked to inspect how the current prompt behaves and propose candidate prompt variants. This is more grounded than random prompt mutation because the proposed changes are tied to concrete successes and failures.

The candidate generation step can be read as:

    parent prompt + sampled traces -> optimizer model -> candidate prompts

The paper uses Gemini-2.5-Pro as the optimizer model in the main setup. Candidate generation uses temperature 0.7 to encourage diversity, while evaluation uses temperature 0 for more stable judgments.

3.3 Multi-granularity joint reward

The joint reward is the core technical component. For a candidate prompt pp' compared against a reference prompt pp, MASPO evaluates the target agent on a batch of examples and combines three preference indicators.

A simplified version of the reward is:

$$R = \alpha \cdot R_{local} + \beta \cdot R_{lookahead} + \theta \cdot R_{global}.$$

The three terms mean:

  • Local Validity. Does the target agent's own output improve under the candidate prompt?
  • Lookahead Potential. Do immediate successor agents produce better outputs when fed the target agent's new output?
  • Global Alignment. Does the final system response improve?

The full paper writes this at the sample level using preference comparisons such as $o'_i \succ o_i$. These comparisons are produced by an evaluator model. The evaluator does not need a ground-truth label for every intermediate output. Instead, it compares whether one output is preferable to another under a role-specific rubric.

This design is useful because intermediate agents often do not have direct labels. A planner's output, a reflector's critique, or a summarizer's context package may not have a canonical target answer. But we can still ask whether the downstream agents did better after receiving it.

A concrete example helps:

    Target agent:  code planner
    Local output:  a plan for solving a Python problem
    Successor:     code writer
    Final system:  executable function

A locally nice plan may be too abstract to help the code writer. Local validity alone may reward it. Lookahead potential asks whether the code writer actually produces better code with that plan. Global alignment asks whether the final code passes the benchmark.

That middle signal is MASPO's most important addition.
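The joint reward itself is easy to sketch. Assuming the evaluator's per-sample preference verdicts have already been collected as binary flags (the dict layout below is my own, not the paper's), the batch-level reward is a weighted average:

    def joint_reward(batch, alpha=0.4, beta=0.4, theta=0.2):
        """Combine the three preference signals over a batch of evaluated samples.
        Each item carries binary indicators from an LLM evaluator:
          local     = candidate output preferred over reference (o'_i > o_i)
          lookahead = successor outputs preferred under the candidate's output
          global_ok = final system answer preferred
        Weights follow the paper's reported defaults; the schema is illustrative."""
        n = len(batch)
        r_local = sum(x["local"] for x in batch) / n
        r_look  = sum(x["lookahead"] for x in batch) / n
        r_glob  = sum(x["global_ok"] for x in batch) / n
        return alpha * r_local + beta * r_look + theta * r_glob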

3.4 Mining local-global misalignment cases

MASPO defines a misalignment case as a case where the target agent improves locally but fails to help the downstream or final outcome. In plain language:

    agent output looks better by its own role standard,
    but successor agents or the final system do not improve

The paper's condition is roughly:

$$\text{Local} = 1 \quad \text{and} \quad (\text{Lookahead} = 0 \;\text{or}\; \text{Global} = 0).$$

These cases are stored in a buffer $B_{mis}$. During later candidate generation, the optimizer samples from this buffer and sees the hard negatives. This matters because successful traces often have low diagnostic value. They show what worked, but not where coordination broke. Misalignment cases expose the exact gap between "locally valid" and "system useful".

This is one of the more practical ideas in the paper. Many LLM-agent failures are not random; they are repeated interface failures between roles. For example:

  • the planner omits constraints that the solver later needs;
  • the solver writes an answer but does not mark the final value in a parseable format;
  • the reflector criticizes too aggressively and changes a correct answer;
  • the code agent includes comments or imports when the benchmark expects a bare function;
  • the verifier focuses on surface formatting rather than logical correctness.

A misalignment buffer gives the optimizer a memory of these interface failures.
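A sketch of the mining step, continuing the same hypothetical per-sample flags used in the reward sketch above:

    def mine_misalignment(batch, buffer, cap=100):
        """Collect local-global misalignment cases: samples whose output is
        locally preferred but fails to help successors or the final answer.
        Buffer layout and cap are my own choices, not the paper's."""
        for x in batch:
            if x["local"] == 1 and (x["lookahead"] == 0 or x["global_ok"] == 0):
                buffer.append(x)      # hard negative, reused in later proposals
        del buffer[:-cap]             # keep only the most recent cases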

3.5 Evolutionary beam search over prompt candidates

For each target agent, MASPO maintains a beam of top prompt candidates. Each parent prompt generates candidate offspring. Each candidate is evaluated by the joint reward, and the candidate's cumulative score is updated:

$$J(p') = R(p', p_{parent}; B) + J(p_{parent}).$$

The beam keeps the top candidates. This gives the search two benefits:

  1. It avoids committing too early to one prompt trajectory.
  2. It accumulates evidence across multiple sampled batches, reducing noise from one batch.

The paper uses a small beam width $K=2$ and generates $K_{sub}=2$ candidate variations for each parent. This is intentionally modest. The goal is not brute-force enumeration. It is to make each candidate generation informed by traces and each evaluation informed by the multi-agent reward.
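A sketch of one beam step under these settings, with propose and reward as hypothetical stand-ins for the trace-guided optimizer and the joint evaluator:

    def beam_step(beam, propose, reward, K=2, K_sub=2):
        """One evolutionary beam step. `beam` is a list of (prompt, J) pairs;
        each parent spawns K_sub candidates, each candidate inherits its
        parent's cumulative score J, and the top-K by J survive. Illustrative."""
        children = []
        for parent, j_parent in beam:
            for _ in range(K_sub):
                cand = propose(parent)  # trace-guided variation from the optimizer model
                # J(p') = R(p', parent) + J(parent): evidence accumulates across batches
                children.append((cand, reward(cand, parent) + j_parent))
        return sorted(beam + children, key=lambda x: x[1], reverse=True)[:K]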

3.6 Beam refresh for non-stationary contexts

Beam refresh handles a subtle but important problem. Suppose agent 2 has a prompt candidate that scored well when agent 1 used an old prompt. Later, agent 1 is improved. Now agent 2 receives a different context distribution. The old score for agent 2's candidate may no longer be valid.

MASPO refreshes the beam when an agent is revisited. It re-evaluates stored candidates against the current global best prompt under current contexts and centers the score around a baseline:

$$J_{new}(p) = R(p, p_{best}; B) - 0.5.$$

The exact centering is less important than the idea: old scores should not be blindly trusted after peer agents evolve. The ablation study supports this. Removing beam refresh reduces the Sequential MAS average from 70.39 to 68.53.
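A sketch of the refresh itself, continuing the same hypothetical reward interface; subtracting 0.5 centers the score because, for a pairwise preference reward in [0, 1], 0.5 reads as a tie against the current best:

    def refresh_beam(beam, p_best, reward):
        """Discard stale scores: re-evaluate every stored candidate against the
        current global best prompt under the current peer contexts. Illustrative."""
        return [(p, reward(p, p_best) - 0.5) for p, _ in beam]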

Figure 1 in the paper is helpful because it shows all of these pieces together: traces feed the optimizer, candidates feed the evaluator, the evaluator produces local/lookahead/global signals, misalignment cases are mined, and beam search chooses the next prompts.


4. Experiment setup

4.1 Models

The main experiments use Qwen3-8B as the backbone model for the agents. The paper states that the agents are configured in standard non-thinking inference mode to avoid confounding gains from intrinsic reasoning enhancements.

The optimizer and evaluator modules use Gemini-2.5-Pro in the main setup:

  • optimizer model temperature: 0.7;
  • evaluator model temperature: 0;
  • agent inference temperature: 0.

The authors also test a self-optimized setting where Qwen3-8B itself is used as optimizer and evaluator. This is important because Gemini-powered optimization may be expensive or unavailable in some settings.

4.2 Benchmarks

The evaluation covers six datasets across three categories:

Mathematical proficiency

  • MATH-500
  • AGIEval-MATH Level-5 subset
  • AQuA

Complex reasoning

  • GPQA-Diamond

Code generation

  • MBPP
  • HumanEval-ET

This mix is reasonable for the paper's goal. Math and GPQA stress multi-step reasoning; MBPP and HumanEval-ET stress code generation and formatting; all three categories are places where multi-agent solver-reflector workflows are common.

4.3 Baselines

The paper compares against both single-agent and multi-agent baselines.

Single-agent or non-optimized baselines include:

  • Vanilla prompting;
  • Chain-of-Thought;
  • self-consistency over CoT;
  • Self-Refine;
  • AgentDropout.

Multi-agent baselines include:

  • Sequential MAS;
  • Hierarchical MAS;
  • TPE-based optimization adapted from MIPRO/MASS-style search;
  • SPO, a self-supervised prompt optimizer adapted to the multi-agent setting.

This baseline set is useful because it separates several questions:

    Does MAS help over single-agent prompting?
    Does prompt optimization help over manual MAS prompts?
    Does MASPO beat generic prompt optimizers?
    Does MASPO work across different MAS topologies?

4.4 MASPO hyperparameters

The default MASPO setup uses:

  • sample pool size $|D|=50$;
  • mini-batch size $|B|=10$;
  • beam width $K=2$;
  • two candidate variations per parent prompt ($K_{sub}=2$);
  • maximum retrieved misalignment cases $K_{mis}=3$;
  • joint reward weights $\alpha=0.4$, $\beta=0.4$, $\theta=0.2$;
  • topological step size $T=3$;
  • number of optimization rounds: 3.

These values matter because the paper's practical pitch is that MASPO can work with a small unlabeled sample pool. It is not claiming to train a new model or collect large human preference datasets. It is optimizing prompt strings using a small evaluation set and LLM-based comparative judgment.
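For reference, the defaults collect naturally into a small config object (field names are mine; the values are the paper's reported settings):

    from dataclasses import dataclass

    @dataclass
    class MASPOConfig:
        pool_size: int = 50     # |D|, unlabeled sample pool
        batch_size: int = 10    # |B|, mini-batch per evaluation
        beam_width: int = 2     # K
        k_sub: int = 2          # candidate variations per parent prompt
        k_mis: int = 3          # misalignment cases retrieved per proposal
        alpha: float = 0.4      # local validity weight
        beta: float = 0.4       # lookahead potential weight
        theta: float = 0.2      # global alignment weight
        topo_step: int = 3      # T, generations per agent visit
        rounds: int = 3         # full passes over the agent graph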


5. Results and analysis

5.1 Table 1: main performance comparison

Table 1 is the main result table. The most important rows are:

    Sequential MAS baseline average:      65.31
    Sequential MAS + TPE average:         66.49
    Sequential MAS + SPO average:         66.56
    Sequential MAS + MASPO average:       70.39

    Hierarchical MAS baseline average:    68.32
    Hierarchical MAS + TPE average:       68.47
    Hierarchical MAS + SPO average:       69.01
    Hierarchical MAS + MASPO average:     71.05

This shows two things.

First, generic prompt optimization is not enough. TPE and SPO provide only modest gains in the multi-agent setting. They do not directly model the local-global credit assignment issue.

Second, MASPO improves both tested topologies. This is important because the method is not tied to a single chain architecture. The paper reports that MASPO improves Sequential MAS average accuracy by 5.06 points over its baseline and Hierarchical MAS by 2.73 points over its baseline.

The largest visible gains appear on coordination-sensitive tasks. For Sequential MAS, GPQA rises from 47.73 to 58.08 with MASPO. HumanEval-ET rises from 68.90 to 73.78. These are exactly the kinds of tasks where better intermediate critique, planning, and formatting can change final outcomes.

5.2 Why the GPQA result is especially interesting

GPQA-Diamond is a difficult reasoning benchmark. It is not just a formatting task. A system often needs to maintain a chain of domain-specific reasoning and avoid being misled by plausible distractors.

In Table 1, Sequential MAS + SPO reaches 49.52 on GPQA, while Sequential MAS + MASPO reaches 58.08. That gap is large relative to the other columns. My interpretation is that GPQA exposes the failure of local prompt optimization. A reflector can look locally articulate while pushing the system toward a wrong option. A planner can generate an apparently detailed path that hides a flawed premise. MASPO's lookahead and global signals are more likely to penalize those failures.

5.3 Table 2: ablation study

Table 2 is the most useful table for understanding why the method works. The full Sequential MAS + MASPO average is 70.39. Several ablations reduce performance:

    Serial Search:                  68.10
    Single Cycle:                   68.19
    w/o Beam Refresh:               68.53
    w/o Joint Evaluate:             67.77
    w/o Misalignment Sampling:      69.68
    w/ Success-Case Sampling:       69.29
    Self-Optimized with Qwen3-8B:   67.70

The biggest drop comes from removing joint evaluation. That supports the paper's main thesis: the method needs multi-granularity scoring, not only better search. Beam refresh and multi-round scheduling also matter, which supports the non-stationarity argument.

The success-case sampling result is also revealing. Sampling successful traces performs worse than sampling misalignment cases. This makes intuitive sense. Successful examples can reinforce what the system already does well. Misalignment examples show where the interface between agents is broken.

5.4 Figure 2 and Table 4: local and lookahead signals matter more than a sparse final signal

Figure 2 visualizes the performance landscape for joint reward weights. Table 4 gives detailed numbers. The best reported configuration is:

    local validity      alpha = 0.4
    lookahead potential beta  = 0.4
    global alignment    theta = 0.2
    average accuracy    = 70.39

The interesting lesson is that the final system signal is not useless, but it is not enough. The global signal is sparse and noisy for intermediate agents. Local and lookahead signals provide denser feedback. In other words, MASPO needs to know both whether the target agent did its own job and whether it helped its immediate successors.

I read this as a good empirical validation of the paper's credit-assignment story. If global outcome alone were sufficient, configurations with heavy global weight would dominate. They do not.

5.5 Figure 3: misalignment rate falls during optimization

Figure 3 tracks the misalignment rate across optimization depth for math, reasoning, and code. MASPO reduces the misalignment rate more consistently than variants without the full mechanism.

This is important because it shows a mechanism-level effect, not only a final benchmark number. The method is not merely finding prompts that happen to score higher. It is reducing the specific failure pattern it was designed to target: cases where local success fails to translate into downstream utility.

5.6 Table 3: optimized prompts transfer across stronger models

Table 3 tests cross-model transfer. Prompts optimized on Qwen3-8B are applied to stronger model backbones without further fine-tuning:

    DeepSeek-V3 MAS:       71.79 -> 75.86
    GLM-4.6 MAS:           75.61 -> 78.41
    Claude-Sonnet-4 MAS:   77.58 -> 79.73
    Gemini-2.5-Pro MAS:    84.93 -> 87.14

This is one of the more practically valuable findings. It suggests that MASPO is not only learning quirks of Qwen3-8B. It may be learning better interaction protocols: how solvers should explain, how reflectors should verify, how code agents should format, and how final answers should be surfaced.

For deployment, this hints at a cost-saving workflow:

    optimize prompts with a cheaper model,
    then reuse the optimized prompts with stronger serving models

Of course, the paper does not prove this will hold for every workflow, but the result is encouraging.

5.7 Table 5: MASPO complements topology optimization

Table 5 applies MASPO on top of AgentDropout, a method that optimizes the communication topology by pruning agent interactions. AgentDropout alone averages 66.89. AgentDropout + MASPO averages 70.46.

This supports a useful distinction:

  • topology optimization decides who communicates with whom;
  • prompt optimization decides what each agent should do and how it should communicate.

Those are orthogonal dimensions. A better graph can still have poor role prompts. Better prompts can still help within a pruned graph.

5.8 Table 6: sample pool size saturates around 50

Table 6 varies the sample pool size:

    30 samples: 69.13 average
    50 samples: 70.39 average
    70 samples: 70.43 average

The improvement from 50 to 70 is tiny. The authors use this to justify the default pool size of 50. This is a practical result because it keeps optimization data requirements modest. The real cost is not labeling many examples; it is running repeated LLM evaluations during search.


6. What I think the paper gets right

6.1 It targets the right failure mode

The most convincing part of MASPO is the problem framing. Multi-agent systems fail through interfaces. One agent's output is another agent's context, and the interface between them can be fragile. A prompt optimizer that ignores that interface is likely to over-optimize local behavior.

MASPO's lookahead potential is a simple but effective way to turn the interface into an evaluated object. This is a good design principle beyond this paper. When optimizing agent workflows, we should score intermediate artifacts by how useful they are to their consumers, not only by how well they satisfy a local rubric.

6.2 It avoids relying entirely on final labels

The method does use final outcomes, but it does not require dense ground truth for every intermediate output. That makes it usable for real workflows where intermediate labels are unavailable. In many agent systems, there is no gold label for "the ideal planner note" or "the ideal reflection". There may only be a final task success signal.

MASPO handles this by comparing candidate and reference outputs through an evaluator model. This is not perfect, but it is realistic.

6.3 It treats prompt search as a dynamic process

The beam refresh mechanism is a small but important detail. In many optimization papers, the environment is assumed fixed. In multi-agent prompt optimization, the environment for each prompt changes as other prompts improve. Re-scoring old candidates under current peer prompts is the right instinct.

6.4 It provides evidence beyond one benchmark family

The paper evaluates math, scientific reasoning, and code generation. This makes the claims stronger than a single-domain demonstration. The transfer experiments and AgentDropout integration also help show that the method is not purely topology-specific.


7. Limitations and boundary conditions

7.1 LLM-as-evaluator reliability remains a key assumption

MASPO relies on an evaluator model to compare local outputs, lookahead outputs, and final outputs. If the evaluator is biased, inconsistent, or unable to judge a domain, the reward becomes noisy. The paper uses Gemini-2.5-Pro for the main setup, which is strong, but that also means the optimization pipeline inherits Gemini's judgment quality and cost.

The self-optimized Qwen3-8B variant still improves over baselines, but its average drops to 67.70, far below the full 70.39. This suggests the method is robust enough to work with a weaker evaluator, but the evaluator quality still matters.

7.2 The cost model is not fully developed

The paper discusses small sample pools and limited batches, but it does not provide a detailed wall-clock or API-cost model for the full optimization process. In practice, running an optimizer model and evaluator model across multiple agents, beams, candidate prompts, batches, and rounds can be expensive.

A production user would need to know:

  • total optimizer calls;
  • total evaluator calls;
  • average tokens per trace;
  • wall-clock time per benchmark;
  • sensitivity to number of agents;
  • whether optimized prompts amortize the cost over enough future traffic.

The method is most attractive when prompts are reused many times after optimization.

7.3 The evaluated topologies are still relatively structured

Sequential and hierarchical MAS are important, but many real agent systems include dynamic routing, tool calls, retrieval, memory writes, external APIs, and conditional branches. MASPO's graph-based formalism can describe many DAG-like structures, but practical workflows may be messier.

A useful future test would apply MASPO to an agentic software engineering system with tool execution, repository exploration, patch generation, and test feedback. The paper's benchmark choices include code generation, but not full long-horizon software engineering with tools.

7.4 Prompt optimization may overfit to benchmark interfaces

Some optimized prompts in Appendix E include very specific formatting rules, such as answer tags or no type hints. These rules are valuable for benchmarks, but a practitioner should distinguish two kinds of improvement:

    real reasoning improvement
    benchmark-interface improvement

Both can matter. If a system fails because the final answer is not parseable, fixing the format is legitimate. But if the goal is general reasoning quality, we should be careful not to confuse benchmark compliance with deeper capability.

7.5 Misalignment mining depends on the definition of local success

A misalignment case is only as good as the local evaluator. If local validity is judged incorrectly, the buffer may collect the wrong failures. For example, an output may look locally valid but contain a subtle false premise. The system failure would then be labeled as local-global misalignment, when the true issue was local correctness.

This does not break the method, but it means the evaluator prompts and rubrics are central engineering artifacts.

7.6 Safety and adversarial prompting are not deeply studied

Optimized prompts may become more forceful, more rigid, or more benchmark-specific. In open-ended agent systems, that could create safety issues: overconfident verification, hidden prompt injection sensitivity, or agents that ignore uncertainty. The paper's impact statement says no unique ethical issue is foreseen, but deployment in tool-using agents would still require safety evaluation.


8. Reproducibility and practical notes

8.1 Code availability

The paper says code is released at:

    https://github.com/wangzx1219/MASPO

That is a positive sign for reproducibility. The paper also provides optimizer prompts, evaluator prompts, the MASPO algorithm, and examples of optimized prompts in the appendices.

8.2 What is needed to reproduce the paper

To reproduce the main results, one would need:

  • the MASPO codebase;
  • benchmark loaders for MATH-500, AGIEval-MATH, AQuA, GPQA-Diamond, MBPP, and HumanEval-ET;
  • Qwen3-8B inference in non-thinking mode;
  • access to Gemini-2.5-Pro or a replacement optimizer/evaluator;
  • exact initial prompts and topology definitions from Appendix D;
  • evaluator prompts from Appendix B;
  • optimizer prompts from Appendix A;
  • enough budget for repeated candidate generation and evaluation.

The paper gives many of these details. The largest reproducibility risk is likely not the algorithm itself, but differences in external LLM APIs, model versions, evaluator behavior, and benchmark parsing.

8.3 How I would try MASPO in a real workflow

If I were applying MASPO to a practical agent system, I would start with a small workflow and a small validation set:

    1. Define the agent graph explicitly.
    2. Write clear role prompts for each agent.
    3. Collect 30-50 representative tasks.
    4. Log full traces: query, context, each agent output, final result
       (a logging sketch follows this list).
    5. Define local rubrics for each role.
    6. Define final success metrics.
    7. Run MASPO for a few rounds.
    8. Inspect optimized prompts manually before deployment.
    9. A/B test against the old prompts.
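For step 4, a minimal trace logger is easy to sketch; the JSON-lines schema below is my own, not anything prescribed by the paper:

    import json
    import time

    def log_trace(path, query, agent_outputs, final_answer, success):
        """Append one execution trace as a JSON line: enough to replay the run
        and later score intermediate artifacts against downstream outcomes."""
        record = {
            "ts": time.time(),
            "query": query,
            "agent_outputs": agent_outputs,   # {agent_name: output string}
            "final_answer": final_answer,
            "success": success,               # final task-level verdict
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")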

I would not blindly deploy optimized prompts. Prompt optimizers can discover brittle shortcuts. Human inspection is especially important for tool-using agents, code agents, or workflows with user data.

8.4 Practical design lessons

The paper suggests several reusable design lessons:

  • Score intermediate artifacts by downstream usefulness.
  • Store hard coordination failures, not only final failures.
  • Re-evaluate prompt candidates after other workflow components change.
  • Treat format requirements as part of the task, not as afterthoughts.
  • Optimize prompts and topology as separate but compatible dimensions.
  • Use smaller models for optimization when the optimized prompts transfer to larger models.

These lessons are likely useful even if one does not implement MASPO exactly.


9. Final assessment

MASPO is a strong paper because it identifies a real gap in current agent engineering. Many multi-agent systems are built by hand: pick roles, write prompts, connect them, test a few examples, and hope the collaboration generalizes. MASPO argues that this process can be optimized, but only if we evaluate prompts at the level of the collaboration rather than at the level of isolated strings.

The best idea is the joint reward, especially the lookahead term. It gives the optimizer a way to ask whether an upstream agent's output is useful to downstream agents. The misalignment buffer then turns repeated interface failures into training signal for prompt search. Beam refresh addresses the dynamic nature of co-adapting agents.

The paper is not the final answer to agent optimization. It still depends heavily on LLM evaluators, and its cost/robustness story needs more production-scale evidence. But it is a meaningful step from manual prompt craft toward systematic multi-agent workflow optimization.

If I had to summarize the paper in one sentence:

MASPO teaches prompt optimization to care not only about whether an agent sounds right, but whether its output makes the rest of the agent system work better.

References

  1. Zhexuan Wang et al. MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems. ICML 2026, arXiv:2605.06623.
  2. Opsahl-Ong et al. Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. EMNLP 2024.
  3. Xiang et al. Self-Supervised Prompt Optimization. EMNLP 2025.
  4. Wang et al. AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration. ACL 2025.
  5. Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.

Review written on 2026-05-11.