1. Long-Horizon Coherence
Question: As rollout horizon grows, do predictions remain usable?
Signature failure: Compounding error. Small per-step deviations (error ε per step) accumulate to roughly εT total error after T steps, pushing trajectories into impossible regions.
How to measure:
- Plot task success rate vs. horizon
- Look for graceful degradation (success drops smoothly) vs. cliff (success drops suddenly)
- Example: Does a robot successfully grasp objects in 5-step rollouts? 10 steps? 50 steps?
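In code, this measurement is a simple loop over horizons. A minimal sketch, assuming a hypothetical `world_model.rollout` interface and a task-specific `success` predicate (neither is an API from the paper):

```python
def success_vs_horizon(world_model, env, policy, success, horizons, n_episodes=50):
    """Estimate task success rate as a function of rollout horizon.

    `world_model.rollout`, `env.reset`, and the `success` predicate are
    hypothetical interfaces; substitute your own model and task check.
    """
    rates = {}
    for h in horizons:
        wins = 0
        for _ in range(n_episodes):
            state = env.reset()
            trajectory = world_model.rollout(state, policy, steps=h)
            wins += int(success(trajectory))
        rates[h] = wins / n_episodes
    return rates

# A "cliff" shows up as a sharp drop between adjacent horizons,
# e.g. {5: 0.90, 10: 0.85, 50: 0.10}; graceful degradation decays smoothly.
```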
Diagnostic findings:
- Dreamer-based models typically remain coherent out to 50-100 steps for robotic manipulation
- Video generation models (Sora, Genie) struggle beyond 10-20 seconds (severe compounding error)
- Code reasoning (SWE-bench) requires coherence over hundreds of steps when fixing multi-file bugs
2. Intervention Sensitivity
Question: Does changing the action sequence produce meaningfully different trajectories?
Signature failure: Controllability failure. Model outputs the same trajectory regardless of action, making it useless for planning.
How to measure:
- Counterfactual divergence: From same initial state, execute two different action sequences; measure how much resulting trajectories differ
- Action sensitivity ratio: What fraction of action perturbations produce a detectable outcome change?
Example:
- In web automation: Inject a pop-up interrupt; does the agent replan or continue clicking blindly?
- In dialogue: Change one agent's opening move; does negotiation outcome shift?
- In robotics: Perturb object placement; does manipulation strategy adapt?
Current gap: Most benchmarks measure output quality (success rate, fidelity) but don't explicitly test action sensitivity. Closing this gap requires new evaluation protocols.
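As a starting point, both metrics defined above fit in a few lines. A sketch, assuming hypothetical `rollout` and `perturb` interfaces:

```python
import numpy as np

def counterfactual_divergence(world_model, s0, actions_a, actions_b):
    """Mean per-step distance between rollouts of two action sequences
    from the same initial state (hypothetical `rollout` interface)."""
    traj_a = world_model.rollout(s0, actions=actions_a)
    traj_b = world_model.rollout(s0, actions=actions_b)
    return float(np.mean([np.linalg.norm(a - b) for a, b in zip(traj_a, traj_b)]))

def action_sensitivity_ratio(world_model, s0, actions, perturb, n_trials=100, tau=1e-3):
    """Fraction of random action perturbations (via the user-supplied
    `perturb` function) whose rollouts diverge from the base rollout
    by more than the threshold `tau`."""
    hits = 0
    for _ in range(n_trials):
        if counterfactual_divergence(world_model, s0, actions, perturb(actions)) > tau:
            hits += 1
    return hits / n_trials

# A controllability failure appears as divergence near zero and a
# sensitivity ratio near zero: actions barely move the trajectory.
```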
3. Constraint Consistency
Question: Do rollouts satisfy the governing laws throughout the entire trajectory?
Why this matters: Violations are often invisible per-step but catastrophic for planning.
Examples:
- Physical: Object trajectories violate gravity or penetrate obstacles → imagined success is impossible
- Digital: Browser predicts page loads, but actual API contract would fail (type mismatch, null return)
- Social: Model predicts negotiation success assuming user is price-sensitive, but user is actually quality-frustrated → plan fails
- Scientific: Predicted phase doesn't satisfy thermodynamic stability constraints → synthesis fails
How to measure:
- Physics: Check penetration depth, energy conservation, support-relation consistency
- Code: Verify type-constraint satisfaction, API contract matching, exception handling
- Social: Detect norm violations, commitment consistency, Theory of Mind accuracy
- Science: Validate conservation law satisfaction, causal ordering, evidence-chain validity
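For the physics case, such checks reduce to per-step assertions over a rollout. A minimal sketch, where `energy_fn` and `signed_distance` stand in for domain-specific utilities not given in the paper:

```python
def energy_violations(trajectory, energy_fn, tol=1e-2):
    """Steps where total energy drifts beyond a relative tolerance of its
    initial value; `energy_fn` maps a state to a scalar energy."""
    e0 = energy_fn(trajectory[0])
    scale = max(abs(e0), 1e-8)
    return [t for t, s in enumerate(trajectory)
            if abs(energy_fn(s) - e0) > tol * scale]

def penetration_violations(trajectory, signed_distance):
    """Steps where the minimum signed distance between objects and
    obstacles goes negative, i.e. the rollout predicts interpenetration."""
    return [t for t, s in enumerate(trajectory) if signed_distance(s) < 0.0]
```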
Core Contribution 3: Unified Evaluation Framework
Beyond Prediction-Centric Evaluation
Traditional metrics focus on prediction accuracy: "Does the model predict the next frame well?"
But the paper argues this misses the point. A model with perfect next-frame prediction might fail at planning because:
- It doesn't compose coherently over many steps
- It's insensitive to action changes
- It violates domain constraints
The alternative: Decision-centric evaluation. Ask: "Does the model enable good decisions for downstream agents?"
The Minimal Reproducible Evaluation Package (MREP)
The paper proposes a lightweight evaluation protocol with three tiers:
Tier 1: Basic Capability Check
- Does the model make predictions at all?
- Does it respect the correct input/output shapes?
- Does it run without crashing?
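Tier 1 amounts to a smoke test. A sketch, assuming a hypothetical `predict(obs, action)` interface:

```python
import numpy as np

def tier1_capability_check(world_model, obs_shape, action_shape, horizon=5):
    """Tier 1 smoke test: well-shaped inputs in, well-shaped and finite
    predictions out, no crash over a short rollout. The `predict(obs,
    action)` interface is an assumption, not an API from the paper."""
    obs = np.zeros(obs_shape, dtype=np.float32)
    action = np.zeros(action_shape, dtype=np.float32)
    for _ in range(horizon):
        obs = world_model.predict(obs, action)
        assert obs.shape == obs_shape, f"bad output shape: {obs.shape}"
        assert np.all(np.isfinite(obs)), "non-finite prediction"
    return True
```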
Tier 2: Boundary Condition Verification
- Long-horizon coherence: Plot success vs. horizon curve
- Intervention sensitivity: Run action perturbation tests
- Constraint consistency: Check domain-specific violations
Tier 3: Decision-Centric Performance
- Can the model improve downstream agent performance?
- Does fine-tuning on agent-relevant regions help more than improving overall prediction accuracy?
- What's the sample efficiency gain from using the model vs. pure environment interaction?
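The sample-efficiency question in particular has a crisp operationalization. A sketch, assuming learning curves represented as (env_steps, success_rate) pairs:

```python
def sample_efficiency_gain(curve_with_model, curve_without, threshold):
    """Ratio of environment steps needed to reach a target success rate
    without vs. with the world model. Each curve is a list of
    (env_steps, success_rate) pairs sorted by env_steps."""
    def steps_to(curve):
        return next(steps for steps, rate in curve if rate >= threshold)
    return steps_to(curve_without) / steps_to(curve_with_model)

# Example: a gain of 4.0 means the model-based agent reached the
# threshold with 4x fewer environment interactions.
```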
Benchmark Coverage Gaps
The paper catalogs existing benchmarks and identifies major gaps:
Well-covered:
- Physical robotics (RoboCasa, ManiSkill3, MetaWorld)
- Some video generation (VBench for Sora)
- Code agents (SWE-bench)
- Embodied AI (Minecraft, Crafter)
Under-evaluated:
- Social simulation (only Sotopia; needs more domains)
- Scientific discovery (few benchmarks beyond climate/drug discovery)
- Cross-regime transfer (when does knowledge from one regime help in another?)
- Safety and calibration under distribution shift
Architecture and Implementation Guidance
Building Blocks Across Regimes
The paper identifies common architectural patterns:
State Representation:
- Bottleneck architectures (learned latent codes): Compress observations to low-dim codes, predict codes, decode back to observations
- Hierarchical representations: Different levels of abstraction for different time scales (immediate pixel changes vs. object trajectories vs. goals)
- Modular representations: Separate channels for position, velocity, appearance, lighting
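A minimal bottleneck skeleton in PyTorch, with illustrative layer sizes (not taken from the paper):

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Minimal bottleneck skeleton: encode the observation to a low-dim
    latent code, predict the next code conditioned on the action, and
    decode back to observation space."""

    def __init__(self, obs_dim, action_dim, latent_dim=32, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, latent_dim))
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, obs_dim))

    def forward(self, obs, action):
        z = self.encoder(obs)                                # compress
        z_next = self.dynamics(torch.cat([z, action], -1))   # predict in latent space
        return self.decoder(z_next)                          # reconstruct observation
```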
Dynamics Model:
- Autoregressive: Predict each future step conditioned on previous predictions (classic but suffers compounding error)
- Non-autoregressive: Predict full trajectory at once (faster but harder to condition on actions)
- Latent dynamics: Predict in learned latent space (can be more stable)
Action Conditioning:
- Concatenation: Append action to state before prediction
- Multiplicative gating: Learned interaction between state and action
- Hierarchical planning: Abstract high-level actions into low-level dynamics
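The first two conditioning schemes are easy to contrast in code. An illustrative PyTorch sketch (the gating variant is FiLM-style; the paper does not prescribe these exact forms):

```python
import torch
import torch.nn as nn

class ConcatConditioning(nn.Module):
    """Concatenation: the action is simply appended to the state features."""
    def __init__(self, state_dim, action_dim, out_dim):
        super().__init__()
        self.net = nn.Linear(state_dim + action_dim, out_dim)

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class GatedConditioning(nn.Module):
    """Multiplicative gating: the action produces a learned elementwise
    scale and shift of the state features (a FiLM-style interaction)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.scale = nn.Linear(action_dim, state_dim)
        self.shift = nn.Linear(action_dim, state_dim)

    def forward(self, state, action):
        return self.scale(action) * state + self.shift(action)
```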
Design Tradeoffs by Regime
Physical World
- Favor: Explicit physics priors (Lagrangian mechanics, contact constraints)
- Avoid: Pure learning from pixels (unless data is abundant); on its own it is insufficient for long-horizon planning
- Sweet spot: Hybrid—learn what physics doesn't capture (material properties, deformations) while enforcing conservation laws
Digital World
- Favor: Symbolic execution (compose known API behaviors); constraint solvers
- Avoid: Pure neural prediction (APIs are discrete and deterministic; neural models are brittle)
- Sweet spot: Neural models for understanding (parsing intent, inferring unobserved state) + symbolic engines for composition
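A toy version of the symbolic half of this hybrid: validate a neurally proposed API call against a known contract before composing it into a rollout. The `APISpec` schema here is a hypothetical stand-in:

```python
from dataclasses import dataclass

@dataclass
class APISpec:
    """Known, deterministic contract for one endpoint (hypothetical schema)."""
    name: str
    arg_types: dict   # e.g. {"user_id": int}
    returns: type

def validate_call(spec, proposed_args):
    """Symbolic half of the hybrid: reject neurally proposed calls that
    would violate the API contract before they enter the rollout."""
    errors = []
    for arg, expected in spec.arg_types.items():
        if arg not in proposed_args:
            errors.append(f"missing argument: {arg}")
        elif not isinstance(proposed_args[arg], expected):
            errors.append(f"{arg}: expected {expected.__name__}, "
                          f"got {type(proposed_args[arg]).__name__}")
    return errors

spec = APISpec("get_user", {"user_id": int}, dict)
assert validate_call(spec, {"user_id": "42"})  # type mismatch is caught
```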
Social World
- Favor: Language models for dialogue generation; explicit Theory of Mind models
- Avoid: Purely behavioral imitation (loses interpretability of agent models)
- Sweet spot: LLM-based rollout with learned social belief updating
Scientific World
- Favor: Physics-informed neural networks (PINN), operator learning (DeepONet), Bayesian surrogate models
- Avoid: Pure black-box learning (need interpretability and uncertainty quantification for hypothesis-driven experiments)
- Sweet spot: Surrogate models with uncertainty + active learning for new experiments
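The active-learning half of this sweet spot can be sketched with ensemble disagreement as the uncertainty signal (the surrogate ensemble and its `predict` interface are assumptions):

```python
import numpy as np

def next_experiment(candidates, surrogate_ensemble):
    """Pick the candidate input where an ensemble of surrogate models
    disagrees most; disagreement (std across members) is a cheap proxy
    for epistemic uncertainty."""
    preds = np.stack([m.predict(candidates) for m in surrogate_ensemble])
    uncertainty = preds.std(axis=0)            # per-candidate disagreement
    return candidates[int(np.argmax(uncertainty))]
```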
Failure Modes and Limitations
Beyond the boundary-condition failures (compounding error, controllability, constraint violation), the paper identifies broader challenges:
L1 Failures
- Mode averaging: Multiple plausible futures collapse into blurry average (partially addressed by VAEs, diffusion models)
- Stochasticity: True randomness hard to capture in deterministic neural models
- Long-tail events: Rare scenarios poorly represented in training data
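Mode averaging follows directly from the training objective. A tiny numpy demonstration:

```python
import numpy as np

# If the future is +1 or -1 with equal probability, the MSE-optimal
# deterministic prediction is their mean, 0.0: a "blurry" average that
# never actually occurs.
futures = np.random.choice([-1.0, 1.0], size=10_000)
mse = lambda pred: np.mean((futures - pred) ** 2)
print(mse(0.0), mse(1.0))  # ~1.0 vs ~2.0: averaging wins on MSE, loses on realism
```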
L2 Failures
- Distribution shift: Model works on training regime but fails on slight variations
- Exploitation: Agent finds "cheats" that work in simulation but violate constraints (e.g., walking through walls, using impossible API calls)
- Insufficient compositionality: Single predictors don't combine smoothly; joint training required
L3 Failures
- Attribution ambiguity: Which component of the model failed? (friction? contact model? object representation?)
- Overcorrection: Updating model to fix one failure case creates new failures elsewhere
- Feedback loops: If model guides agent exploration, data becomes biased; agent avoids regions model is uncertain about
State-of-the-Art Systems
By Application Domain
Robotics: MuZero → Dreamer → LEXA
- MuZero learns abstract dynamics for value estimation
- Dreamer adds visual fidelity + RL from imagination
- LEXA adds long-horizon exploration guided by learned models
Code/Web Agents: TextRL → SWE-agent → OAC
- Early: Script-based simulators (limited to Bash, Python)
- Current: LLM-based trajectory sampling (more general but less constraint-aware)
- Next: Hybrid symbolic + neural for constraint satisfaction
Video Generation: Variational Video Autoencoders → Video Diffusion → Sora/Genie
- VAV: Learned latent dynamics (precise but low fidelity)
- Diffusion: High fidelity but slower inference, less action-conditioned
- Sora: Multimodal training (video + text), 1-2 minute generation
Scientific Discovery: Traditional Bayesian optimization → Neural surrogates → Active learning loops
- Bayesian: Principled uncertainty, expensive
- Neural: Fast inference, calibration challenging
- Active learning: Combines both for sample efficiency
Open Problems and Research Directions
Fundamental Challenges
- Cross-regime transfer: Can a world model trained on one regime (e.g., physics) help in another (e.g., social)?
  - Tentative answer: Possibly, if the model learns hierarchical abstractions
- Constraint generalization: How do models learn that constraints hold across domains they haven't seen?
  - Challenge: Physics holds everywhere, but social norms don't; models need to recognize this
- Closed-loop L3 design: How do you design agents that safely revise their own models?
  - Requires: Interpretability, anomaly detection, version control for learned models, regression testing
- Scalability: Current video generation (Sora) works for ~1 min; can we scale to hours?
  - Bottleneck: Compounding error, compute scaling, attention mechanisms for long sequences
Architectural Directions
- Compositional learning: Can we build world models from modular pieces (object detectors, interaction rules) that compose reliably?
- Uncertainty quantification: Current models give point predictions; better uncertainty estimates could reduce exploration waste and enable better planning
- Adaptive latent spaces: Can models dynamically expand their state representation when encountering novel concepts?
- Neuro-symbolic integration: Deep learning for perception + symbolic reasoning for constraint satisfaction
Reproducibility and Implementation Notes
Data Requirements
- Physical: Video + action annotations (millions of frames)
  - Example: Robotic manipulation datasets (RoboNet: 15M+ video frames)
- Digital: Browser traces + API logs
  - Example: OSWorld (912 tasks), macOSWorld
- Social: Dialogue corpora + metadata (speaker relationships, outcomes)
  - Example: Sotopia scenarios
- Scientific: Experimental logs + measurements
  - Example: Benchmark datasets from literature
Typical Training Procedure
1. Collect trajectory data D = {(s_t, a_t, s_{t+1})}
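Given data in this format, a generic single-step training loop looks as follows; the MSE objective and hyperparameters are illustrative assumptions, not the paper's recipe:

```python
import torch
import torch.nn.functional as F

def train_dynamics(model, loader, epochs=10, lr=3e-4):
    """Fit the single-step dynamics s_{t+1} = f(s_t, a_t) by regression.
    `loader` yields (s, a, s_next) tensor batches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for s, a, s_next in loader:
            loss = F.mse_loss(model(s, a), s_next)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```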
Computational Cost
- Training L1: GPU-weeks for visual models (depends on data scale)
- Inference: Real-time for robotics (∼10 ms per step), interactive for code/web (hundreds of ms for multi-step reasoning)
- L3 updating: Continuous background process (efficient retraining on new examples)
Verdict and Impact
Strengths
- Conceptual unification: The levels × laws framework aligns fragmented communities
- Comprehensive scope: 400+ papers synthesized with clear organization
- Practical guidance: Implementation roadmaps for each regime
- Honest assessment: Open problems clearly stated; no false consensus
Limitations
- Framework maturity: L3 exists mostly in theory; few deployed systems
- Benchmark gaps: Evaluation infrastructure incomplete across regimes
- Generalization unclear: How do insights from robotics transfer to code? To science?
Who Should Read This?
- Researchers building world models (RL, vision, agents) → essential unification framework
- ML engineers deploying agentic systems → architectural guidance and failure mode catalogue
- Science administrators → roadmap for AI-driven discovery
- Policy makers → understanding agent capabilities and limitations
Future Impact
This paper may become the standard taxonomy for world models across AI—similar to how transformer papers unified NLP architectures. The levels × laws framework provides the conceptual foundation for:
- Comparing progress across domains
- Identifying and plugging research gaps
- Building safer, more interpretable agents that revise their own models
The move from L1 → L2 → L3 reflects an implicit progression: from passive prediction to active simulation to autonomous adaptation. L3 remains largely open; papers that crack reliable L3 systems (robotics with online model updating, AI-driven science with closed-loop discovery) will define the next era of agentic AI.
Key Takeaways
- World models are not one thing: The same term applies to different capabilities (L1/L2/L3) and constraints (physical/digital/social/scientific)
- Capability levels matter more than prediction accuracy: A model that perfectly predicts next frames but can't compose or respond to actions is useless for planning
- Domain laws are non-negotiable: Constraint violations (penetrations, type errors, norm breaches, causal inversions) make simulated plans unrealizable
- Evaluation must be decision-centric: Judge models by whether they improve downstream agent performance, not by prediction loss alone
- L3 is the frontier: Moving from L1/L2 (passive) to L3 (adaptive) requires solving interpretability, anomaly detection, and safe model revision; these are open challenges with major implications for AI safety
- Cross-regime insights exist: Robotics teaches us about compounding error; code teaches us about constraint checking; science teaches us about uncertainty quantification
Extended Resources
- Homepage: https://agentic-world-modeling.xyz
- GitHub: https://github.com/matrix-agent/awesome-agentic-world-modeling
- Citation: Chu et al., "Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond," arXiv:2604.22748, 2026