
A reading note on Tutti: starting from a GPU-native KV cache object store, GPU io_uring, and slack-aware scheduling, it makes SSD-backed KV cache better suited to long-context LLM serving.
Read more »

A detailed technical review of Swift-SVD, an activation-aware low-rank compression method for LLM weights and KV cache that uses output covariance eigendecomposition to avoid expensive generalized SVD.
Read more »

A detailed technical review of Piper, a resource-model-driven system for large-scale MoE training with pipelined hybrid parallelism, HALO hierarchical all-to-all, and topology-aware expert placement.
Read more »

A detailed technical review of NExt, a method that models low-rank optimization trajectories to accelerate reinforcement learning with verifiable rewards for large language models.
Read more »

A detailed technical review of FEPLB, a system that uses Hopper NVLink Copy Engines to perform fine-grained MoE load balancing with little interference to normal expert-parallel training.
Read more »

Author: Zhongzhu Zhou
Paper: Chu et al., 2026. arXiv:2604.22748 [cs.AI]
Date: April 27, 2026
Direction: Monday, April 27 — Agent/LLM Quality Generation
Pages: 10


Executive Summary

As AI systems evolve from text generators to goal-achieving agents that interact with complex environments, predicting environment dynamics has become the central bottleneck. This comprehensive survey paper provides a unified framework for understanding world models—internal representations that agents use to anticipate consequences of their actions and plan accordingly.

The paper introduces an elegant "levels × laws" taxonomy:

  • Three capability levels (L1 Predictor → L2 Simulator → L3 Evolver) define what a world model can do
  • Four governing-law regimes (physical, digital, social, scientific) define the constraints it must satisfy

By synthesizing over 400 papers across model-based RL, video generation, web/GUI agents, multi-agent simulation, and AI-driven science, the authors reveal a fragmented landscape where "world model" means different things to different communities. Their framework provides the common language needed to align these communities.


Read more »

1. Executive Summary

OGER (Offline-Guided Exploration Reward) introduces a novel framework for enhancing Large Language Model (LLM) reasoning by seamlessly integrating offline teacher trajectories with online reinforcement learning. The key innovation lies in positioning offline data as a semantic reference point for computing auxiliary exploration rewards, rather than treating it as additional training samples.

The framework addresses critical limitations in current RLVR (Reinforcement Learning with Verifiable Rewards) approaches: the "echo chamber" effect where models converge to dominant pre-existing distributions, and entropy collapse that prevents novel solution discovery. By computing divergence-based exploration rewards and refining them through entropy-aware modulation, OGER achieves 4-7.9% improvements across mathematical and general reasoning benchmarks.
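The mechanism described above can be sketched in a few lines. This is a hypothetical illustration, not OGER's actual implementation: the function names, the KL-style divergence, and the entropy gate (`entropy_floor`, `beta`) are all assumptions made for clarity, showing how an offline teacher trajectory can serve as a reference for an auxiliary reward that is damped as policy entropy collapses.

```python
import math

def exploration_reward(online_logprobs, teacher_logprobs, entropy,
                       entropy_floor=0.5, beta=0.1):
    """Hypothetical sketch of a divergence-based exploration reward.

    Rewards trajectories whose token distribution diverges from the
    offline teacher reference, and suppresses the bonus when policy
    entropy is low (entropy-aware modulation).
    """
    # KL-like divergence from the teacher reference:
    # sum_i p_i * (log p_i - log q_i), with p from the online policy.
    kl = sum(math.exp(lp) * (lp - tp)
             for lp, tp in zip(online_logprobs, teacher_logprobs))
    # Entropy gate: near entropy collapse the bonus goes to zero.
    gate = min(1.0, entropy / entropy_floor)
    return beta * kl * gate
```

The gate is one plausible way to realize "entropy-aware modulation": with healthy entropy the divergence bonus passes through unchanged, while a collapsing policy receives no incentive to drift further from the teacher.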


Read more »

1. What This Paper Does

Core Problem

The edge of stability phenomenon, discovered by Cohen et al. (2021), presents a theoretical puzzle: when training with sufficiently large learning rates η, the largest Hessian eigenvalue λ₁ frequently exceeds the stability threshold 2/η, implying the system should diverge according to classical optimization theory. Yet empirically:

  • Training loss continues to decrease
  • Model generalization often improves in this regime
  • The optimizer doesn't settle at a point but explores a bounded, chaotic set
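The classical threshold invoked above is easy to verify on a toy problem. The sketch below (my own minimal example, not from the paper) runs gradient descent on a one-dimensional quadratic whose Hessian eigenvalue is λ: the iterate contracts when λ < 2/η and diverges geometrically when λ > 2/η, which is exactly the behavior that real networks at the edge of stability appear to defy.

```python
def gd_on_quadratic(lam, eta, steps=50, x0=1.0):
    """Gradient descent on f(x) = 0.5 * lam * x**2.

    The Hessian eigenvalue is lam, and each step multiplies x by
    (1 - eta * lam), so classical theory predicts divergence exactly
    when lam > 2 / eta.
    """
    x = x0
    for _ in range(steps):
        x -= eta * lam * x  # gradient step: f'(x) = lam * x
    return abs(x)

# Stable regime: lam = 1 < 2/eta = 20, the iterate shrinks toward 0.
# Unstable regime: lam = 25 > 20, |x| grows geometrically.
```

On a quadratic the prediction is exact; the puzzle is that neural network training loss keeps decreasing even while λ₁ sits above 2/η.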

Prior explanations relying on pointwise properties (Hessian trace, spectral norm) fail to capture this phenomenon because they ignore the ensemble behavior of the attractor set.

Main Contribution

The paper's central insight: characterize generalization through the geometric properties of the random attractor itself, not individual solutions.

They prove that:

  1. Sharpness Dimension (SD) < ambient dimension d with high probability at EoS
  2. Worst-case generalization error depends on SD, not parameter count d
  3. The complete Hessian spectrum structure matters, not just the trace or largest eigenvalue
  4. The attractor forms a fractal set with intrinsic dimension strictly smaller than the parameter space

This explains why overparameterized models generalize: the training dynamics naturally compress into a lower-dimensional manifold despite the high-dimensional parameter space.
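A rough intuition for why the full spectrum matters (point 3 above) can be shown with a participation-ratio proxy for effective dimension. To be clear, this is not the paper's Sharpness Dimension, which is a fractal dimension of the random attractor; it is a simpler illustrative statistic showing that two Hessian spectra with nearly identical trace can have wildly different effective dimensionality.

```python
def participation_ratio(eigenvalues):
    """Effective dimension of a spectrum: (sum λ)² / sum λ².

    Equals d for a perfectly flat spectrum of length d, and
    approaches 1 when a single eigenvalue dominates.
    """
    s1 = sum(eigenvalues)
    s2 = sum(v * v for v in eigenvalues)
    return s1 * s1 / s2

flat   = [1.0] * 100             # isotropic spectrum, trace = 100
spiked = [100.0] + [0.01] * 99   # one dominant direction, trace ≈ 101
```

Both spectra have almost the same trace, yet `flat` has effective dimension 100 while `spiked` is close to 1: any pointwise summary like the trace misses this distinction, which is the gap the attractor-level analysis is designed to close.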


Read more »