A detailed technical review of Kong et al.'s interpretable latency model for speculative decoding under real serving workloads. Using a roofline-style decomposition plus Little's Law, the paper collapses RPS-versus-latency curves onto a single universal form and gives a mechanistic explanation for why batch-size-1 speculative-decoding speedups erode under load.
A detailed technical review of Zero Sum SVD, which replaces per-layer rank optimization with a global, signed loss-sensitivity heap and a greedy zero-sum rule, letting heterogeneous per-layer ranks fall out of one scalar conservation law.
A detailed technical review of DisagMoE, which disaggregates attention and FFN layers onto separate GPU pools and stitches them together via the AF-Pipe schedule to hide the MoE all-to-all bottleneck during training.
A detailed technical review of DAPO, an open-source large-scale reinforcement learning recipe for reasoning LLMs using Clip-Higher, dynamic sampling, token-level loss, and overlong reward shaping.
A detailed technical review of MASPO, a joint prompt optimization method for multi-agent LLM systems that balances local, downstream, and global rewards.