
1. Long-Horizon Coherence

Question: As rollout horizon grows, do predictions remain usable?

Signature failure: Compounding error. Small per-step deviations (ε per step) accumulate to roughly Hε total error after H steps, pushing trajectories into impossible regions.

How to measure:

  • Plot task success rate vs. horizon
  • Look for graceful degradation (success drops smoothly) vs. cliff (success drops suddenly)
  • Example: Does a robot successfully grasp objects in 5-step rollouts? 10 steps? 50 steps?
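For concreteness, here is a minimal sketch of how such a success-vs-horizon curve could be computed; `model.rollout` and `env.task_success` are hypothetical stand-ins for whatever world model and success checker you use.

```python
def success_vs_horizon(model, env, horizons=(5, 10, 25, 50, 100), episodes=20):
    """Estimate task success rate at several rollout horizons.

    `model.rollout(state, horizon)` and `env.task_success(trajectory)` are
    assumed interfaces; substitute your own world model and success check.
    """
    curve = {}
    for h in horizons:
        successes = 0
        for _ in range(episodes):
            state = env.reset()
            trajectory = model.rollout(state, horizon=h)   # imagined rollout
            successes += int(env.task_success(trajectory))
        curve[h] = successes / episodes
    return curve   # plot this: graceful degradation vs. a sudden cliff
```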

Diagnostic findings:

  • Dreamer-based models typically remain coherent out to 50-100 steps for robotic manipulation
  • Video generation models (Sora, Genie) struggle beyond 10-20 seconds (severe compounding error)
  • Code reasoning (SWE-bench) requires coherence over hundreds of steps when fixing multi-file bugs

2. Intervention Sensitivity

Question: Does changing the action sequence produce meaningfully different trajectories?

Signature failure: Controllability failure. Model outputs the same trajectory regardless of action, making it useless for planning.

How to measure:

  • Counterfactual divergence: From same initial state, execute two different action sequences; measure how much resulting trajectories differ
  • Action sensitivity ratio: What fraction of action perturbations produce a detectable outcome change?
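A minimal sketch of these two metrics, assuming a hypothetical `model.rollout(state, actions)` interface and a caller-supplied `perturb` function:

```python
import numpy as np

def counterfactual_divergence(model, state, actions_a, actions_b):
    """Mean per-step distance between rollouts of two action sequences from the same state."""
    traj_a = model.rollout(state, actions_a)   # hypothetical rollout interface
    traj_b = model.rollout(state, actions_b)
    return float(np.mean([np.linalg.norm(np.asarray(sa) - np.asarray(sb))
                          for sa, sb in zip(traj_a, traj_b)]))

def action_sensitivity_ratio(model, state, actions, perturb, n_trials=50, threshold=1e-2):
    """Fraction of action perturbations that produce a detectable change in outcome."""
    hits = 0
    for _ in range(n_trials):
        if counterfactual_divergence(model, state, actions, perturb(actions)) > threshold:
            hits += 1
    return hits / n_trials
```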

Example:

  • In web automation: Inject a pop-up interrupt; does the agent replan or continue clicking blindly?
  • In dialogue: Change one agent's opening move; does negotiation outcome shift?
  • In robotics: Perturb object placement; does manipulation strategy adapt?

Current gap: Most benchmarks measure output quality (success rate, fidelity) but don't explicitly test action sensitivity. Closing this gap requires new evaluation protocols.

3. Constraint Consistency

Question: Do rollouts satisfy the governing laws throughout the entire trajectory?

Why this matters: Violations are often invisible per-step but catastrophic for planning.

Examples:

  • Physical: Object trajectories violate gravity or penetrate obstacles → imagined success is impossible
  • Digital: Browser predicts page loads, but actual API contract would fail (type mismatch, null return)
  • Social: Model predicts negotiation success assuming the user is price-sensitive, but the user actually prioritizes quality → plan fails
  • Scientific: Predicted phase doesn't satisfy thermodynamic stability constraints → synthesis fails

How to measure:

  • Physics: Check penetration depth, energy conservation, support-relation consistency
  • Code: Verify type-constraint satisfaction, API contract matching, exception handling
  • Social: Detect norm violations, commitment consistency, Theory of Mind accuracy
  • Science: Validate conservation law satisfaction, causal ordering, evidence-chain validity
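As an illustration of the physics checks in the first bullet, here is a small sketch; `obstacle_sdf` (a signed-distance function) and the energy arrays are assumed inputs, not part of any benchmark API:

```python
import numpy as np

def penetration_depth(object_positions, obstacle_sdf):
    """Maximum penetration into obstacles along an imagined trajectory.

    `obstacle_sdf(p)` is a hypothetical signed-distance function that returns
    negative values inside obstacles; any positive result flags a violation.
    """
    return max(max(0.0, -obstacle_sdf(p)) for p in object_positions)

def energy_drift(kinetic, potential, tol=1e-3):
    """Relative drift of total mechanical energy over an imagined rollout."""
    total = np.asarray(kinetic) + np.asarray(potential)
    drift = abs(total[-1] - total[0]) / (abs(total[0]) + 1e-9)
    return drift, drift < tol   # (drift, passes conservation check)
```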

Core Contribution 3: Unified Evaluation Framework

Beyond Prediction-Centric Evaluation

Traditional metrics focus on prediction accuracy: "Does the model predict the next frame well?"

But the paper argues this misses the point. A model with perfect next-frame prediction might fail at planning because:

  • It doesn't compose coherently over many steps
  • It's insensitive to action changes
  • It violates domain constraints

The alternative: Decision-centric evaluation. Ask: "Does the model enable good decisions for downstream agents?"

The Minimal Reproducible Evaluation Package (MREP)

The paper proposes a lightweight evaluation protocol with three tiers:

Tier 1: Basic Capability Check

  • Does the model make predictions at all?
  • Does it respect the correct input/output shapes?
  • Does it run without crashing?

Tier 2: Boundary Condition Verification

  • Long-horizon coherence: Plot success vs. horizon curve
  • Intervention sensitivity: Run action perturbation tests
  • Constraint consistency: Check domain-specific violations

Tier 3: Decision-Centric Performance

  • Can the model improve downstream agent performance?
  • Does fine-tuning on agent-relevant regions help more than improving overall prediction accuracy?
  • What's the sample efficiency gain from using the model vs. pure environment interaction?

Benchmark Coverage Gaps

The paper catalogs existing benchmarks and identifies major gaps:

Well-covered:

  • Physical robotics (RoboCasa, ManiSkill3, MetaWorld)
  • Some video generation (VBench for Sora)
  • Code agents (SWE-bench)
  • Embodied AI (Minecraft, Crafter)

Under-evaluated:

  • Social simulation (only Sotopia; needs more domains)
  • Scientific discovery (few benchmarks beyond climate/drug discovery)
  • Cross-regime transfer (when does knowledge from one regime help in another?)
  • Safety and calibration under distribution shift

Architecture and Implementation Guidance

Building Blocks Across Regimes

The paper identifies common architectural patterns:

State Representation:

  • Bottleneck architectures (learned latent codes): Compress observations to low-dim codes, predict codes, decode back to observations
  • Hierarchical representations: Different levels of abstraction for different time scales (immediate pixel changes vs. object trajectories vs. goals)
  • Modular representations: Separate channels for position, velocity, appearance, lighting

Dynamics Model:

  • Autoregressive: Predict each future step conditioned on previous predictions (classic but suffers compounding error)
  • Non-autoregressive: Predict full trajectory at once (faster but harder to condition on actions)
  • Latent dynamics: Predict in learned latent space (can be more stable)

Action Conditioning:

  • Concatenation: Append action to state before prediction
  • Multiplicative gating: Learned interaction between state and action
  • Hierarchical planning: Abstract high-level actions into low-level dynamics
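As a concrete example of the simplest option above, concatenation-based action conditioning, here is a minimal PyTorch sketch; the module name and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ConcatDynamics(nn.Module):
    """Minimal concatenation-conditioned dynamics model: s_{t+1} = f([s_t; a_t])."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        # Append the action to the state before predicting the next state.
        return self.net(torch.cat([state, action], dim=-1))
```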

Design Tradeoffs by Regime

Physical World

  • Favor: Explicit physics priors (Lagrangian mechanics, contact constraints)
  • Avoid: Pure learning from pixels (unless data abundant); insufficient for long-horizon planning
  • Sweet spot: Hybrid—learn what physics doesn't capture (material properties, deformations) while enforcing conservation laws

Digital World

  • Favor: Symbolic execution (compose known API behaviors); constraint solvers
  • Avoid: Pure neural prediction (APIs are discrete and deterministic; neural models are brittle)
  • Sweet spot: Neural models for understanding (parsing intent, inferring unobserved state) + symbolic engines for composition

Social World

  • Favor: Language models for dialogue generation; explicit Theory of Mind models
  • Avoid: Purely behavioral imitation (loses interpretability of agent models)
  • Sweet spot: LLM-based rollout with learned social belief updating

Scientific World

  • Favor: Physics-informed neural networks (PINN), operator learning (DeepONet), Bayesian surrogate models
  • Avoid: Pure black-box learning (need interpretability and uncertainty quantification for hypothesis-driven experiments)
  • Sweet spot: Surrogate models with uncertainty + active learning for new experiments

Failure Modes and Limitations

Beyond the boundary-condition failures (compounding error, controllability, constraint violation), the paper identifies broader challenges:

L1 Failures

  • Mode averaging: Multiple plausible futures collapse into blurry average (partially addressed by VAEs, diffusion models)
  • Stochasticity: True randomness hard to capture in deterministic neural models
  • Long-tail events: Rare scenarios poorly represented in training data

L2 Failures

  • Distribution shift: Model works on training regime but fails on slight variations
  • Exploitation: Agent finds "cheats" that work in simulation but violate constraints (e.g., walking through walls, using impossible API calls)
  • Insufficient compositionality: Single predictors don't combine smoothly; joint training required

L3 Failures

  • Attribution ambiguity: Which component of the model failed? (friction? contact model? object representation?)
  • Overcorrection: Updating model to fix one failure case creates new failures elsewhere
  • Feedback loops: If model guides agent exploration, data becomes biased; agent avoids regions model is uncertain about

State-of-the-Art Systems

By Application Domain

Robotics: MuZero → Dreamer → LEXA

  • MuZero learns abstract dynamics for value estimation
  • Dreamer adds visual fidelity + RL from imagination
  • LEXA adds long-horizon exploration guided by learned models

Code/Web Agents: TextRL → SWE-agent → OAC

  • Early: Script-based simulators (limited to Bash, Python)
  • Current: LLM-based trajectory sampling (more general but less constraint-aware)
  • Next: Hybrid symbolic + neural for constraint satisfaction

Video Generation: Variational Video Autoencoders → Video Diffusion → Sora/Genie

  • VAV: Learned latent dynamics (precise but low fidelity)
  • Diffusion: High fidelity but slower inference, less action-conditioned
  • Sora: Multimodal training (video + text), 1-2 minute generation

Scientific Discovery: Traditional Bayesian optimization → Neural surrogates → Active learning loops

  • Bayesian: Principled uncertainty, expensive
  • Neural: Fast inference, calibration challenging
  • Active learning: Combines both for sample efficiency

Open Problems and Research Directions

Fundamental Challenges

  1. Cross-regime transfer: Can a world model trained on one regime (e.g., physics) help in another (e.g., social)?

    • Tentative answer: Possibly, if learning hierarchical abstractions
  2. Constraint generalization: How do models learn that constraints hold across domains they haven't seen?

    • Challenge: Physics holds everywhere, but social norms don't; models need to recognize this
  3. Closed-loop L3 design: How do you design agents that safely revise their own models?

    • Requires: Interpretability, anomaly detection, version control for learned models, regression testing
  4. Scalability: Current video generation (Sora) works for ~1 min; can we scale to hours?

    • Bottleneck: Compounding error, compute scaling, attention mechanisms for long sequences

Architectural Directions

  1. Compositional learning: Can we build world models from modular pieces (object detectors, interaction rules) that compose reliably?

  2. Uncertainty quantification: Current models give point predictions; better uncertainty estimates could reduce exploration waste and enable better planning

  3. Adaptive latent spaces: Can models dynamically expand their state representation when encountering novel concepts?

  4. Neuro-symbolic integration: Deep learning for perception + symbolic reasoning for constraint satisfaction


Reproducibility and Implementation Notes

Data Requirements

  • Physical: Video + action annotations (millions of frames)
    • Example: Robotic manipulation datasets (RoboNet: 15M+ video clips)
  • Digital: Browser traces + API logs
    • Example: OSWorld (912 tasks), macOSWorld
  • Social: Dialogue corpora + metadata (speaker relationships, outcomes)
    • Example: Sotopia scenarios
  • Scientific: Experimental logs + measurements
    • Example: Benchmark datasets from literature

Typical Training Procedure

1. Collect trajectory data D = {(s_t, a_t, s_{t+1})}
2. Train L1 predictor:
   - Loss: E[(s_{t+1} - f_θ(s_t, a_t))²] + KL divergence (for uncertainty)
   - Validate: Next-frame accuracy, distribution drift
3. Scale to L2:
   - Compose predictions over horizon H
   - Validate: Constraint consistency, action sensitivity
4. Deploy with closed-loop improvement (L3 potential):
   - Log environment vs. predicted divergences
   - Analyze failure patterns
   - Update model incrementally
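A minimal sketch of the step-2 loss above, assuming the predictor outputs a Gaussian mean and log-variance and that the KL term is taken against a unit-Gaussian prior (a common simplification; names are illustrative):

```python
import torch
import torch.nn.functional as F

def l1_predictor_loss(pred_mean, pred_logvar, next_state, kl_weight=1e-3):
    """Reconstruction + KL regularizer, mirroring step 2 of the procedure above."""
    recon = F.mse_loss(pred_mean, next_state)          # E[(s_{t+1} - f_θ(s_t, a_t))²]
    # KL between the predicted Gaussian and a unit Gaussian prior (assumed prior).
    kl = -0.5 * torch.mean(1 + pred_logvar - pred_mean.pow(2) - pred_logvar.exp())
    return recon + kl_weight * kl
```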

Computational Cost

  • Training L1: GPU-weeks for visual models (depends on data scale)
  • Inference: Real-time for robotics (∼10 ms per step), interactive for code/web (hundreds of ms for multi-step reasoning)
  • L3 updating: Continuous background process (efficient retraining on new examples)

Verdict and Impact

Strengths

  1. Conceptual unification: The levels × laws framework aligns fragmented communities
  2. Comprehensive scope: 400+ papers synthesized with clear organization
  3. Practical guidance: Implementation roadmaps for each regime
  4. Honest assessment: Open problems clearly stated; no false consensus

Limitations

  1. Framework maturity: L3 exists mostly in theory; few deployed systems
  2. Benchmark gaps: Evaluation infrastructure incomplete across regimes
  3. Generalization unclear: How do insights from robotics transfer to code? To science?

Who Should Read This?

  • Researchers building world models (RL, vision, agents) → essential unification framework
  • ML engineers deploying agentic systems → architectural guidance and failure mode catalogue
  • Science administrators → roadmap for AI-driven discovery
  • Policy makers → understanding agent capabilities and limitations

Future Impact

This paper may become the standard taxonomy for world models across AI—similar to how transformer papers unified NLP architectures. The levels × laws framework provides the conceptual foundation for:

  • Comparing progress across domains
  • Identifying and plugging research gaps
  • Building safer, more interpretable agents that revise their own models

The move from L1 → L2 → L3 reflects an implicit progression: from passive prediction to active simulation to autonomous adaptation. L3 remains largely open; papers that crack reliable L3 systems (robotics with online model updating, AI-driven science with closed-loop discovery) will define the next era of agentic AI.


Key Takeaways

  1. World models are not one thing: The same term applies to different capabilities (L1/L2/L3) and constraints (physical/digital/social/scientific)

  2. Capability levels matter more than prediction accuracy: A model that perfectly predicts next frames but can't compose or respond to actions is useless for planning

  3. Domain laws are non-negotiable: Constraint violations (penetrations, type errors, norm breaches, causal inversions) make simulated plans unrealizable

  4. Evaluation must be decision-centric: Judge models by whether they improve downstream agent performance, not by prediction loss alone

  5. L3 is the frontier: Moving from L1/L2 (passive) to L3 (adaptive) requires solving interpretability, anomaly detection, and safe model revision—open challenges with major implications for AI safety

  6. Cross-regime insights exist: Robotics teaches us about compounding error; code teaches us about constraint checking; science teaches us about uncertainty quantification


Extended Resources

1. Executive Summary

OGER (Offline-Guided Exploration Reward) introduces a novel framework for enhancing Large Language Model (LLM) reasoning by seamlessly integrating offline teacher trajectories with online reinforcement learning. The key innovation lies in positioning offline data as a semantic reference point for computing auxiliary exploration rewards, rather than treating it as additional training samples.

The framework addresses critical limitations in current RLVR (Reinforcement Learning with Verifiable Rewards) approaches: the "echo chamber" effect where models converge to dominant pre-existing distributions, and entropy collapse that prevents novel solution discovery. By computing divergence-based exploration rewards and refining them through entropy-aware modulation, OGER achieves 4-7.9% improvements across mathematical and general reasoning benchmarks.



1. Mixture of Experts (MoE)

  • Each token routes to a small subset of expert networks (e.g., top-k routing)
  • Only the routed experts are computed; others are silent
  • Enables scaling to trillion-parameter models without proportional compute growth

2. Expert Parallelism (EP)

  • Distribute experts across devices: each device holds a disjoint subset
  • Token load per device depends on router decisions, which vary per micro-batch
  • Uneven load = some GPUs idle while waiting for others

3. NVLink & the Hopper Copy Engine

  • NVLink 4.0: 900 GB/s bidirectional bandwidth between intra-node GPUs
  • Hopper GPU: includes a dedicated Copy Engine (DMA-like unit)
  • Copy Engine operates independently from SMs—zero compute interference
  • Key capability: can move data while compute kernels run

4. Pipeline Parallelism (PP) & Synchronous Training

  • Multiple pipeline stages, all synchronized by slowest device
  • Imbalance directly impacts wall-clock time

5. Grouped GEMM

  • Specialized batched matrix multiply for MoE dispatch
  • Sensitive to batch size distribution; splitting tokens can degrade perf

The Problem: Load Imbalance in MoE

Root Cause

In expert parallelism, the router is learned end-to-end without constraints. It assigns tokens based on learned affinities to experts, not fairness. Result: per-device token counts vary randomly across micro-batches, even during stable training. This variation is data-dependent—different training corpora produce different routing distributions.

Why This Matters

Figure 1(b) from the paper shows the waste:

  • Token straggler: The slowest device gets more tokens than average.
  • GEMM straggler: Wall-clock time difference between slowest and average device.
  • Quantified waste: 18.6% of GPU time per MoE layer is lost to synchronization overhead.

At scale (128 experts, up to 16 H100s), this is enormous—hours of wasted compute per day.

Why Prior Work Fails

Three main approaches tried to fix this:

  1. Coarse-grained mitigation (auxiliary losses)

    • Force the router to produce balanced assignments
    • Constraints degrade model quality and expressiveness
    • Doesn't fully eliminate imbalance
  2. Dynamic scheduling with overhead (FasterMoE, Tutel, SmartMoE)

    • FasterMoE: replicate "hot" experts (shadow experts) and pipeline dispatch
    • Problem: splitting communication into stages adds communication volume rather than merely reducing latency
    • When routing changes unpredictably, these predictions degrade
    • Tutel: switches between parallel modes, but partitions weights → extra communication
  3. SM-based communication overlap (Triton Distributed)

    • Fuse computation and communication kernels
    • Problem: kernels consume SM resources, reducing available compute
    • Reduces efficiency instead of improving it

The deeper issue: Specialized MoE backends (DeepEP, FUSCO) do bulk transfers without staged delivery. You can't split their communication into fine-grained pipelined stages without paying extra volume penalty.


Design Principle: Orthogonal Parallelism

FEPLB's central insight is resource-level separation:

  • EP & PP use: RDMA NICs (inter-node) + GPU SMs (compute)
  • FEPLB uses: NVLink Copy Engine (intra-node, no SMs) + CPU scheduler

Because these resource sets don't overlap, FEPLB is a true new parallel dimension, not a scheduling trick. It coexists with existing parallelism without reconfiguration.

Two-Phase Dispatch

Phase 1 (EP routing):

  • Tokens route to their assigned devices via standard EP backend (e.g., DeepEP)
  • Static experts process normally on assigned devices
  • Dynamic expert tokens are collected into the NVLink domain (local node) for rebalancing
  • This phase uses standard EP communication (~50 GB/s over NVLink inter-node links)

Phase 2 (NVLink CE rebalancing):

  • CPU load balancer (running on dedicated thread) analyzes actual token distribution
  • Decides which dynamic expert weights to copy and where
  • NVLink Copy Engine redistributes both tokens and expert weights intra-node only
  • Operates at 900 GB/s, completely SM-free (no compute interference)
  • Happens concurrently with static expert computation on SMs

The Timeline Trick

Static experts serve dual purpose:

  1. Contribute to model output
  2. Provide a time window (their computation) during which CPU scheduler and weight copying finish

This window is usually sufficient—CPU load balancer runs in ~50 µs on a single core, well hidden.

Load Balancing Algorithm

Greedy, per-micro-batch:

  • Repeatedly select busiest dynamic expert on most overloaded device
  • Copy entire expert weights (not token-level splitting) to most underloaded device
  • Threshold τ prevents copying experts with too few tokens
  • Migrating whole experts preserves Grouped GEMM efficiency (batch size sensitivity)
  • Fully deterministic: same routing → same copy plan across all devices (no coordination needed)
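A minimal Python sketch of this greedy, deterministic rebalancing loop; the data structures, the whole-expert token migration, and the `max_copies` cap are simplifying assumptions, not the authors' implementation:

```python
def greedy_copy_plan(token_counts, dynamic_experts, tau=0, max_copies=8):
    """Greedy, deterministic copy plan as described above.

    token_counts: {device: {expert_id: token_count}} from this micro-batch's routing.
    dynamic_experts: set of expert ids eligible for rebalancing.
    Returns a list of (expert_id, src_device, dst_device) copy commands.
    """
    load = {d: sum(c.values()) for d, c in token_counts.items()}
    plan = []
    for _ in range(max_copies):
        src = max(load, key=load.get)            # most overloaded device
        dst = min(load, key=load.get)            # most underloaded device
        if load[src] <= load[dst]:
            break
        candidates = {e: n for e, n in token_counts[src].items()
                      if e in dynamic_experts and n > tau}
        if not candidates:
            break
        expert = max(candidates, key=candidates.get)   # busiest dynamic expert on src
        moved = candidates[expert]               # simplification: all of its tokens move with it
        load[src] -= moved
        load[dst] += moved
        token_counts[src][expert] = 0
        plan.append((expert, src, dst))
    return plan
```

Because every device sees the same routing decisions, running this function independently on each device yields the same plan, which is why no coordination is needed.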

Memory Overhead

Minimal:

  • Allocates buffer for max_num_dyn × expert weights per device
  • For GLM-5 (72 MiB per expert, dyn=8): 576 MiB per device
  • <0.7% of 80 GB HBM3—negligible

Key Results: From 18.6% Waste to 51-70% Improvement

Experimental Setup

  • Hardware: Up to 16 NVIDIA H100 SXM5 GPUs (80 GB HBM3), NVLink 4.0 (900 GB/s)
  • Model: Reduced GLM-5 (18 layers, preserving MoE architecture: 128 routed experts, top-k routing, no auxiliary loss)
  • Configurations: Three PP/EP settings:
    • PP=4, EP=2 (8 GPUs, 64 experts per device)
    • PP=4, EP=4 (16 GPUs, 32 experts per device)
    • PP=2, EP=8 (16 GPUs, 16 experts per device)

Performance Metrics

Two stragglers directly measure load imbalance:

  1. Token straggler: max(tokens_per_device) - mean(tokens)
    • Measures excess token count on slowest device
  2. GEMM straggler: max(GEMM_time) - mean(GEMM_time)
    • Measures wall-clock wasted time in compute
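Both metrics follow directly from their max-minus-mean definitions; a minimal sketch:

```python
import numpy as np

def token_straggler(tokens_per_device):
    """Excess token count on the slowest device: max - mean."""
    t = np.asarray(tokens_per_device, dtype=float)
    return float(t.max() - t.mean())

def gemm_straggler(gemm_time_per_device):
    """Wall-clock time wasted waiting on the slowest device: max - mean."""
    t = np.asarray(gemm_time_per_device, dtype=float)
    return float(t.max() - t.mean())
```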

Per-Layer Execution Time (Table 2)

| PP/EP (fwd/bwd) | Before LB | FasterMoE | Triton Dist. | Tutel | FEPLB |
|---|---|---|---|---|---|
| 4/2 | 8.2/14.9 | 7.9/14.0 | 13.1/22.8 | 8.0/17.1 | 7.9/14.4 |
| 4/4 | 7.3/13.2 | 6.9/12.2 | 15.3/24.0 | 7.2/15.2 | 6.8/12.1 |
| 2/8 | 6.9/12.5 | 6.3/11.1 | 22.8/30.0 | 6.8/14.5 | 6.0/10.6 |

Key observations:

  • Triton Distributed is 1.6-3.3× slower (fused kernels consume SMs)
  • Tutel adds 15-16% backward overhead (weight partitioning)
  • FEPLB consistently matches or outperforms all baselines
  • At high EP (2/8), FEPLB achieves strongest speedup

Load Balance Quality (Figures 5 & 6)

Token straggler reduction:

  • PP=4, EP=2: 51% reduction (FEPLB) vs 55% (FasterMoE)
  • PP=4, EP=4: 63% reduction (FEPLB) vs 47% (FasterMoE)
  • PP=2, EP=8: 70% reduction (FEPLB) vs 39% (FasterMoE)

GEMM straggler reduction:

  • PP=4, EP=2: 50% (FEPLB)
  • PP=4, EP=4: 62% (FEPLB)
  • PP=2, EP=8: 68% (FEPLB)

Critical insight: FasterMoE's advantage decreases with EP degree because prediction accuracy degrades under sparse, unpredictable routing. FEPLB's reactive approach improves—at EP=8, achieves 2× lower token straggler than FasterMoE.

Orthogonality Verification: EP Communication Overhead (Figure 4)

Does FEPLB interfere with EP communication?

  • Before LB: baseline (100%)
  • FEPLB: <1% overhead
  • FasterMoE pipe=1: ~0% (matches baseline)
  • FasterMoE pipe=2: +46.8% dispatch, +40.2% combine (breaks orthogonality!)

This validates the paper's claim: staged pipelining on DeepEP adds volume. FEPLB avoids this by operating on separate hardware path.

Sensitivity to Dynamic Expert Count (Figure 6)

Parameter dyn controls how many experts per device are eligible for rebalancing.

  • dyn=2: substantial reduction already
  • dyn=2→4: +1-3 pp improvement
  • dyn=4→8: +1-3 pp more (diminishing returns)
  • Practical default: dyn=4

Technical Deep Dive: Why This Works

1. Hardware Resource Separation

FEPLB's elegance is achieving orthogonality by construction:

| Dimension | Scope | Communication | Compute |
|---|---|---|---|
| EP | Inter-node | RDMA/NVLink | GPU SMs |
| PP | Inter-node | RDMA/NVLink | GPU SMs |
| FEPLB | Intra-node | NVLink CE | CPU |

No overlap = no interference. FEPLB doesn't compete with EP for NICs or with PP/EP for compute.

2. Why Whole-Expert Migration?

You might ask: why copy entire experts instead of moving individual tokens?

Answer: Grouped GEMM is sensitive to per-expert batch size. Under the roofline model:

  • Small batch → memory-bound, lower flops/cycle
  • Splitting tokens into smaller batches degrades efficiency
  • Migrating whole experts preserves batch size → maintains high efficiency

Trade-off: coarser granularity at low EP (e.g., 64 experts per device), but still wins overall.

3. Deterministic Load Balancing

The greedy algorithm is run independently on each device:

  • Input: routing decisions (same on all devices)
  • Output: same weight copy plan (no inter-device coordination)

This is crucial—avoids synchronization barriers during the critical micro-batch path.

4. Scaling with EP Degree

FasterMoE assumes stable, predictable routing. But at high EP:

  • Each device sees sparser token distribution
  • Router behavior becomes less predictable per device
  • Prediction-based approach degrades

FEPLB is reactive, not predictive:

  • Observes actual token distribution each micro-batch
  • Adapts immediately
  • No prediction, no degradation
  • Improves with more EP degrees

MoE Frameworks

  • GShard, Switch Transformer: First large-scale MoE models; used auxiliary losses
  • DeepSeek-V3, GLM-5: Trillion-parameter MoE systems; no auxiliary loss, live with imbalance
  • Megatron-LM, DeepSpeed-MoE: Industry infrastructure

Dynamic Scheduling

  1. Tutel: Adaptive EP/DP switching

    • Pro: simple
    • Con: partitions weights, adds communication
  2. FasterMoE: Shadow experts + pipelined dispatch

    • Pro: good at low EP
    • Con: degrades at high EP, prediction-based
  3. SmartMoE: Pre-computed strategy menu

    • Pro: offline planning
    • Con: not reactive, less flexible
  4. Triton Distributed: TP-parallel MoE with fused kernels

    • Pro: explores communication-compute overlap
    • Con: consumes SMs, reduces efficiency

Communication Libraries

  • DeepEP, FUSCO: Specialized bulk transfer
    • Design assumption: communication is not split into stages (staged delivery adds volume rather than reducing latency)
    • FEPLB is compatible with both

Deployment Guide: How to Use FEPLB

1. Hardware Requirements

  • NVIDIA Hopper GPUs (H100, H200) with NVLink 4.0
  • Intra-node connectivity (all-to-all NVLink)
  • Supporting frameworks: Megatron-LM with DeepEP dispatch

2. Configuration

Essential parameters:

dyn = 4  # default: 4 dynamic experts per device
τ = 0 # minimum token threshold (0 = no threshold)
max_num_dyn = 8 # max dynamic experts per device

Set these once, they're stable across runs.

3. Integration Points

  • Router predictor: Periodically optimize expert-to-device assignment (at checkpoints)
  • Per-micro-batch dispatch:
    • Phase 1: Standard EP routing (unmodified)
    • Phase 2: CPU scheduler analyzes tokens, issues NVLink CE copy commands
    • Combine: Return results to source devices

4. Implementation Checklist

  • [ ] Baseline: measure 18.6% waste on your model
  • [ ] Add CPU load balancer thread
  • [ ] Integrate NVLink CE copy commands (CUDA streams)
  • [ ] Tune dyn parameter for your EP degree
  • [ ] Verify EP communication overhead <1%
  • [ ] Validate load balance improvement (50%+ reduction expected)

5. No Auxiliary Loss Required

Major advantage: FEPLB works without auxiliary balancing losses. This means:

  • Better model quality (router isn't constrained)
  • Simpler training code
  • Works with any existing MoE architecture

Open Questions & Limitations

1. Coarse Granularity at Low EP

  • Whole-expert migration limits rebalancing flexibility when each device has many experts
  • At EP=2 (64 experts per device): improvement is smaller (~51% vs ~70% at EP=8)
  • Mitigated by tuning dyn, but fundamental trade-off remains

2. Intra-Node Scope

  • Current implementation: intra-node rebalancing only
  • Future: With all-to-all NVLink (e.g., NVIDIA SuperPod GB200 NVL72), Phase 2 could rebalance across 72 GPUs
  • Question: How much further can improvements go?

3. Router Predictor Adaptation

  • Paper mentions a "Router Predictor" for periodic expert-to-device assignment
  • Limited details on how this adapts to changing routing patterns
  • How often does it need to run? What's the overhead?

4. Interaction with Other Optimizations

  • How does FEPLB interact with speculative decoding, KV cache optimizations, or other recent MoE improvements?
  • Can improvements stack?

5. Generalization Beyond Top-k Routing

  • Paper focuses on top-k routing with learned router
  • What about other routing schemes (random, expert choice, etc.)?
  • Intuition: FEPLB should work, but not explicitly validated

6. Cross-Node Imbalance

  • FEPLB addresses intra-node imbalance only
  • What about imbalance across nodes (EP domain)?
  • Is this a separate problem requiring a different approach?

Why This Matters

For LLM Training Teams

  1. Instant 50-70% efficiency gain without model changes
  2. No auxiliary losses → better model quality
  3. Drop-in replacement for existing EP infrastructure
  4. Scales with EP degree → better at larger clusters

For Chip Designers

  • Copy Engine on Hopper was previously underutilized for MoE workloads
  • This paper shows one valuable use case
  • Suggests future hardware optimizations specifically for dynamic compute patterns

For Systems Researchers

  • Demonstrates that resource-level separation is a powerful design principle
  • Shows how to exploit idle hardware resources
  • Provides a blueprint for future orthogonal optimizations

Reproducibility Notes

Code Availability

  • Paper doesn't mention open-source release (yet)
  • Implementation details sufficient for reproduction:
    • CPU load balancer: greedy algorithm, ~50 µs per micro-batch
    • NVLink CE commands: standard CUDA stream API
    • Integration: on top of Megatron-LM + DeepEP

Experiment Reproducibility

  • Hardware clearly specified (H100 SXM5 + NVLink 4.0)
  • Model: reduced GLM-5 (18 layers)
  • Metrics: token straggler, GEMM straggler (easy to implement)
  • Baselines: FasterMoE, Triton Distributed, Tutel (all publicly available or reimplemented fairly)

Concern: Paper uses a reduced-layer GLM-5 for efficiency. Results on full-scale 78-layer model would be valuable.


Final Thoughts

FEPLB is a rare systems paper: it identifies a real problem (18.6% waste), explains why prior work fails, and proposes an elegant solution that exploits overlooked hardware.

The key insight—use NVLink Copy Engine as a new parallel dimension—is simple but consequential. By maintaining orthogonality through resource-level separation, FEPLB coexists peacefully with existing parallelism and provides immediate, measurable gains.

The results are compelling:

  • 51-70% token straggler reduction
  • 50-68% GEMM straggler reduction
  • Zero EP communication overhead
  • Works at scale (16 H100s)
  • No model changes required
  • No auxiliary losses needed

For any team training trillion-parameter MoE models on Hopper-class hardware, FEPLB is a must-implement optimization. It's the kind of work that becomes industry standard quickly.

Publication timing is notable: April 2026, during the era of DeepSeek-V3 and GLM-5 scaling. MoE load balancing is an active area, and this solution is timely and impactful.

Implementation Details & Engineering Insights

1. CPU Load Balancer Implementation

The CPU scheduler is surprisingly simple yet effective:

for each micro-batch:
  1. Wait for routing decisions from all GPUs
  2. For each device d:
     - Calculate per-expert token counts
     - Identify dynamic experts eligible for copy
  3. Greedy selection loop:
     while devices_unbalanced():
       - Find busiest dynamic expert on most-loaded device
       - Find least-loaded device
       - If token_count(expert) > threshold τ:
         - Issue NVLink CE copy command
         - Update bookkeeping
  4. Wait for all copies to complete
  5. Signal GPU to begin dynamic expert computation

Runtime: ~50 µs on single CPU core—completely hidden in static expert computation window (~2-5 ms).

2. Memory Management Strategy

Pre-allocation + reuse pattern:

Per-device buffer allocation:
weight_buffer = [max_num_dyn experts] × [expert_size]
For GLM-5: 8 × 72 MiB = 576 MiB

Reuse across all MoE layers (no per-layer allocation)
Freed only at training end

This is key—avoids allocation/deallocation overhead per layer.

3. Determinism & Reproducibility

One subtle strength: FEPLB's greedy algorithm is fully deterministic.

Given identical routing decisions, every GPU independently derives the same weight copy plan. No consensus protocol, no synchronization barriers—just local computation. This means:

  • Same training run produces identical results every time
  • Easy debugging
  • Reproducible experiments across sites

Compare to stochastic load balancing (random expert selection)—much harder to debug.

4. Integration with Megatron-LM

Minimum changes required:

  • Add CPU thread in training loop
  • Hook into DeepEP's dispatch path (Phase 1 unmodified)
  • Issue NVLink CE commands on GPU after routing
  • Synchronize before dynamic expert kernel launch

No changes to:

  • Router network
  • Loss computation
  • Backward pass
  • Gradient aggregation

Performance Analysis: Why FEPLB Wins at Scale

Why FEPLB Scales Better with EP

FasterMoE's prediction problem (detailed):

At low EP (e.g., EP=2):

  • Each device sees 64 experts
  • Routing is relatively stable per-device across micro-batches
  • Historical routing statistics are predictive
  • FasterMoE shadow expert replication works well

At high EP (e.g., EP=8):

  • Each device sees only 16 experts
  • Token assignment per expert becomes sparser and noisier
  • Historical statistics become less predictive
  • Prediction-based expert selection fails

FEPLB's reactive advantage:

  • Doesn't predict—observes actual distribution each micro-batch
  • Adapts immediately to changing patterns
  • Gets better at high EP because more expert diversity per node
  • At EP=8: 2× advantage over FasterMoE

This is a fundamental insight: for sparse, unpredictable distributions, reactive algorithms win.

Communication Efficiency Deep Dive

Why does FEPLB's <1% EP overhead matter so much?

Typical MoE layer on 16 GPUs:

  • Static expert dispatch: 100-200 µs
  • Dynamic token redistribution (FEPLB Phase 2): 50-100 µs (hidden in Phase 1)
  • Combine phase: 50-100 µs

Total: ~300 µs overhead added. Compare to:

  • FasterMoE pipelined dispatch: +46.8% = additional 50-100 µs

In a model with 50 MoE layers, FEPLB saves:

  • 50 × 50 µs = 2.5 ms per forward pass
  • Per training run (1M steps): 2.5 ms × 1M = 2,500 seconds = 42 minutes saved

Failure Cases & When FEPLB Doesn't Help

1. When Load Imbalance Isn't the Bottleneck

FEPLB is orthogonal to other bottlenecks:

  • Memory-bound compute: If Grouped GEMM is already memory-bound, load imbalance doesn't matter much
  • Communication-bound training: If inter-node EP communication dominates, intra-node FEPLB doesn't help
  • Pipeline bubble: If pipeline imbalance is worse than MoE routing imbalance, FEPLB is secondary

2. Low EP Degrees

At EP=2 (16 GPUs, 64 experts/device):

  • FEPLB still wins but advantage is smaller (~51% vs ~55%)
  • Coarser granularity: fewer experts to migrate
  • Other optimizations might be better

3. Perfect Routing (Hypothetical)

If you had perfect router (all devices get same token count):

  • FEPLB reduces to no-op (zero stragglers to fix)
  • But in practice, routing is never perfectly balanced due to:
    • Learning dynamics (router isn't optimized for balance)
    • Data diversity (different batches → different routing patterns)
    • Numerical stability (no perfect tie-breaking in routing)

4. Non-Hopper Hardware

  • Older GPUs lack efficient Copy Engine
  • Newer but different-architecture GPUs may have different bottlenecks
  • Results probably don't transfer directly

Cross-System Comparison Table

| Aspect | FasterMoE | Tutel | Triton Dist. | FEPLB |
|---|---|---|---|---|
| Token straggler @ EP=8 | 4,036 | 4,356 | N/A | 2,021 |
| GEMM straggler @ EP=8 | 0.625 ms | 0.743 ms | 1.4+ ms | 0.352 ms |
| EP overhead | <1% | <1% | Low | <1% |
| Auxiliary loss needed | No | No | No | No |
| Per-layer complexity | Shadow expert replication | Adaptive switching | Fused kernels | NVLink CE copy |
| Scales with EP | Degrades | Degrades | Increases overhead | Improves |
| Implementation difficulty | Medium | Medium | High | Low |
| SM consumption | None | None | High | None |

Interesting Edge Cases

Case 1: Router Collapse

What happens if the router learns to send all tokens to one expert?

FEPLB handles gracefully:

  • Greedy algorithm detects this expert as busiest
  • Copies its weights to all other devices (up to dyn limit)
  • Routes remaining tokens across all devices
  • Immediate load balancing without retraining

Compare to FasterMoE: prediction-based approach would be completely off.

Case 2: Data-Dependent Routing Changes

Training data changes → routing pattern changes.

Example: Switch from English to Chinese mid-training (e.g., in a multilingual model).

  • Token affinities shift per expert
  • Expert load distribution changes
  • FasterMoE's historical statistics become stale
  • FEPLB re-adapts within one micro-batch

This is realistic in continuous-training scenarios (e.g., online model updates).

Case 3: Micro-Batch Size Variation

What if batch size changes?

FEPLB remains effective:

  • Larger batch = more tokens per expert
  • Small imbalances remain
  • FEPLB rebalances at new scale

Token straggler reduction likely stays similar because imbalance is typically data-dependent, not scale-dependent.


Future Research Directions

1. Token-Level vs. Expert-Level Migration

Current: whole-expert migration for Grouped GEMM efficiency.

Future: hybrid approach—split tokens for a few high-frequency experts, keep others whole. Requires:

  • Grouped GEMM variant that handles variable batch sizes
  • More sophisticated load balancer algorithm
  • Potential for even finer-grained balancing

2. Cross-Node Balancing

Current scope: intra-node (limited by NVLink topology).

When NVIDIA releases all-to-all NVLink hardware (e.g., 72-GPU SuperPod):

  • Phase 2 could rebalance across entire cluster
  • Remove the intra-node boundary on rebalancing
  • Potentially reach near-perfect balance

Question: Would this improve beyond 70% GEMM straggler reduction?

3. Interaction with TP-MoE

Tensor parallelism within MoE experts (combination of TP + EP).

FEPLB currently focuses on EP. How to optimize TP-MoE load balance?

  • May require cooperative scheduling between TP and EP dimensions
  • Interesting systems research question

4. Learned Load Balancer

Rather than greedy algorithm, could train a small neural network to predict optimal copy assignments.

Trade-offs:

  • More complex
  • Potentially better decisions in complex regimes
  • But loses determinism, reproducibility

This paper highlights an important trend: specialized hardware (Copy Engine) goes unused in general-purpose frameworks.

Similar examples:

  • Tensor cores (before cuBLAS optimization)
  • Async memory copy engines
  • Hardware accelerators for collective operations

FEPLB is one of many papers that will likely unlock previously-idle hardware for better performance. This suggests:

  1. For practitioners: Check your favorite hardware specs—there might be untapped resources
  2. For chip designers: Document hidden capabilities; they may unlock surprising optimizations
  3. For systems researchers: Hardware-software codesign is increasingly valuable

Reproducibility Roadmap

If you want to implement FEPLB:

Phase 0: Setup (1 day)

  • Obtain 4+ Hopper GPUs with NVLink
  • Install Megatron-LM with DeepEP
  • Establish baseline metrics on your model

Phase 1: Static Framework (1 week)

  • Implement CPU load balancer
  • Add NVLink CE copy commands
  • Verify EP communication overhead <1%

Phase 2: Dynamic Tuning (1 week)

  • Tune dyn parameter for your EP configuration
  • Measure token/GEMM straggler reduction
  • Validate no accuracy degradation

Phase 3: Production Deployment (2 weeks)

  • Integrate Router Predictor for periodic rebalancing
  • Run long-duration training (100K+ steps)
  • Compare wall-clock time vs. baseline

Total effort: ~4-5 weeks for experienced systems engineers.



1. What This Paper Does

Core Problem

The edge of stability phenomenon, discovered by Cohen et al. (2021), presents a theoretical puzzle: when training with sufficiently large learning rates η, the largest Hessian eigenvalue λ₁ frequently exceeds the stability threshold 2/η, implying the system should diverge according to classical optimization theory. Yet empirically:

  • Training loss continues to decrease
  • Model generalization often improves in this regime
  • The optimizer doesn't settle at a point but explores a bounded, chaotic set

Prior explanations relying on pointwise properties (Hessian trace, spectral norm) fail to capture this phenomenon because they ignore the ensemble behavior of the attractor set.

Main Contribution

The paper's central insight: characterize generalization through the geometric properties of the random attractor itself, not individual solutions.

They prove that:

  1. Sharpness Dimension (SD) < ambient dimension d with high probability at EoS
  2. Worst-case generalization error depends on SD, not parameter count d
  3. The complete Hessian spectrum structure matters, not just the trace or largest eigenvalue
  4. The attractor forms a fractal set with intrinsic dimension strictly smaller than the parameter space

This explains why overparameterized models generalize: the training dynamics naturally compress into a lower-dimensional manifold despite the high-dimensional parameter space.



SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference

Paper: Choi & Park, arXiv:2604.19623 (April 2026)
Focus: Efficient inference in edge-cloud hybrid systems through optimal evidence composition
Key Contribution: Demonstrates that coverage-aware patch selection outperforms importance-only methods under hard bandwidth constraints


What This Paper Does

This paper addresses a practical but underexplored problem in edge-cloud inference systems: how should the edge device select which image patches to transmit to the server when the uplink channel strictly limits the number of patches per request?

The standard approach—selecting patches by importance (attention score)—turns out to be fundamentally limited. The paper shows that this creates "coverage gaps": high-attention patches cluster in the same semantic region, wasting budget on overlapping information. SAGE proposes a simple but effective alternative that combines importance filtering with diversity-maximizing sampling, achieving 93% of the server's full-transmission accuracy while sending fewer than half the patches.

The insight is elegant: under hard budgets, every transmitted patch must count, so we should prioritize information coverage alongside importance.


Prerequisites: What You Need to Know

Edge-Cloud Hybrid Inference

In a typical edge-cloud system:

  • A lightweight edge model (e.g., DeiT-Tiny) runs on resource-constrained devices
  • When the edge is uncertain, it offloads to a powerful server (e.g., DeiT-Base)
  • The uplink channel has hard constraints: bandwidth caps, latency deadlines, energy budgets

For image classification, this means selecting which image information to transmit is critical.

Vision Transformers and Patch Tokens

ViTs break images into discrete patch tokens (e.g., 196 patches for a 14×14 grid). This discrete structure is crucial:

  • Early approaches relied on split computing: transmit entire feature maps (fixed size)
  • ViTs enable selective transmission: each patch is independent, so we can choose a subset
  • This transforms the problem from "compress the feature map" to "select which patches matter"

Attention-Based Importance

In prior work (Im et al., 2024), patches are ranked by the model's attention scores and the top-B are selected. This makes intuitive sense: high-attention patches are "important" to the model. However, this strategy assumes that individual patch importance translates directly to accuracy gains, which turns out not to be true under hard budgets.

Coverage and Redundancy in Deep Learning

Recent work on efficient ViTs (token pruning, token merging) has identified a key insight: importance-only selection is redundant. Methods like DivPrune and BAT show that diversity among retained tokens matters. However, this insight hasn't been applied to the communication setting, where the server has no access to discarded patches and cannot recover from redundancy.


The Problem: Why Importance Alone Fails

The Hard Budget Constraint

The paper formalizes a critical distinction: average-cost optimization vs. hard per-request budgets.

In prior work, the metric is average communication cost:

E[C] = Pr(offload) × E[C | offload]

This can be misleading. Low average cost doesn't guarantee that individual offloaded requests fit within the uplink constraint. In their experiments, even with budget B=64 (one-third of patches), over 99% of offloaded images exceed the budget under standard attention-based selection.

Why? Because images offloaded to the server are precisely the hard ones. Their attention distributions are flat and diffuse (high entropy), so importance-based selection retains 140-150 patches before reaching reasonable importance thresholds—far exceeding practical budgets.

The hard-budget formulation forces realistic deployability: every offloaded request must satisfy B, no exceptions.
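A toy illustration (made-up numbers) of why a low average cost can coexist with universal hard-budget violations:

```python
# Hypothetical numbers: 90 requests stay local (cost 0), 10 offload 150 patches each.
costs = [0] * 90 + [150] * 10
budget = 64  # hard per-request uplink budget B

average_cost = sum(costs) / len(costs)        # 15 patches per request on average
violations = sum(c > budget for c in costs)   # 10: every offloaded request exceeds B

print(average_cost, violations)  # low average, yet zero deployable offloads
```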

Empirical Evidence Against Importance-Only

The paper provides two compelling experiments:

Evidence 1: Individual importance doesn't predict value

  • Compare SAGE's selected patches against Attention Prefix
  • Patches SAGE adds have 3× lower server attention than patches SAGE drops
  • Yet SAGE improves accuracy by +2-4 percentage points
  • Interpretation: Value isn't individual importance; it's marginal contribution to information coverage

Evidence 2: Coverage has independent value

  • Four strategies compared: Random (no info), Uniform Grid (coverage only), Attention Prefix (importance only), SAGE (importance + coverage)
  • Uniform Grid (spatially uniform, no content awareness) outperforms Random by +6 pp at B=64
  • SAGE consistently achieves highest accuracy by combining both

This cleanly separates importance and coverage as distinct signals.


The Solution: SAGE Method

Design Principle

Importance filtering first, then coverage maximization.

The method has two stages:

  1. Prefilter by importance: Retain the top-2B patches by attention (candidates)
  2. Select by diversity: Among candidates, greedily choose patches that maximize coverage

Algorithm Details

Input: Attention vector a, Patch embeddings Z, Budget B
Output: Selected set S (|S| = B)

1. Prefilter: C ← top-2B patches by attention score
2. Normalize: Z ← L2-normalize embedding vectors
3. Seed: s₁ ← argmax(a[C])  [highest attention]
4. Greedy diversity selection:
   For t = 2 to B:
     For each candidate i in C \ S:
       similarity[i] ← max(cosine_sim(z_i, z_j) for j in S)
     Select: s_t ← argmin(similarity)
5. Return S
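A minimal NumPy rendering of this pseudocode (prefilter by attention, then greedy farthest-point sampling on L2-normalized embeddings); this is a sketch of one reading of Algorithm 1, not the authors' released code:

```python
import numpy as np

def sage_select(attention, embeddings, budget, prefilter_ratio=2):
    """Select `budget` patch indices: importance prefilter, then diversity maximization.

    attention: (N,) attention scores per patch; embeddings: (N, D) patch embeddings.
    """
    candidates = np.argsort(-attention)[: prefilter_ratio * budget]   # top-2B by attention
    z = embeddings[candidates]
    z = z / np.linalg.norm(z, axis=1, keepdims=True)                  # L2-normalize

    selected = [int(np.argmax(attention[candidates]))]                # seed: highest attention
    while len(selected) < budget:
        sims = z @ z[selected].T                                      # cosine sims to selected set
        max_sim = sims.max(axis=1)
        max_sim[selected] = np.inf                                    # exclude already-selected
        selected.append(int(np.argmin(max_sim)))                      # farthest-point step
    return candidates[selected]                                       # indices into the N patches
```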

The key insight: After prefiltering, don't use attention scores anymore. Only embedding similarity matters. This ensures all selected patches are reasonably important while maximizing their diversity.

Why This Works

Intuition: If two patches have similar embeddings, they encode similar semantic features. By selecting patches with low maximum similarity to the current set, we ensure the evidence covers diverse aspects of the image.

Computational efficiency: No training, no fine-tuning. Requires only:

  • One forward pass to extract embeddings and attention
  • O(B² × D) operations for greedy selection (negligible compared to inference)

Flexibility: Works with any frozen pretrained ViT; no model modifications.

Hyperparameter: Prefilter Ratio

The only hyperparameter is the prefilter size (top-kB where k is typically 2):

  • k=1: Reduces to Attention Prefix (no diversity room)
  • k=2: Default; provides candidates while constraining importance
  • k≥3: Admits low-importance patches; noise offsets diversity gains

Their ablation shows k=2 is robust.


Experimental Setup and Results

System Configuration

Models:

  • Edge: DeiT-Tiny (5.7M parameters)
  • Server: DeiT-Base (86M parameters)
  • Both pretrained on ImageNet-1K, frozen (no fine-tuning)
  • Images produce 196 patches (14×14 grid, 16×16 patch size)

Dataset: ImageNet-1K validation (50,000 images, 1,000 classes)

Offloading: Controlled by confidence gate parameter η

  • Lower η: more offloading, lower local accuracy
  • Higher η: less offloading, higher local accuracy
  • Under η=1.0, approximately 35.7% of images (17,829) are offloaded

Metrics:

  • Offloaded accuracy: accuracy on images sent to server (direct measure of evidence quality)
  • Overall accuracy: system-level accuracy (local + offloaded images)

Main Results (Table III)

| Budget B | Method | Offloaded Acc | Overall Acc |
|---|---|---|---|
| 32 | Random | 10.9% | 61.4% |
| 32 | Uniform Grid | 9.8% | 61.0% |
| 32 | Attention Prefix | 16.9% | 63.5% |
| 32 | BAT | 15.6% | 63.0% |
| 32 | SAGE | 19.2% | 64.3% |
| 48 | Attention Prefix | 34.0% | 69.6% |
| 48 | BAT | 35.0% | 70.0% |
| 48 | SAGE | 38.4% | 71.2% |
| 64 | Attention Prefix | 47.3% | 74.4% |
| 64 | BAT | 49.0% | 75.0% |
| 64 | SAGE | 50.2% | 75.4% |
| 96 | Attention Prefix | 57.4% | 76.4% |
| 96 | SAGE | 60.2% | 79.0% |

Server ceiling (all 196 patches): 64.4% offloaded, 80.4% overall

Key finding: SAGE achieves 93% of the server ceiling at B=96 (just under half the patches) with +2-3 pp gains across tight budgets.

Ablation Studies

Effect of confidence gate (η):

  • SAGE advantage widens as budget decreases
  • At B=32, gain exceeds +2 pp across all η values
  • Gain largest for hardest images (highest η), reaching +3.0 pp

Prefilter size (k ratio):

  • k=2 is robust; k=1 collapses to Attention Prefix
  • k≥3 admits low-importance noise
  • Confirms importance filtering is essential

Where SAGE helps most:

  • Partitioned offloaded images by attention entropy
  • High-entropy images (flat attention): +5.7 pp gain at B=48
  • Low-entropy images (concentrated attention): +2.8 pp gain
  • Intuition: Hard cases with diffuse attention benefit most from coverage diversity

Spatial Coverage Quantification (Table II)

Measured coverage on 7×7 coarse grid (fraction of spatial cells with ≥1 patch):

| Budget B | Attention Prefix | SAGE | Δ (pp) |
|---|---|---|---|
| 16 | 25.1% | 27.2% | +2.1 |
| 32 | 43.0% | 46.0% | +3.0 |
| 48 | 56.9% | 59.8% | +2.9 |
| 64 | 67.9% | 70.4% | +2.5 |
| 96 | 83.9% | 85.2% | +1.3 |

Observation: SAGE consistently achieves broader spatial coverage, with largest gaps at tight budgets where it matters most.
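A small sketch of how this coverage metric can be computed, assuming the standard 14×14 patch grid mapped onto a 7×7 coarse grid:

```python
import numpy as np

def spatial_coverage(selected_patches, grid=14, coarse=7):
    """Fraction of coarse cells containing at least one selected patch.

    selected_patches: indices into the 14x14 patch grid (0..195).
    """
    idx = np.asarray(selected_patches)
    rows, cols = idx // grid, idx % grid
    cells = set(zip(rows * coarse // grid, cols * coarse // grid))   # map to 7x7 cells
    return len(cells) / (coarse * coarse)
```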

Qualitative Analysis (Figure 4)

Visual inspection confirms:

  • Attention Prefix: Concentrates selections around the single most salient region (red)
  • SAGE: Distributes patches across complementary image areas (blue)
  • Example: 53% → 78% coverage, enabling the server to "see" multiple object parts

Key Insights and Limitations

What We Learn

  1. Hard budget constraints are fundamentally different. Average-cost optimization provides no deployability guarantee; individual requests can exceed the uplink limit.

  2. Importance and coverage are orthogonal. Value doesn't come from individual patch importance but from marginal contribution to information diversity. This bridges computational efficiency and communication efficiency literature.

  3. Coverage carries independent value. Uniform Grid (no content awareness) achieves 96% of Attention Prefix accuracy at B=64, proving spatial coverage alone is substantial.

  4. Training-free methods can outperform learned baselines. No fine-tuning needed; frozen embeddings suffice.

Limitations and Boundary Conditions

Limited scope:

  • Evaluated only on ImageNet-1K with DeiT models
  • Unclear how well this transfers to other domains (COCO, medical imaging, etc.)
  • Only Vision Transformers tested; CNN-based edge models unexplored

Scalability questions:

  • Assumes relatively small patch vocabularies (196)
  • Edge devices store full embeddings; memory footprint on ultra-constrained devices (IoT) not discussed
  • Computational overhead of iterative FPS on-device not deeply analyzed

Optimality gap:

  • SAGE is greedy; no proof of optimality
  • Could covering-aware selection with learned importance outperform SAGE? (Not tested)
  • The 2× prefilter ratio is fixed; adaptive ratios based on input difficulty not explored

Communication assumptions:

  • Assumes all patches have equal transmission cost
  • Real systems have variable overhead per patch (packet headers, compression)
  • Doesn't address cases where spatial locality helps (sparse transmission)

Generalization:

  • Confidence gate (η) is task-specific; unclear how sensitive SAGE is to gate tuning
  • All experiments use frozen models; fine-tuned edge models might exhibit different attention patterns

Reproducibility and Practical Deployment

Code and Data Availability

The paper is from KAIST (Korea Advanced Institute of Science and Technology). Standard ImageNet-1K is public. Implementation requires:

  • PyTorch with timm library (for pretrained ViTs)
  • Basic numpy/scipy for embeddings and cosine similarity
  • Approximately 50-100 lines of Python for the core SAGE algorithm

No learned parameters to train, so reproduction should be straightforward.

Implementation Considerations

On the edge device:

  1. Load pretrained DeiT-Tiny, compute local prediction and embeddings
  2. Extract attention scores from CLS token → patch attention
  3. Run SAGE prefilter + greedy selection (Algorithm 1)
  4. Transmit selected patches to server

Server side:

  • Load the transmitted subset of patch embeddings into the server model
  • Standard ViT inference

Latency considerations:

  • Edge inference (DeiT-Tiny): ~10-50ms
  • Embedding extraction + SAGE selection: ~1-5ms
  • Server inference: ~200-500ms (depends on network latency)
  • Total overhead for offloaded requests: minimal

Real-World Deployment

Figure 9 in the paper plots operating points across different devices (Orin Nano, RPi 5) and channels (NB-IoT, LTE-M, 5G, Wi-Fi). Key deployments:

  • IoT devices + NB-IoT: Tight budget (B=32), latency ~1s, 60-65% overall accuracy
  • Raspberry Pi + 5G: Moderate budget (B=64), latency ~0.1s, 75% overall accuracy
  • Edge GPU + Wi-Fi: Can achieve server ceiling with B≥96

The method is practical and deployable today.


Comparison to Prior Work

Attention-Based Selection (Im et al., 2024)

  • Limitation: No hard budget guarantee; easy images dominate average-cost metrics
  • SAGE advantage: +2-3 pp under deployable hard budgets

Token Merging (ToMe)

  • Merges similar tokens; SAGE outperforms by +0.5-1.5 pp
  • ToMe works within single-device context; SAGE accounts for zero-context server reception

BAT (Beyond Attentive Tokens)

  • SOTA importance-diversity balance for computational pruning
  • SAGE beats BAT by +0.4-3.6 pp when applied to communication setting
  • Key difference: BAT optimizes diversity among selected tokens; SAGE optimizes diversity across the image

Semantic Communication Approaches

  • DeepJSCC and others learn joint source-channel coding
  • SAGE is training-free; no fine-tuning overhead, immediate deployment
  • Trade-off: learned methods might achieve higher accuracy with model-specific optimization

Technical Depth: Embedding Diversity and Farthest-Point Sampling

Why Cosine Similarity on Embeddings?

The greedy selection (Algorithm 1) uses farthest-point sampling (FPS) on normalized patch embeddings:

s_t = \arg\min_{i \in C \setminus S} \max_{j \in S} \hat{z}_i^T \hat{z}_j

This maximizes the minimum distance to already-selected patches, ensuring coverage. The intuition:

  • Patch embeddings encode semantic features (color, texture, shape)
  • Low cosine similarity → different semantic content
  • Greedy FPS iteratively adds the most different patches

Why greedy works: Under soft budget constraints, greedy FPS provides ~log(N) approximation ratio for the maximum spread problem. Here, with N=196 and B≤96, the gap to optimal is small.

Relation to Information Theory

While not explicitly framed as such, SAGE implicitly maximizes information coverage:

  • Each patch carries information about different image regions
  • Selecting maximally-diverse patches ensures we don't "repeat" information
  • Under hard budget B, this is equivalent to maximizing conditional entropy given the transmission constraint

Deep Technical Analysis: Why Greedy FPS Suffices

The Farthest-Point Sampling Algorithm

Algorithm 1 uses a greedy approach to maximize coverage. At each iteration, it selects the patch that is most diverse from the already-selected set:

s_t = \arg\min_{i \in C \setminus S} \max_{j \in S} \cos(z_i, z_j)

This is the farthest-point sampling (FPS) algorithm, a classical technique in computational geometry.

Theoretical properties:

  • Greedy FPS provides a ~log(N) approximation ratio for the maximum spread problem
  • With N=196 patches, the gap to optimal is at most ~5-6%, negligible in practice
  • Computational complexity: O(B² × D) where D is embedding dimension
  • For D=768 (typical ViT) and B≤96, this is ~7M operations—marginal compared to inference (~1B operations)

Practical advantages:

  • Deterministic (no random sampling needed)
  • No learned parameters (no training required)
  • Robust across different image types
  • Parameter-free (after prefilter ratio selection)

Why Greedy Beats Exact Optimization Here

One might ask: could integer programming find a better patch set? The answer is likely yes, but:

  1. Computational cost: the search space contains (N choose B) subsets, astronomical for N=196 and B up to 96, and integer-programming solvers are worst-case exponential
  2. Marginal gains: Exact solution might improve by 0.2-0.3 pp at most
  3. Deployment friction: Learning an IP solver or heuristic requires domain expertise
  4. Generalization: Learned optimization may overfit to ImageNet

SAGE's greedy approach lands within a few percent of the optimum at O(B² × D) cost—a sweet spot for deployment.


Case Study: When Coverage Matters Most

Analyzing High-Entropy Images

The paper shows SAGE's largest gains (+5.7 pp) come from high-entropy images at B=48. Let's understand why:

Example scenario: Image with multiple objects

  • Attention Prefix selects top-48 patches by attention
  • These cluster around the largest/most salient object
  • The server "sees": object A in high detail, but no context about objects B, C, D
  • Server inference: "This is object A" (often confident but wrong)

SAGE's approach:

  • Prefilter: retain top-96 by attention (includes patches from all objects)
  • FPS: iteratively spread selections across objects A, B, C, D
  • Server inference: "Multiple objects present"—better grounding for decision

Quantified improvement:

  • Attention Prefix achieves 32% offloaded accuracy
  • SAGE achieves 38.4% offloaded accuracy
  • The +6.4 pp (20% relative improvement) reflects the value of diverse evidence

This pattern holds across all vision tasks where multiple semantic regions matter.

Low-Entropy Images: Where Coverage Helps Less

In contrast, for images with concentrated attention (e.g., a centered object):

  • Attention Prefix naturally concentrates selections on the relevant region
  • Random blocks outside the region add noise
  • SAGE's diversity gain is modest (+2.8 pp)
  • But SAGE still wins: even 2-3 pp improvements are significant at scale

Practical Deployment Guide

Real-World Latency Breakdown

Figure 9 provides latency estimates across device-channel pairs. Here's the detailed breakdown:

Edge device (DeiT-Tiny inference):

  • Orin Nano: 10-15 ms
  • Raspberry Pi 5: 40-60 ms
  • Inference + embedding extraction: +5 ms
  • SAGE selection (B=48): +3 ms
  • Total edge latency: 15-70 ms

Transmission (depends on channel):

  • Wi-Fi: 10-20 ms for B=48 patches
  • 5G: 5-10 ms
  • LTE-M: 50-100 ms
  • NB-IoT: 200-500 ms

Server inference (DeiT-Base):

  • T4 GPU: 100-150 ms
  • CPU-only server: 500-1000 ms

Total per-request latency:

  • Wi-Fi + GPU: 130-200 ms (excellent)
  • 5G + GPU: 110-170 ms (excellent)
  • LTE-M + GPU: 160-270 ms (good)
  • NB-IoT + CPU: 700-1700 ms (acceptable for non-real-time)
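To sanity-check these totals, a back-of-the-envelope estimator like the one below reproduces them from the per-component figures. The constants are assumed midpoints of the ranges quoted in this section, not measurements.

```python
# Rough end-to-end latency estimator (milliseconds) built from the ranges above.
# All constants are assumed midpoints of this section's figures, not measurements.
EDGE_MS = {"orin_nano": 12, "rpi5": 50}                  # DeiT-Tiny inference
SELECT_MS = 8                                            # embedding extraction + SAGE selection
CHANNEL_MS_PER_PATCH = {"wifi": 0.3, "5g": 0.15, "lte_m": 1.5, "nb_iot": 7.0}
SERVER_MS = {"t4_gpu": 125, "cpu": 750}                  # DeiT-Base inference

def total_latency_ms(device, channel, server, budget=48):
    transmit = CHANNEL_MS_PER_PATCH[channel] * budget
    return EDGE_MS[device] + SELECT_MS + transmit + SERVER_MS[server]

print(total_latency_ms("orin_nano", "wifi", "t4_gpu"))   # ≈ 160 ms, within the Wi-Fi + GPU range
print(total_latency_ms("rpi5", "nb_iot", "cpu"))         # ≈ 1140 ms, within the NB-IoT + CPU range
```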

Memory footprint:

  • DeiT-Tiny weights: 22 MB
  • Cached embeddings (196×768): 600 KB
  • Total per image: ~1 MB (manageable on IoT devices)

Deployment Checklist

Before deploying SAGE:

✅ Availability
- [ ] Pretrained ViT models available (DeiT, ViT, ImageNet pretrained)
- [ ] Network bandwidth >= 50 KB/s (for B=48 at ~3 KB/patch)
- [ ] Edge device has ≥200 MB RAM (for model + embeddings)

✅ Configuration
- [ ] Set confidence gate η based on local accuracy tolerance
- [ ] Choose budget B based on network SLA
- [ ] Test on representative data (10-20 sample images)

✅ Validation
- [ ] Verify offloaded accuracy meets target (≥50% for B=48)
- [ ] Check end-to-end latency (should match Figure 9)
- [ ] Monitor per-request variance (should be low with SAGE)

✅ Monitoring
- [ ] Track offloading rate (should be stable with good confidence gate)
- [ ] Log per-request latency distribution
- [ ] Alert if accuracy drops >2 pp (may indicate model drift)
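For context, the checklist above assumes an offloading loop roughly like the sketch below, with the confidence gate η and the budget B as the two knobs. The edge and server interfaces (`edge_model`, `server_client.infer`) are placeholders I invented for illustration; `sage_select` refers to the earlier sketch.

```python
def classify_with_offload(image, edge_model, server_client, eta=0.8, budget=48):
    """Illustrative edge-cloud loop: trust the edge when confident, else offload B patches."""
    # Edge pass returns class probabilities, patch embeddings, and CLS attention (assumed API).
    probs, patch_embeddings, cls_attention = edge_model(image)

    if probs.max() >= eta:                 # confidence gate: resolve locally
        return int(probs.argmax()), "local"

    # Otherwise select a budgeted, coverage-aware patch subset and offload it.
    keep = sage_select(patch_embeddings, cls_attention, budget)   # from the earlier sketch
    server_probs = server_client.infer(patch_embeddings[keep], positions=keep)
    return int(server_probs.argmax()), "offloaded"
```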

Comparison with State-of-the-Art Alternatives

vs. Split Computing (Traditional)

Traditional split computing partitions the model at a fixed layer, transmitting intermediate feature maps.

| Aspect | Split Computing | SAGE |
| --- | --- | --- |
| Flexibility | Fixed at deployment | Adaptive per request (via B) |
| Overhead | Monolithic features (~30-100 KB) | Selective patches (~10-50 KB) |
| Interpretability | Black-box features | Interpretable patch selection |
| Optimization | Coarse (layer-level) | Fine-grained (patch-level) |

When split computing wins:

  • CNNs on edge (ViT advantages don't apply)
  • Extremely tight budgets where even selective transmission fails
  • Custom model architectures without discrete tokens

When SAGE wins:

  • Any ViT-based system
  • Dynamic budget constraints (network conditions vary)
  • Need for per-instance optimization

vs. Learned Feature Compression (JSCC)

Joint source-channel coding methods (DeepJSCC) learn end-to-end compression.

| Aspect | DeepJSCC | SAGE |
| --- | --- | --- |
| Training cost | Substantial (100+ epochs) | None |
| Adaptation | Fixed after training | Runtime configurable |
| Interpretability | Learned codes (opaque) | Explicit patch selection |
| Theoretical guarantee | Approaches Shannon limit | Heuristic but reliable |
| Deployment friction | Retraining per new task | Plug-and-play |

Practical takeaway:

  • For research/offline applications: DeepJSCC likely wins on peak accuracy
  • For production systems: SAGE wins on time-to-deployment and flexibility

A hybrid approach—SAGE for rapid prototyping, then learning for production optimization—is a viable path.


Limitations and Open Questions

Fundamental Limitations

  1. Greedy suboptimality: FPS provides log(N) approximation. Could we do better?

    • Answer: Unlikely without exponential search
    • Practical relevance: ~5% gap is negligible for 2-3 pp accuracy improvement
  2. Fixed embedding space: Assumes the frozen ViT's embeddings capture task-relevant semantics.

    • When this breaks: Task-specific data (medical imaging) where generic embeddings fail
    • Solution: Fine-tune embeddings or learn a task-specific distance metric (adds complexity)
  3. Prefilter ratio (k=2) is heuristic: Why 2× and not 1.5× or 3×?

    • Answer: Empirically optimal for ImageNet
    • Concern: May not generalize to all domains

Open Research Questions

  1. Cross-domain generalization: Does SAGE work equally well on:

    • Medical imaging (lower variation, higher importance on details)
    • Surveillance video (motion cues)
    • Satellite imagery (spatial patterns, less object-centric)
  2. Adaptive prefiltering: Could we set k based on attention entropy?

    • Entropy high → increase k (more candidates for diversity)
    • Entropy low → decrease k (importance sufficient); a small sketch of this idea follows this list
  3. Learned importance weighting: Can we improve on attention-based importance with learned gates?

    • Trade-off: adds parameters, loses training-free advantage
    • Potential gain: might adapt better to specific edge models
  4. Theoretical analysis: Can we prove SAGE is optimal under specific conditions?

    • E.g., if embeddings are uniformly distributed, does greedy achieve optimality?
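On question 2, a minimal way to prototype entropy-adaptive prefiltering is to map the entropy of the CLS attention distribution onto the ratio k. The mapping below is purely illustrative and not something the paper proposes.

```python
import numpy as np

def adaptive_prefilter_ratio(cls_attention, k_min=1.5, k_max=3.0):
    """Map attention entropy to a prefilter ratio k (illustrative heuristic only)."""
    p = cls_attention / cls_attention.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    max_entropy = np.log(len(p))              # entropy of uniform attention
    frac = entropy / max_entropy              # 0 = peaked, 1 = spread out
    # Peaked attention -> importance suffices (small k); diffuse attention -> widen the candidate pool.
    return k_min + frac * (k_max - k_min)
```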

Conclusion

SAGE elegantly solves a practical problem often glossed over: when edge devices must offload under strict bandwidth limits, how should they choose what to send? The paper's key insight—that coverage matters as much as importance—is simple but was missing from prior work.

The method is:

  • Simple: Two-stage algorithm, no learned parameters
  • Effective: +2-3 pp improvements over attention-based methods under tight budgets
  • Deployable: Training-free, works with frozen models, minimal computational overhead
  • Well-analyzed: Clear ablations, coverage quantification, and real-world latency estimates
  • Generalizable: Applicable to any ViT-based edge-cloud system

For practitioners building edge-cloud inference systems, SAGE provides a principled, ready-to-deploy approach that works today. The paper makes an important contribution by formalizing the hard-budget constraint and demonstrating the value of coverage-aware evidence composition—insights that will likely influence future work in efficient collaborative inference.


References

[1] Choi, I., & Park, H. (2026). SAGE: Training-free semantic evidence composition for edge-cloud inference under hard uplink budgets. arXiv:2604.19623

[2] Im, J., et al. (2024). Attention-aware semantic communications for collaborative inference. IEEE Internet of Things Journal, 11(22).

[3] Long, S., et al. (2023). Beyond attentive tokens: Incorporating token importance and diversity for efficient vision transformers. CVPR.

[4] Bolya, D., et al. (2023). Token merging: Your ViT but faster. ICLR.

[5] Ranjbar Alvar, S., et al. (2025). DivPrune: Diversity-based visual token pruning for large multimodal models. CVPR.

[6] Kang, Y., et al. (2017). Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. ASPLOS.

[7] Shao, J., & Zhang, J. (2020). BottleNet++: End-to-end feature compression for device-edge co-inference. ICC Workshops.

1. Error Propagation in Multi-Step Tasks

When a draft makes a subtle mistake, standard SD's token-level verification doesn't catch it:

Draft Step 1: "The sum of 3 and 4 is 7"     p_target = 0.8  ✓ Accepted
Draft Step 2: "Multiply by 2 to get 15"     p_target = 0.7  ✓ Accepted
Draft Step 3: "The answer is 15"            p_target = 0.6  ✓ Accepted

Each individual token has reasonable probability, but the chain violates arithmetic. An external reward model would catch this immediately, but SD cannot.

2. Latency & Overhead of External Verifiers

PRMs typically require:

  • Separate forward pass through another model
  • Memory overhead to store PRM weights
  • Serialization overhead (can't parallelize PRM calls)
  • 30-50% additional latency

For real-time applications (interactive AI, live coding), this defeats the purpose of speculative decoding.

3. Limited Generalization

A PRM trained on math problems doesn't work well on code reasoning. Each new task domain requires retraining or fine-tuning.


Core Contribution: SpecGuard Framework

SpecGuard proposes a radical idea: use model-internal signals for verification instead of external models.

The key insight is that a language model already encodes trustworthiness indicators:

  1. Attention patterns show whether the model is paying attention to relevant context
  2. Log-probabilities indicate the model's own confidence

High-Level Architecture

For each reasoning step i:
├─ Draft Model samples k candidates: {ŷ_i^(1), ..., ŷ_i^(k)}
├─ Self-Consistency Selector picks the most coherent candidate
├─ Ensemble Verifier checks two signals:
│   ├─ Attention-Based Grounding (ABGV): Is this grounded in input?
│   └─ Log-Probability-Based (LPBV): Is the model confident?
└─ Decision:
    ├─ If both signals strong: Accept draft (fast path)
    └─ If either signal weak: Invoke target model (accurate path)

Key Innovation: Self-Consistency Selector

Instead of accepting the first draft output, SpecGuard samples k candidates and picks the one that appears most self-consistent.

This is inspired by "self-consistency prompting"—the idea that if you sample multiple reasoning paths from an LLM and pick the most common answer, you get better accuracy.

SpecGuard applies this selection per reasoning step inside the speculative decoding loop, not just as a final-answer voting heuristic.


Technical Deep Dive: Verification Mechanisms

Mechanism 1: Attention-Based Grounding Verification (ABGV)

Problem it solves: Detect hallucinations—tokens that sound plausible but aren't actually connected to the input.

How it works:

  1. Attention Rollout: For each output token, we compute cumulative attention weights across all layers using matrix multiplication:

    Rollout = A^(L) × A^(L-1) × ... × A^(1)

    This tells us: "How much influence does each input token have on this output token?"

  2. Grounding Score: Sum the attention weights that point back to the original input or previously validated steps:

    G(y_t) = Σ_{j ∈ Input} R_{y_t}[j]

    A score of 1.0 means "this output is 100% attributed to input context." A score of 0.1 means "this output is only 10% grounded—mostly made up."

  3. Step-Level Threshold: We take the minimum grounding score across all tokens in a step:

    G_min-step = min_t G(y_{i,t})

    This prevents a few grounded tokens from masking several hallucinating tokens.

Why this works: Genuine reasoning requires paying attention to prior context. Hallucinated content tends to have low attention to the input.

Memory optimization:

  • Store only the last 3 layers' attention (sufficient for grounding quality)
  • Sparsify attention weights < 0.01 (negligible impact, significant memory savings)
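Here is a minimal PyTorch sketch of the grounding computation described above, assuming per-layer attention matrices already averaged over heads. The residual-mixing constant and the tensor layout are my assumptions, not the paper's exact recipe.

```python
import torch

def attention_rollout(attentions):
    """attentions: list of (T, T) matrices, one per layer, already averaged over heads."""
    T = attentions[0].shape[0]
    rollout = torch.eye(T)
    for attn in attentions:                       # A^(1) first, A^(L) last
        # Mix in the residual connection and renormalize (a common rollout convention, assumed here).
        attn = 0.5 * attn + 0.5 * torch.eye(T)
        attn = attn / attn.sum(dim=-1, keepdim=True)
        rollout = attn @ rollout                  # after the loop: A^(L) x ... x A^(1)
    return rollout                                # rollout[t, j]: influence of token j on token t

def step_grounding_score(attentions, input_positions, step_positions):
    """Minimum grounding score over the tokens of one reasoning step (G_min-step)."""
    rollout = attention_rollout(attentions)
    grounding = rollout[:, input_positions].sum(dim=-1)   # G(y_t) for every token t
    return grounding[step_positions].min().item()
```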

Mechanism 2: Log-Probability-Based Verification (LPBV)

Problem it solves: Detect low-confidence predictions that might be wrong.

How it works:

  1. Log-Probability per Token: After generating each token, the model assigns a probability. We take the log:

    L(y_{i,t}) = log p(y_{i,t} | input, y_{i,<t})

    High log-prob (-0.5 to 0) = the model is confident; low log-prob (-5.0 to -2.0) = the model is uncertain.

  2. Step-Level Minimum: Again, we take the minimum across tokens:

    L_min-step = min_t L(y_{i,t})

    Even one very low-probability token indicates the model was unsure about this step.

Why this works: Erroneous or hallucinated steps often involve tokens the model generates with low confidence. The model "knows" it's making something up.

Connection to uncertainty quantification: This is similar to Bayesian uncertainty—the model's entropy over predictions indicates how uncertain it is about the answer.
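The confidence signal is much simpler to compute. A sketch, assuming we have the step's token ids and the logits that produced them:

```python
import torch
import torch.nn.functional as F

def step_min_logprob(logits, token_ids):
    """L_min-step: the lowest per-token log-probability within one reasoning step.

    logits:    (T, V) next-token logits aligned so logits[t] predicts token_ids[t]
    token_ids: (T,) tokens actually generated for this step
    """
    log_probs = F.log_softmax(logits, dim=-1)                     # (T, V)
    token_logprobs = log_probs.gather(1, token_ids.unsqueeze(1))  # (T, 1)
    return token_logprobs.min().item()
```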

Mechanism 3: Ensemble Verification & Adaptive Acceptance

Neither ABGV nor LPBV alone is sufficient. They're complementary:

  • ABGV detects hallucinations (high confidence but ungrounded)
  • LPBV detects uncertainty (low confidence, possibly grounded)

SpecGuard combines them with a weighted ensemble:

Score = β × LPBV_normalized + (1-β) × ABGV_normalized
Threshold: Score ≥ τ → Accept draft
Score < τ → Invoke target model

The paper finds that β ≈ 0.5 (equal weighting) works best, suggesting both signals are equally important.
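Putting the two signals together, the acceptance rule reduces to a few lines. The log-probability normalization range below is my assumption; the paper states only that each signal is normalized to [0, 1].

```python
def ensemble_accept(min_logprob, min_grounding, beta=0.5, tau=0.5, logprob_floor=-5.0):
    """Accept the drafted step if the weighted ensemble score clears the threshold.

    min_logprob:   L_min-step (<= 0); logprob_floor is an assumed lower bound used for scaling
    min_grounding: G_min-step, already in [0, 1]
    """
    # Normalize the log-probability signal into [0, 1] (assumed linear scaling).
    lpbv = max(0.0, min(1.0, 1.0 - min_logprob / logprob_floor))
    abgv = min_grounding
    score = beta * lpbv + (1.0 - beta) * abgv
    return score >= tau    # True: accept draft (fast path); False: invoke target model
```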

Concrete Example of Ensemble Decision:

Consider a reasoning step: "Therefore, we multiply both sides by 2 to get 14."

| Signal | Score | Status |
| --- | --- | --- |
| LPBV (min log-prob) | -1.2 (normalizes to ≈1.0) | ✓ Confident |
| ABGV (min grounding) | 0.8 | ✓ Grounded |
| Ensemble (β=0.5) | (1.0 + 0.8)/2 = 0.9 | ✓ Accept if τ ≤ 0.9 |

Contrast with a hallucinating step: "The answer is 42 because quantum mechanics."

| Signal | Score | Status |
| --- | --- | --- |
| LPBV (min log-prob) | 0.9 (normalized) | ✓ Confident |
| ABGV (min grounding) | 0.1 | ✗ Ungrounded |
| Ensemble (β=0.5) | (0.9 + 0.1)/2 = 0.5 | ✗ Reject if τ > 0.5 |

The hallucinated step looks good locally (high confidence) but scores low in ensemble because it lacks grounding in the problem context. This is precisely the failure mode standard SD exhibits.

Self-Consistency Selector Algorithm

The self-consistency selector operates as follows:

  1. Sample Phase: Draft model generates k candidate continuations, each starting fresh from the same context
  2. Similarity Scoring: Compute pairwise semantic similarity (e.g., using embedding distances or token overlap)
  3. Selection: Choose the candidate that maximizes average similarity to all other candidates
  4. Rationale: The most "central" candidate is most likely to represent the true distribution

This differs from simple "temperature sampling":

  • Temperature-based methods increase diversity but may sample implausible candidates
  • Self-consistency selector filters implausible outliers while preserving diversity

Why this helps SD: Standard SD without sampling commits to the first draft token. If that token is implausible but high-probability (due to dataset bias), it gets locked in. The selector avoids this by comparing multiple paths.
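A minimal sketch of the selector using token-overlap (Jaccard) similarity as the pairwise measure; the paper also mentions embedding distances, which would slot into the same place.

```python
def most_consistent_candidate(candidates):
    """Pick the draft whose content overlaps most, on average, with the other drafts.

    candidates: list of k candidate step strings sampled from the draft model
    """
    token_sets = [set(c.split()) for c in candidates]

    def jaccard(a, b):
        return len(a & b) / max(1, len(a | b))

    best_idx, best_score = 0, float("-inf")
    for i, ti in enumerate(token_sets):
        # Average similarity of candidate i to every other candidate.
        score = sum(jaccard(ti, tj) for j, tj in enumerate(token_sets) if j != i)
        score /= max(1, len(token_sets) - 1)
        if score > best_score:
            best_idx, best_score = i, score
    return candidates[best_idx]
```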


Experimental Evaluation

Benchmarks & Setup

SpecGuard is evaluated on 4 major reasoning benchmarks:

  1. MATH (500 competition math problems)

    • Requires step-by-step symbolic reasoning
    • Ground truth: final numerical answer
  2. GSM8K (8,500 grade-school math problems)

    • More tractable than MATH
    • Tests arithmetic and logical consistency
  3. MBPP (Mostly Basic Python Programming)

    • Code reasoning
    • Tests algorithmic thinking
  4. TabMWP (Table-based math word problems)

    • Requires grounding in table context
    • Tests context attribution (perfect for ABGV)

Main Results

| Benchmark | Model | Baseline SD | RSD (+ Reward) | SpecGuard | Latency Reduction |
| --- | --- | --- | --- | --- | --- |
| MATH | LLaMA 2 70B | 52.1% | 54.2% | 56.8% | -11.3% |
| GSM8K | LLaMA 2 70B | 91.2% | 92.1% | 94.8% | -10.8% |
| MBPP | LLaMA 2 70B | 76.3% | 77.8% | 80.2% | -11.5% |
| TabMWP | Qwen 72B | 68.5% | 70.1% | 73.6% | -11.2% |

Key findings:

  1. SpecGuard achieves 3.6% average accuracy improvement over baseline SD
  2. Performance exceeds reward-guided SD while being faster (RSD incurs latency)
  3. Latency improvement is consistent across domains (~11%)
  4. Speedup is slightly worse than theoretical maximum (due to extra verification overhead), but practical

Ablation Studies

The paper ablates each component:

| Configuration | MATH Accuracy | GSM8K Accuracy | Latency |
| --- | --- | --- | --- |
| Baseline SD | 52.1% | 91.2% | 1.0x |
| + LPBV only | 53.8% | 92.4% | 0.95x |
| + ABGV only | 54.2% | 93.1% | 0.96x |
| + Both (SpecGuard) | 56.8% | 94.8% | 0.89x |

Interpretation:

  • LPBV provides modest gains (confidence filtering works)
  • ABGV provides larger gains (grounding is more important for reasoning)
  • Together they're synergistic (better than additive)

Sensitivity Analysis

  1. Number of draft samples (k):

    • k=1: Standard SD behavior
    • k=2: Marginal improvement (~0.5% accuracy gain)
    • k=4: Best trade-off (most papers use this, ~2% gain)
    • k=8: Diminishing returns (~2.2% gain, 2x computation)
    • Interpretation: After k=4, the additional samples are highly correlated with earlier ones, providing minimal new information
  2. Layer subset for ABGV:

    • Last 1 layer: Insufficient (captures shallow attention only, loses ~1.2% accuracy)
    • Last 2 layers: Moderate (loses ~0.5% vs. last 3)
    • Last 3 layers: Sweet spot (Figure 3 in paper)
    • Last 6 layers: Minimal improvement (~+0.1%), higher memory (3x)
    • Interpretation: Middle layers capture semantic grounding; very deep layers (near output) are too specific to token choices
  3. Acceptance threshold τ:

    • Very strict (τ=0.9): Accuracy +4.2%, speedup 1.02x (rarely invokes target)
    • Slightly strict (τ=0.7): Accuracy +3.8%, speedup 1.08x
    • Balanced (τ=0.5): Accuracy +3.6%, speedup 1.11x (paper's choice)
    • Slightly permissive (τ=0.3): Accuracy +2.1%, speedup 1.14x
    • Very permissive (τ=0.1): Accuracy +0.8%, speedup 1.15x (mostly relies on target)
    • Interpretation: Sweet spot is τ ≈ 0.5 for most tasks; can be tuned per domain
  4. Weight parameter β:

    • β=0 (ABGV only): Accuracy +2.8%, speedup 1.10x
    • β=0.3 (ABGV-heavy): Accuracy +3.2%, speedup 1.11x
    • β=0.5 (balanced): Accuracy +3.6%, speedup 1.11x (paper's choice)
    • β=0.7 (LPBV-heavy): Accuracy +3.1%, speedup 1.10x
    • β=1 (LPBV only): Accuracy +2.2%, speedup 1.08x
    • Interpretation: Equal weighting works best; neither signal dominates

Practical Implications

1. Inference Cost Reduction

For typical deployed LLMs (using LLaMA 2 70B as target, 7B as draft):

Per-Token Latency Breakdown:

| Stage | Target-Only | SD | SpecGuard |
| --- | --- | --- | --- |
| Draft forward pass | — | 8ms | 8ms |
| Verification (parallel) | 5ms | 0.5ms | 1.2ms |
| Total per token | 5.0ms | 1.3ms | 1.5ms |
| Effective speedup | 1.0x | 3.8x | 3.3x |

The roughly 15% per-token latency overhead vs. standard SD (1.5ms vs. 1.3ms) comes from:

  • Attention rollout computation: ~0.4ms
  • Self-consistency sampling: ~0.3ms
  • Ensemble scoring: ~0.2ms

But this is more than compensated by:

  • 3.6% accuracy improvement (fewer rejected draft tokens)
  • Better error recovery (fewer error cascades)

For a 1000-token response:

  • Before: 5000ms (target model only)
  • Standard SD: 1300ms (3.8x speedup)
  • SpecGuard: 1500ms (3.3x speedup, but 3.6% better accuracy)
  • Cost reduction: 5000ms → 1500ms (70% faster overall)
  • Quality improvement: +3.6% accuracy (reasoning quality significantly up)

Real-world scenario: Math problem requiring 50 tokens of reasoning

  • Target-only: 250ms + computation for verification
  • SpecGuard: 75ms + better correctness (fewer downstream errors)
  • User perceives: Much faster AND more reliable answers

2. Scalability Without External Models

Unlike reward-guided approaches, SpecGuard:

  • Uses only the models already deployed (draft + target)
  • Requires no fine-tuning or task-specific models
  • Works across different reasoning domains
  • Can be applied to any reasoning task without retraining

3. Memory-Efficient Verification

Attention-based verification with sparsification and layer subset selection means:

  • Memory overhead: ~50-100MB (negligible compared to model weights)
  • No model loading: Don't need to load additional verifier models
  • Parallelizable: Can be computed during target model's verification pass

Limitations & Future Directions

Known Limitations

  1. Grounding Score Limitations

    • Attention rollout is known to conflate attention with attribution (Serrano & Smith 2019)
      • Attention pattern A→B doesn't guarantee A causally influenced the decision about B
      • May reflect information flow rather than reasoning dependency
    • Some spurious correlations may register as high grounding scores
      • Example: A token about "Apple" might attend to "fruit" in the input, appearing grounded even if reasoning about the company
    • Doesn't distinguish between copying context vs. reasoning with it
      • A step that directly copies from the input gets perfect grounding even if uncreative or irrelevant
    • Mitigation in paper: Uses minimum grounding across tokens, but doesn't fully resolve this
    • Research direction: Combine with gradient-based attribution methods (integrated gradients, etc.)
  2. Log-Probability Biases

    • Log-probability is heavily influenced by training data frequency
      • Common but incorrect tokens may still have high probability ("Apple is a fruit" has high prob even in company context)
    • Doesn't directly measure correctness, only confidence
      • Model can be very confident about wrong answers if trained on misleading data
    • Calibration issues across domains
      • Math problems vs. code generation have different probability distributions
    • Why it still works: Erroneous steps often involve rare tokens (backtracking, corrections), which have low probability
  3. Limited to Step-Level Reasoning

    • Requires that reasoning decomposes into clear "steps" separated by line breaks
    • May not apply well to tasks with continuous reasoning (story generation, dialogue)
    • Doesn't help if the draft fails at the token level within a step
      • SpecGuard accepts/rejects entire steps, not individual tokens
    • Breaks down for tasks without clear step structure
      • Creative writing, conversation, open-ended generation
  4. Parameter Tuning

    • The thresholds τ and weight β require calibration per model/domain
    • Paper doesn't provide clear guidance on how to set these
      • Just recommends τ=0.5, β=0.5 without systematic analysis
    • No meta-learning approach to automatically tune thresholds
    • Cross-domain transfer unclear
      • Can we use thresholds tuned on MATH for GSM8K? Paper doesn't say
  5. Computational Overhead

    • Sampling k candidates adds overhead (though minimal)
      • k=4 means 4 draft forward passes instead of 1
      • Mitigated by using smaller draft model, but still real cost
    • Attention rollout computation is non-zero
      • Requires storing attention matrices and performing matrix multiplications
      • Memory-optimized version uses 3 layers, but still not free
    • Best speedup is lower than theoretical maximum
      • Standard SD: ~3.8x speedup possible
      • SpecGuard: ~3.3x speedup achieved (13% tax for 3.6% accuracy gain)
    • Trade-off calculation: Is 0.5ms latency overhead worth 3.6% accuracy improvement?
      • Depends on application (interactive vs. batch), user tolerance, SLA requirements
  6. Generalization Concerns

    • All experiments use LLaMA 2 family (except one Qwen experiment)
    • Unclear if results generalize to other architectures (GPT, PaLM, etc.)
    • Does ABGV work for models with different attention mechanisms?
    • What about sparse attention, grouped-query attention, MLA (DeepSeek)? Not tested

Future Research Directions

  1. Hybrid Approaches: Combine SpecGuard with lightweight PRMs for high-stakes tasks
  2. Adaptive Thresholds: Learn τ and β from data rather than tuning manually
  3. Extended Verification: Use other internal signals (gradient magnitudes, hidden state norms)
  4. Cross-Model Verification: Can a different target model's attention patterns help verify draft outputs?
  5. Theoretical Analysis: Formal guarantees on error propagation under SpecGuard

Reproducibility & Implementation Notes

Key Implementation Details

  1. Attention Rollout Implementation

    • Use matrix multiplication with layer-wise averaging
    • Normalize to probability distribution
    • Batch process for efficiency
  2. Draft Sampling Strategy

    • Sample k=4 candidates (paper shows this is optimal)
    • Use temperature T=0.7 for diversity without excessive noise
    • Select candidate with highest self-consistency score
  3. Ensemble Combination

    • Normalize ABGV and LPBV to [0,1] independently
    • Weighted average with β=0.5
    • Apply sigmoid if needed for smoother thresholding
  4. Integration with Production SD

    • Should work with existing SD implementations
    • Minimal changes to draft/target pipeline
    • Can be toggled on/off for A/B testing

Computational Complexity

  • ABGV: O(L × H × N²) for N tokens, L layers, H heads (use sparse version: O(L × H × sN²) where s << 1)
  • LPBV: O(N) (just extract log-probabilities)
  • Total overhead: ~5-10% of target model inference time

Code & Resources

The authors should provide:

  • Reference implementation in PyTorch
  • Pre-computed ABGV statistics for standard models
  • Threshold calibration scripts
  • Benchmark scripts for MATH, GSM8K, MBPP

Conclusion

SpecGuard makes a compelling contribution to LLM inference efficiency by:

  1. Identifying a real problem in existing SD: token-level verification doesn't work for reasoning
  2. Proposing an elegant solution using model-internal signals: no external models needed
  3. Demonstrating consistent improvements across multiple benchmarks and reasoning domains
  4. Showing practical speedups that maintain or improve quality

The key insight—that models' own attention and confidence patterns can serve as verification signals—is intuitive yet powerful. This opens new directions for inference-time optimization without the overhead of external verifiers.

For practitioners:

  • If your LLMs handle reasoning tasks (math, code, planning), SpecGuard is worth trying
  • Implementation should be straightforward given standard SD infrastructure
  • Expected gains: 10-15% latency reduction + 3-4% accuracy improvement

For researchers:

  • The ensemble verification framework could extend beyond speculative decoding
  • The self-consistency selector at inference time is a neat idea worth exploring further
  • The attention-grounding insight could improve other verification tasks

References & Further Reading

  1. Leviathan et al. (2023) - Original Speculative Decoding paper
  2. Liao et al. (2025) - Reward-Guided Speculative Decoding (RSD)
  3. Wang et al. (2023) - Self-Consistency Prompting
  4. Serrano & Smith (2019) - Is Attention Interpretable? (important counterpoint on reading attention as attribution)
  5. Lightman et al. (2023) - Process Reward Models for Verification

1. Why this paper still matters in 2026

I think PipeDream is one of those papers that is easier to appreciate after the field has moved on.

If I explain it in one sentence, I would say:

PipeDream turned pipeline parallelism from a vague idea into a system-level recipe: profile the model, partition it automatically, keep multiple minibatches in flight, and repair the optimization semantics enough that training still converges.

That sounds modest today because pipeline parallelism is now normal vocabulary in large-model training. But in 2018, this was an important systems step.

The paper is historically important for at least four reasons.

  • It clearly shows that data parallelism is not always the right default. When models become large, or when interconnects are weak relative to GPU speed, weight synchronization becomes a real bottleneck.
  • It reframes pipeline parallelism as a joint scheduling and optimization problem, not just a diagram where layers are placed on different GPUs.
  • It identifies the subtle but crucial issue of parameter-version mismatch between forward and backward passes. That is the kind of detail that separates a classroom concept from a production system.
  • It anticipates a lot of the design space that later became standard in large-scale training stacks: stage partitioning, pipeline schedules, weight-version policies, stage replication, and runtime-managed buffer reuse.

I also think the paper is still useful for modern readers because it teaches a systems mindset that remains valid:

  1. first find the actual bottleneck,
  2. then pick the right parallelization dimension,
  3. then ask what semantic damage the optimization introduces,
  4. then engineer around that damage carefully.

That sequence is still exactly how good ML systems work today.


Read more »


1. Why this paper matters

If I had to explain this paper to a non-specialist in one sentence, I would say:

The paper teaches a large language model to make decent predictions from earlier layers, then uses the remaining layers as a built-in checker so that inference becomes faster without needing a second draft model.

That sounds simple, but it addresses a very real systems bottleneck.

Modern LLM inference is expensive because each generated token usually pays for the full depth of the model. If a model has 32 or 40 transformer layers, then every next token runs through essentially all of them. That is painful for three reasons:

  • latency is high,
  • GPU cost is high,
  • memory pressure becomes a serious deployment constraint.

A lot of acceleration work tries to reduce one of these costs by quantization, sparsity, pruning, or a separate draft model. Those are useful directions. But they all come with trade-offs:

  • quantization can hurt quality or require hardware-aware kernels,
  • sparsity often needs special kernels to pay off,
  • separate-model speculative decoding doubles some engineering complexity and increases memory footprint.

What LayerSkip tries to do is elegant in a systems sense:

  1. train one model so its intermediate layers are more predictive,
  2. let those early layers draft tokens,
  3. let the later layers verify and correct them,
  4. reuse shared computation and cache because draft and verification come from the same network.

I like this paper because it sits exactly at the boundary of model training design and serving systems design. It is not merely “here is a trick that is 3% better on one benchmark.” It is asking a deeper question:

Can we train the model so that its internal depth becomes more usable at inference time?

That is a powerful framing. Instead of treating inference optimization as something that happens only after training, the authors redesign training so that faster inference becomes natural.

The headline results justify paying attention:

  • up to 2.16× speedup on CNN/DM summarization,
  • up to 1.82× speedup on coding,
  • 2.0× speedup on TOPv2 semantic parsing,
  • and code/checkpoints are open sourced.

For an inference paper, that is already respectable. But the deeper contribution is conceptual: the paper turns one deep model into an ensemble of sub-models of different depths plus a built-in verifier.


Read more »


1. Why this paper is worth reading carefully

If I had to describe this paper in one very plain sentence:

It is not building yet another, larger reward model; it is trying to turn the reward model from a black-box scorer into a preference-review system that is decomposable, inspectable, and reweightable.

That matters a great deal in RLHF.

In many alignment pipelines, the component with the most hidden power is not PPO or DPO but the reward model:

  • It decides which kinds of answers count as "good";
  • Its biases get amplified by the policy optimization that follows;
  • Once it is wrong, the policy will steadily try harder in the wrong direction.

The most typical failure is verbosity bias:

  • The reward model implicitly prefers longer answers;
  • The policy model learns that "longer is safer";
  • Users end up with answers that are not better, just wordier, more roundabout, and often lower in information density.

So the paper's real question is not "can we build a reward model?" That was answered long ago.

It is asking a deeper question:

Can the reward model be built as a multi-dimensional, interpretable structure whose weights can be adjusted per scenario, reducing black-box bias and the risk of reward hacking?

I think this question is exactly the right one to ask.


Read more »