Zero Sum SVD: A Global, Loss-Aware Rank Budget for LLM Compression

Review date: 2026-05-15
Reviewer: Zhongzhu Zhou
Paper: Zero Sum SVD: Balancing Loss Sensitivity for Low Rank LLM Compression
Authors: Ali Abbasi, Chayne Thrash, Haoran Qin, Shansita Sharma, Sepehr Seifi, Soheil Kolouri (Department of Computer Science, Vanderbilt University)
arXiv: 2602.02848v1, 2026-02-02
Status: Preprint. Code at https://github.com/mint-vu/Zero-Sum-SVD.


Short answer

This paper revisits a question that I find every SVD-based LLM compression paper ducks one way or another: given a target global parameter budget, how do you decide how many singular values to keep in each layer?

Two conventional answers dominate the literature:

  • Homogeneous allocation: pick one global ratio ρ and apply it everywhere (SVD-LLM, ASVD).
  • Per-layer optimization: treat the truncation rank as a continuous variable and optimize it with backpropagation (Dobi-SVD).

Zero Sum SVD (ZS-SVD) is the third answer. It does not optimize per-layer ranks at all. Instead, it pools every singular value across the entire model into a single global candidate pool, scores each one with a signed first-order loss sensitivity ∆Lᵢ ≈ −σᵢ · gσ,ᵢ, and applies a greedy "zero-sum" rule that picks the next component so that the running sum of predicted loss changes stays close to zero. Heterogeneous per-layer ranks fall out automatically.

In one paragraph: ZS-SVD computes the whitened weight A = W·S (where S is the Cholesky factor of the activation covariance), takes its SVD A = UΣVᵀ, computes a calibration-loss gradient H = ∇_W L · S⁻ᵀ in the whitened parameterization, and uses ∆Lᵢ = −σᵢ · uᵢᵀ H vᵢ as a first-order estimate of how much L would change if we set σᵢ ← 0. Then it greedily picks the next singular value to prune such that the cumulative drift s = Σ ∆Lⱼ oscillates around zero. Each layer must still prune its own σ's in ascending order to preserve the local optimality of truncated SVD in whitened coordinates. An optional one-step "truncate–correct–re-truncate" cycle further closes the loss gap by leveraging the empirical low-rankness of LLM gradients near pretrained solutions.

Across LLaMA-7B / 2-7B / 13B / 30B, Vicuna-7B, and OPT-6.7B on WikiText2 / PTB / C4 perplexity and seven zero-shot reasoning tasks, ZS-SVD beats ASVD, SVD-LLM, Dobi-SVD, Dip-SVD, plus LLM-Pruner / SliceGPT / Wanda-sp / FLAP structured pruning, often by wide margins. At a 0.4 maintenance ratio on LLaMA-7B, the HQ variant (half-prune + quantize) holds WikiText2 PPL at 6.73 against a baseline of 5.68, a regime where vanilla SVD-LLM blows up to 53.74. Throughput improves up to 5.86× over uncompressed on RTX A5000 at 60% compression, and end-to-end truncation time is 15.9 min on LLaMA-7B versus 19.25 h for Dobi-SVD.

The most interesting structural insight, in my reading, is the substitution of a combinatorial rank-allocation problem with a single scalar conservation law (s ≈ 0). That kind of "swap optimization for a feasibility constraint" pattern is rare in the LLM compression literature but extremely common in numerical optimization and control theory, and I think it generalizes.


1. Prerequisites

This section is for readers who have trained transformers but have not internalized the SVD-compression line of work. I cover SVD, whitening, activation reconstruction, calibration loss, Fisher importance, and heterogeneous rank allocation. Skip ahead if you have read SVD-LLM (ICLR 2025).

1.1 Singular Value Decomposition in one paragraph

Every real m×n matrix A admits a factorization A = U Σ Vᵀ with U ∈ ℝ^{m×m} and V ∈ ℝ^{n×n} orthogonal and Σ ∈ ℝ^{m×n} diagonal with non-negative entries σ₁ ≥ σ₂ ≥ … ≥ 0. The rank-k truncation Aₖ = Uₖ Σₖ Vₖᵀ retains only the top-k singular triplets and is, by the Eckart–Young–Mirsky theorem, the best rank-k approximation of A in the Frobenius norm. Storage-wise, a rank-k factorization of W ∈ ℝ^{m×n} stores k(m+n) parameters instead of mn, which becomes a saving when k ≤ mn / (m+n).
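
To make the storage arithmetic concrete, here is a minimal NumPy sketch with toy dimensions (illustrative only, not from the paper):

import numpy as np

# Rank-k truncation and the parameter-count arithmetic from Section 1.1.
m, n, k = 512, 256, 64
W = np.random.default_rng(0).standard_normal((m, n))

U, s, Vt = np.linalg.svd(W, full_matrices=False)   # U: (m, r), s: (r,), Vt: (r, n)
W_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approx (Eckart-Young-Mirsky)

dense_params    = m * n                            # 131072
factored_params = k * (m + n)                      # 49152: a saving since k <= m*n/(m+n) ~ 170
err = np.linalg.norm(W - W_k, "fro")               # equals sqrt of the sum of squared dropped sigmas
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))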

1.2 Why direct SVD of W is insufficient

The naive recipe—SVD the weight, drop small σ's—is wrong for LLMs because W is used inside the product W·X. A direction with a small singular value in weight space can still be a high-energy direction along the empirical activation distribution; truncating it inflicts large activation reconstruction error. SVD-based LLM compression therefore folds activation statistics into the factorization.

1.3 Whitening

Estimate the activation second moment C = X·Xᵀ ∈ ℝ^{n×n} from a small calibration set, and find S satisfying S·Sᵀ = C (typically the Cholesky factor of C + λI for numerical stability). Define the whitened weight A = W·S. The key identity (Theorem 3.1 in the paper) is

‖W X − W' X‖²_F = tr[(W − W') C (W − W')ᵀ] = ‖(W − W') S‖²_F.

So minimizing activation reconstruction error in input space is equivalent to minimizing a Frobenius error in whitened weight space. By Eckart–Young, the optimal rank-k approximation in whitened space is the top-k SVD of A, mapped back via S⁻¹. This is the algebraic foundation underlying ASVD, SVD-LLM, Dobi-SVD, and now ZS-SVD.
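
A quick numerical check of this identity in a toy NumPy setting (illustrative only; the paper's pipeline uses real calibration activations):

import numpy as np

# Verify: activation reconstruction error equals a Frobenius error in whitened weight space.
rng = np.random.default_rng(0)
m, n, T = 8, 16, 1024                       # output dim, input dim, calibration tokens
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, T))             # calibration activations
C = X @ X.T                                 # activation second moment
S = np.linalg.cholesky(C)                   # the paper uses C + λI for numerical stability

# Whitened SVD, rank-k truncation, map back through S^{-1}
A = W @ S
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 4
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
W_k = A_k @ np.linalg.inv(S)

lhs = np.linalg.norm(W @ X - W_k @ X, "fro") ** 2   # activation reconstruction error
rhs = np.linalg.norm((W - W_k) @ S, "fro") ** 2     # whitened-weight Frobenius error
assert np.isclose(lhs, rhs)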

1.4 Calibration loss

Activation reconstruction is a local objective: it only controls the error at one layer's output. The end-to-end objective is the language modeling loss L on a calibration set (the paper uses 256 sequences of length 2048 from WikiText2). Truncating a σᵢ can interact non-trivially with downstream layers, so a small ‖W X − W' X‖ does not imply a small ΔL. The ZS-SVD contribution is to bridge this gap with a per-σ first-order estimate of ΔL.

1.5 Fisher importance and its limitations

Older SVD-compression work (FWSVD, Dip-SVD) uses the Fisher information F as a parameter-importance proxy. Fisher importance is a second-order (curvature) quantity: under MLE it approximates the expected Hessian of the negative log-likelihood, and its diagonal is the usual per-parameter importance score; high curvature in a direction means the parameter is loss-sensitive. The drawback is that Fisher-based weighting yields a heuristic per-matrix score rather than a per-singular-value score; it cannot directly answer "should I prune σ₂ in layer 4 or σ₁₇ in layer 12 first?" ZS-SVD trades the second-order Fisher view for a first-order, signed, per-σ score, then enforces global coherence with the zero-sum rule.

1.6 Post-training compression

ZS-SVD is post-training: no retraining, no fine-tuning, only a one-shot calibration on a small dataset. This makes it compatible with any published checkpoint and orthogonal to quantization, distillation, and pruning—they can be stacked.

1.7 Heterogeneous rank allocation

The seven matrices in a transformer block (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) differ in their loss sensitivity. Heterogeneous rank allocation gives each its own rank under a global budget, in contrast to homogeneous allocation (one ratio for all). Heterogeneous is empirically better but harder to compute: it requires either a combinatorial search (intractable) or a continuous relaxation (Dobi-SVD), or—as ZS-SVD shows—a clever feasibility rule that side-steps optimization.

1.8 First-order Taylor expansion of the loss

The core algebraic move of the paper. For any small ∆W,

L(W + ∆W) ≈ L(W) + ⟨∇_W L, ∆W⟩.

Setting σᵢ to zero corresponds to ∆A = −σᵢ uᵢ vᵢᵀ in whitened space, which back-translates to ∆W = ∆A · S⁻¹. Plug in:

∆Lᵢ ≈ ⟨∇_W L, ∆A · S⁻¹⟩ = ⟨∇_W L · S⁻ᵀ, ∆A⟩ = −σᵢ uᵢᵀ H vᵢ,

with H ≜ ∇_W L · S⁻ᵀ. Importantly, ∆Lᵢ is signed: σᵢ ≥ 0 but uᵢᵀ H vᵢ can be positive or negative. Some singular values, when pruned, are predicted to decrease the loss. This sign asymmetry is what the zero-sum selection exploits.
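
A minimal sketch of the score computation, using a random stand-in for the calibration gradient (illustrative only, not the paper's implementation):

import numpy as np

# Signed per-sigma score ΔL_i = -σ_i · u_iᵀ H v_i with H = G S^{-T}.
rng = np.random.default_rng(1)
m, n = 8, 16
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, 256))
S = np.linalg.cholesky(X @ X.T)                      # whitening factor
G = 0.01 * rng.standard_normal((m, n))               # stand-in for the calibration gradient ∇_W L

U, sigma, Vt = np.linalg.svd(W @ S, full_matrices=False)
H = np.linalg.solve(S, G.T).T                        # H = G S^{-T} without explicitly forming S^{-1}

delta_L = np.array([-sigma[i] * (U[:, i] @ H @ Vt[i, :]) for i in range(len(sigma))])
print(delta_L)                                       # signed scores: entries can be positive or negative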


2. Method

2.1 Pipeline overview

Pretrained LLM weights {W_ℓ}

├── Per-layer Cholesky → S_ℓ
├── Per-layer whitened SVD → A_ℓ = W_ℓ S_ℓ = U_ℓ Σ_ℓ V_ℓᵀ
├── One full backward pass → G_ℓ = ∇_{W_ℓ} L
└── Per-σ score → ∆L_{ℓ,i} = −σ_{ℓ,i} · u_{ℓ,i}ᵀ (G_ℓ S_ℓ⁻ᵀ) v_{ℓ,i}


Global greedy with two heaps (Q⁺ for ∆L ≥ 0, Q⁻ for ∆L < 0)
Pop from Q⁺ when s ≤ 0; from Q⁻ when s > 0.
Each layer respects internal ascending σ order.


Final per-layer ranks k_ℓ.


(Optional) Truncate–correct–re-truncate (×1, ×5, ×10).


Compressed weights {W'_ℓ}.

2.2 Computing per-σ sensitivity

For each layer ℓ:

  1. Compute the activation second moment C_ℓ = X_ℓ X_ℓᵀ on the calibration set.
  2. Cholesky-decompose C_ℓ + λI to get S_ℓ (lower-triangular, so S_ℓ⁻ᵀ is cheap).
  3. Form A_ℓ = W_ℓ S_ℓ and run SVD to get U_ℓ Σ_ℓ V_ℓᵀ.
  4. Run one full backward pass over the calibration mini-batch to obtain G_ℓ = ∇_{W_ℓ} L.
  5. Form H_ℓ = G_ℓ S_ℓ⁻ᵀ.
  6. For each retained σ_{ℓ,i}, compute the scalar gσ,i = uᵢᵀ H_ℓ vᵢ and store ∆L_{ℓ,i} = −σ_{ℓ,i} · gσ,i.

Steps 1–3 are the same cost SVD-LLM already pays. Step 4 is one backward pass on a 256×2048 mini-batch. Steps 5–6 are O(k·n²) per layer in the worst case but trivially batched. The end-to-end measurement on LLaMA-7B is 15.9 min vs 7.9 min for SVD-LLM and 19.25 h for Dobi-SVD. ZS-SVD basically buys "Dobi-SVD-grade accuracy at SVD-LLM-grade runtime."
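
As a sketch of the batching in steps 5–6: all k inner products uᵢᵀ H vᵢ form the diagonal of Uᵀ H V, so one contraction per layer suffices (hypothetical helper, assuming NumPy arrays for a single layer):

import numpy as np

def per_sigma_scores(U, sigma, Vt, G, S):
    """Return ΔL_i = -σ_i · u_iᵀ (G S^{-T}) v_i for every singular triplet of one layer."""
    H = np.linalg.solve(S, G.T).T                    # H = G S^{-T} without forming S^{-1}
    g_sigma = np.einsum("mi,mn,in->i", U, H, Vt)     # diag(Uᵀ H V) in one fused contraction
    return -sigma * g_sigma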

2.3 The zero-sum selection rule

Maintain s = Σ_{j∈D} ∆Lⱼ where D is the set of currently pruned components. The selection rule is

  • if s ≤ 0, prefer Q⁺ (a component with ∆Lᵢ ≥ 0, predicted to increase loss);
  • if s > 0, prefer Q⁻ (a component with ∆Lᵢ < 0, predicted to decrease loss).

Within each heap, candidates are keyed by |∆Lᵢ| in ascending order, so the next pop is the one with the smallest predicted magnitude on the chosen sign. The intuition is: never let the cumulative predicted drift wander far from zero, so that the first-order Taylor expansion remains valid; pick the smallest available impact in the corrective direction to minimize the perturbation magnitude.

Crucially, each layer must internally prune in ascending σ order—you cannot skip ahead within a layer. This is a hard constraint, not a soft preference. The reason is that the Eckart–Young optimality of truncated SVD in whitened space relies on dropping the smallest σ's; if you allow within-layer reordering, the whitened-space reconstruction error is no longer minimized, and the predicted ∆Lᵢ values (computed under the assumption of dropping σᵢ from the current truncation point) become inconsistent. The ablation in Table 6 confirms this: drop the constraint and PPL explodes by orders of magnitude across all selection strategies.

2.4 One-step correction with re-truncation

After global truncation, let W'_k be the current rank-k factorization and ∆W = W − W'_k be the truncation residual. The paper proposes a single projected gradient step:

∆W' = (⟨g, ∆W⟩ / ⟨g, g⟩) · g, where g = ∇_W L(W'_k).

This is the minimum-Frobenius-norm perturbation of W'_k that matches the first-order loss change predicted by moving back to W. Because rank(∆W') = rank(g) and gradients near a pretrained solution are empirically low-rank, the updated W'_k + ∆W' has rank at most k + rank(g), which is small. Re-truncating back to rank k therefore incurs a small projection error.
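
A minimal sketch of this correction step, assuming plain NumPy matrices and a precomputed gradient g at the truncated point (illustrative, not the paper's code):

import numpy as np

def one_step_correction(W, W_k, g):
    """Minimum-Frobenius-norm update of W_k that matches the first-order loss change
    predicted by moving back to W: ΔW' = (<g, ΔW> / <g, g>) · g."""
    dW = W - W_k                                     # truncation residual
    coef = np.vdot(g, dW) / np.vdot(g, g)            # Frobenius inner products
    dW_prime = coef * g                              # rank(ΔW') ≤ rank(g)
    assert np.isclose(np.vdot(g, dW_prime), np.vdot(g, dW))   # first-order change is matched
    return W_k + dW_prime                            # caller then re-truncates to rank k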

Figure 3 of the paper validates the gradient-low-rank assumption on LLaMA-2-7B: at 20% compression, the effective-rank ratio k_{0.95}(G) / k_{0.95}(W') is in the range 0.001–0.4 across layers 1, 16, 32, with most modules near 0.05. Empirically, repeating the truncate–correct–re-truncate cycle 1, 5, or 10 times monotonically lowers perplexity, with the largest gains at aggressive compression ratios (0.4 → ZS-SVD 10× achieves PPL 18.49 vs ZS-SVD-1× at 26.92).

2.5 Storage remapping and HQ

The parameter ratio ρ = k(m+n) / mn used in vanilla SVD compression saturates before full rank: ρ = 1 corresponds to k = mn/(m+n) < min(m, n). This is awkward for high-retention regimes. Following Dobi-SVD, ZS-SVD also reports a remapped variant where ρ̃ = k / rank(W) is in 1-to-1 correspondence with k. Storage is realized via a packed format: assuming m ≥ n, an 8-bit copy of Vₖ is packed into the first n rows of Uₖ, yielding a footprint of k · max(m, n). Each pruned σ saves 2m bytes ≈ max(m, n) fp16-equivalent parameters, which ZS-SVD plugs into its budget accounting.
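
A toy accounting of the three footprints and the two ratios just described (hypothetical helper; assumes a full-rank square LLaMA-7B-sized projection and fp16-equivalent parameter counts):

def footprints(m, n, k):
    dense   = m * n                      # uncompressed weight
    vanilla = k * (m + n)                # separate U_k, V_k factors
    packed  = k * max(m, n)              # 8-bit V_k packed into the first n rows of U_k
    rho       = k * (m + n) / (m * n)    # vanilla parameter ratio (saturates before full rank)
    rho_remap = k / min(m, n)            # remapped ratio, 1-to-1 with k for full-rank W
    return dense, vanilla, packed, rho, rho_remap

print(footprints(4096, 4096, 1024))      # (16777216, 8388608, 4194304, 0.5, 0.25)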

For aggressive ≥50% compression, the paper proposes HQ (Half-prune + Quantize): instead of remapping, prune to half the target ratio and uniformly halve the bit-width of all target parameters (e.g., fp16 → fp8). Total footprint matches, but perplexity is much better in the high-compression regime. At 0.4 maintenance on LLaMA-7B, ZS-SVD† (HQ) achieves PPL 6.73, with only a 7.3% average-accuracy drop versus the uncompressed baseline.

2.6 Engineering choices

  • Calibration: 256 sequences × 2048 tokens from WikiText2 (same as SVD-LLM).
  • Compressed matrices: q/k/v/o + MLP linears only; embeddings and LayerNorm untouched.
  • Hardware: NVIDIA RTX PRO 6000 Blackwell Max-Q 96 GB for all timing measurements.
  • Single-pass: one backward pass, one Cholesky per layer, one SVD per layer. No iterative optimization on rank itself.

2.7 Pseudocode and complexity

The full procedure compresses neatly into one pseudocode block:

def zs_svd(model, calib_loader, target_budget):
    layers = list(model.linear_layers())

    # 1. Whitened SVD per layer
    S, A, U, Σ, V = {}, {}, {}, {}, {}
    for ℓ in layers:
        C_ℓ = activation_covariance(ℓ, calib_loader)
        S[ℓ] = cholesky(C_ℓ + λI)                    # whitening factor
        A[ℓ] = ℓ.W @ S[ℓ]                            # whitened weight
        U[ℓ], Σ[ℓ], V[ℓ] = svd(A[ℓ])

    # 2. One backward pass for all gradients
    L = calibration_loss(model, calib_loader)
    G = backward(L)                                  # G[ℓ] = ∇_{W_ℓ} L

    # 3. Per-σ first-order loss change
    ΔL = {}
    for ℓ in layers:
        H_ℓ = G[ℓ] @ inv(S[ℓ].T)                     # H = G S^{-T}
        for i in range(rank(A[ℓ])):
            ΔL[(ℓ, i)] = -Σ[ℓ][i] * (U[ℓ][:, i].T @ H_ℓ @ V[ℓ][:, i])

    # 4. Two-heap zero-sum greedy
    Q_pos = MinHeap(key=lambda x: abs(ΔL[x]))        # candidates with ΔL ≥ 0
    Q_neg = MinHeap(key=lambda x: abs(ΔL[x]))        # candidates with ΔL < 0
    cursor = {ℓ: rank(A[ℓ]) - 1 for ℓ in layers}     # index of the smallest σ still present
    for ℓ in layers:
        i = cursor[ℓ]
        (Q_pos if ΔL[(ℓ, i)] >= 0 else Q_neg).push((ℓ, i))

    s = 0
    removed = set()
    while saved_params(removed) < target_budget:
        prefer = Q_pos if s <= 0 else Q_neg          # corrective sign for the running drift
        if not prefer:                               # preferred heap empty → fall back to the other
            prefer = Q_neg if prefer is Q_pos else Q_pos
        ℓ, i = prefer.pop()
        s += ΔL[(ℓ, i)]
        removed.add((ℓ, i))
        cursor[ℓ] -= 1
        if cursor[ℓ] >= 0:                           # expose layer ℓ's next-smallest σ
            j = cursor[ℓ]
            (Q_pos if ΔL[(ℓ, j)] >= 0 else Q_neg).push((ℓ, j))

    # 5. Build compressed factors
    for ℓ in layers:
        keep = sorted(i for i in range(rank(A[ℓ])) if (ℓ, i) not in removed)
        Uk, Σk, Vk = U[ℓ][:, keep], diag(Σ[ℓ][keep]), V[ℓ][:, keep]
        ℓ.W_u = Uk @ sqrt(Σk)
        ℓ.W_v = sqrt(Σk) @ Vk.T @ inv(S[ℓ])

Per-step complexity:

  • Cholesky: O(n³) per layer. For LLaMA-7B with hidden = 4096, ≈ 7×10¹⁰ flops.
  • SVD: O(m·n²); randomized variants drop to O(m·n·k).
  • Backward: one full forward + backward over the 256×2048 calibration batch—same order as a regular training step.
  • uᵢᵀ H vᵢ inner products: O(m·n) per σ, but easily batched as a single matmul per layer.
  • Heap operations: O(N log N) over N ≈ 10⁶ total σ's in a 7B model (roughly 32 blocks × 7 matrices × 4096 each)—milliseconds.

So the wall-clock budget is dominated by SVD + backward, which is the same order as SVD-LLM. The 2× factor versus SVD-LLM comes from the extra backward pass and the per-σ scoring; everything else is identical.

2.8 Engineering comparison with Dobi-SVD

Dimension | Dobi-SVD | ZS-SVD
Rank decision | Backprop optimization | Global greedy + zero-sum drift
Iterations | Thousands of differentiable SVD steps | 1 backward + 1 heap greedy
Peak memory | High (IPCA backprop) | Moderate (one backward + one SVD)
LLaMA-7B end-to-end time | 19.25 hours | 15.9 minutes
Accuracy ceiling | High | Slightly higher
Interpretability | Black-box optimization | First-order expansion + conservation law

The practical win of ZS-SVD is not "fewer flops"—it's removing a long optimization loop entirely. From a deployment engineer's perspective, you take a published checkpoint, run a 15-minute job, and out comes a usable compressed model with no extra training data. That is qualitatively different from Dobi-SVD's "spin up a fine-tune-shaped workflow for a day."


3.5b A deeper read of the catastrophic ablation

I want to spend an extra paragraph on the "Most negative ∆L" entry in Table 6, because it reveals the operative engineering intuition behind ZS-SVD's design.

Imagine you are at W₀ and you do a Taylor expansion L(W₀ + ∆W) ≈ L(W₀) + ⟨g, ∆W⟩. This linear approximation is valid only within some neighborhood R (the linear regime). Once ‖∆W‖ exits R, higher-order terms dominate and the linear prediction becomes meaningless.

"Most negative ∆L" greedily pushes s further negative at every step—each decision adds a perturbation in roughly the same direction. The cumulative ‖∆W‖ grows like O(t), so after enough steps you leave R, and the predicted "we keep lowering loss" is decoupled from reality. Hence PPL of 160594.

Zero-sum picks alternating signs, producing a random-walk-like trajectory in parameter space. After t steps, expected ‖∆W‖ scales like O(√t), not O(t). On a 7B model with millions of σ's, this difference is many orders of magnitude—enough to keep cumulative perturbation inside R and the linear approximation valid throughout the run. This is the entire reason ZS-SVD works.
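
A toy numerical illustration of the same point, tracking only the scalar drift s = Σ ∆Lⱼ with synthetic |∆Lᵢ| magnitudes (purely illustrative, not the paper's code):

import numpy as np

# Corrective (zero-sum) sign choice keeps the running drift near zero, while
# always pushing in the same direction grows it linearly with the step count.
rng = np.random.default_rng(0)
magnitudes = np.abs(rng.standard_normal(100_000)) * 1e-4   # stand-ins for |ΔL_i|

greedy_drift = -np.cumsum(magnitudes)        # "most negative ΔL": drift grows linearly
s = 0.0
max_abs_s = 0.0
for mag in magnitudes:                       # zero-sum: corrective sign at every step
    s += mag if s <= 0 else -mag
    max_abs_s = max(max_abs_s, abs(s))

print(abs(greedy_drift[-1]))                 # grows with the number of pruned components
print(max_abs_s)                             # stays within roughly one step magnitude of zero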

There is a clean name for this design pattern in numerical optimization: conservation-based step control. It appears in stochastic optimal control and simulated annealing, but to my knowledge ZS-SVD is the first LLM compression method to use it explicitly. I expect the pattern to spread—any compression decision that can be linearized around the current operating point and that has a signed per-decision sensitivity should benefit from the same trick.


4.3 Practical deployment notes

For practitioners considering ZS-SVD in production, a few notes from reading the paper carefully:

  • What to recompute when the calibration data changes. Only the gradient pass and the per-σ scores need to be redone; the whitening and SVD factorizations are reusable. This means you can amortize most of the cost across multiple target compression ratios.
  • Choosing the budget. The paper's experiments suggest that without HQ or remapping, ZS-SVD is reliable down to 0.6 maintenance. Below that, plan on combining with quantization.
  • Memory during compression. Holding U, Σ, V for every linear layer is non-trivial on a 7B model—roughly 2× the model's parameter footprint. The 30B experiments would not fit in 80 GB without streaming.
  • What to skip. Embedding layers, LayerNorm, and the LM head should not go through ZS-SVD. The paper restricts to attention projections and MLP linears, which is the right default.
  • Calibration set size. 256 × 2048 = ~500K tokens is enough for stable C and G estimates. Going much smaller is risky; going larger has diminishing returns.

3. Experiments

3.1 Main table (Table 1, LLaMA-7B, maintenance ratios 0.8 / 0.6 / 0.4)

Ratio | Method | WikiText2 PPL | Avg Acc | Drop
1.0 | Baseline | 5.68 | 0.55 | 0.0%
0.8 | ASVD | 11.14 | 0.43 | 21.8%
0.8 | SVD-LLM | 7.94 | 0.44 | 20.0%
0.8 | Dobi-SVD | 8.54 | 0.46 | 16.4%
0.8 | ZS-SVD | 6.74 | 0.50 | 9.1%
0.8 | ZS-SVD 5× | 6.43 | 0.51 | 7.3%
0.8 | ZS-SVD* (remap) | 5.90 | 0.54 | 1.8%
0.6 | SVD-LLM | 13.11 | 0.37 | 32.7%
0.6 | Dobi-SVD | 13.54 | 0.38 | 30.9%
0.6 | ZS-SVD 5× | 9.45 | 0.42 | 23.6%
0.6 | ZS-SVD* (remap) | 6.96 | 0.50 | 9.1%
0.4 | SVD-LLM | 53.74 | 0.31 | 43.6%
0.4 | Dobi-SVD | 46.18 | 0.32 | 41.8%
0.4 | ZS-SVD 10× | 18.49 | 0.35 | 36.4%
0.4 | ZS-SVD† (HQ) | 6.73 | 0.51 | 7.3%

Three patterns stand out:

  1. Even without remapping, vanilla ZS-SVD beats SVD-LLM and Dobi-SVD across the board. The zero-sum rule alone is doing meaningful work.
  2. Remapping flattens the loss curve substantially because the packed storage format permits retention of more rank at the same byte budget.
  3. HQ dominates at extreme compression. Stacking moderate SVD with uniform half-precision is more stable than aggressive pure-SVD truncation. This is a useful operational lesson independent of ZS-SVD itself: in the deep-compression regime, mixing modalities beats maxing one out.

3.2 Against Dip-SVD (Table 2, 30% pruning)

Model | Method | WikiText2 PPL | PTB PPL | C4 PPL
LLaMA-7B | ASVD | 95.3 | 200.9 | 86.3
LLaMA-7B | FWSVD | 33.0 | 53.6 | 38.2
LLaMA-7B | SVD-LLM | 9.5 | 29.0 | 26.4
LLaMA-7B | Dip-SVD | 9.4 | 22.3 | 19.9
LLaMA-7B | ZS-SVD | 8.2 | 19.6 | 16.8
Vicuna-7B | SVD-LLM | 12.4 | 124.5 | 39.5
Vicuna-7B | Dip-SVD | 12.1 | 81.1 | 28.8
Vicuna-7B | ZS-SVD | 10.2 | 48.0 | 21.8

Dip-SVD is the closest competitor in spirit—it also uses gradient information to weight matrices, but via a heuristic Fisher-style per-matrix score. ZS-SVD's per-σ first-order score is strictly more granular and the gap is concrete: 20–40% PPL improvements on PTB / C4.

3.3 Against structured pruning (Tables 3 & 4)

On LLaMA-2-7B at 0.6 maintenance, ZS-SVD averages 0.45 vs LLM-Pruner 0.48, SliceGPT 0.51, Bonsai 0.53, Wanda-sp 0.50. With remapping (ZS-SVD*), ZS-SVD reaches 0.57, on par with or ahead of all structured baselines. At 0.4 with HQ, ZS-SVD† averages 0.59—exceeding every structured pruning method tested.

On LLaMA-13B at 0.8, ZS-SVD* averages 0.70, matching the uncompressed baseline. This is the regime where SVD-based methods historically lose ground to structured pruning because preserving the full attention/MLP structure matters for accuracy at moderate compression.

3.4 Cross-model scaling (Table 5, 20% pruning)

Method | OPT-6.7B PPL / Acc | Vicuna-7B PPL / Acc | LLaMA-30B PPL / Acc
Original | 10.86 / 0.52 | 6.78 / 0.56 | 4.10 / 0.61
ASVD | 82.00 / 0.32 | 16.23 / 0.33 | 6.74 / 0.44
SVD-LLM | 16.04 / 0.41 | 8.41 / 0.51 | 6.61 / 0.54
ZS-SVD | 11.40 / 0.51 | 8.08 / 0.53 | 4.83 / 0.59

The scaling story is clean: LLaMA-30B compresses essentially losslessly at 20% (drop of 0.02 average accuracy, PPL inflation by 0.73). I find this the most compelling result in the paper. SVD-based methods historically struggle at scale because errors compound through depth; the zero-sum rule's global drift constraint seems to inoculate against compounding.

3.5 Ablation: σ selection strategies (Table 6)

Strategy Per-W σ sort Ratio 0.4 PPL Ratio 0.6 PPL
Most negative ∆L 160594 373585
Magnitude of ∆L 341.3 88.7
Most negative ∆L 182452 369350
Magnitude of ∆L 51.8 12.0
Magnitude of σ 803599 32750
Zero-sum ∆L (ZS-SVD) 45.2 11.4

This is the most informative table in the paper. Three takeaways:

  1. "Greedily minimize ∆L" is catastrophic. Always picking the most-negative ∆L (predicted to reduce loss the most) blows up. This is because first-order expansions are only valid in a small neighborhood; once the cumulative drift grows, the linear model is no longer predictive. The zero-sum rule's contribution is to keep the operating point inside the validity region of the Taylor expansion by alternating signs.
  2. Per-W ordering is mandatory. Drop it and every strategy fails. The local Eckart–Young property is non-negotiable.
  3. Pure |σ| is also wrong. Ignoring the loss landscape and only looking at whitened-space magnitudes misses the cross-layer error propagation through nonlinearities.

These three facts together pin down why the design is the way it is.

3.6 Inference efficiency (Table 7)

On RTX A5000, ZS-SVD at 60% compression delivers 762.34 tokens/s vs 130.03 for the uncompressed baseline (5.86×), while consuming less peak memory than uncompressed. On the memory-limited Titan Xp (12 GB), where uncompressed requires CPU offloading, ZS-SVD at 60% achieves 2.61× over the offloaded baseline and has lower peak memory than Dobi-SVD. These wins are hardware-agnostic—pure BLAS, no custom kernels.

3.7 Truncation time (Table 8)

Method | Truncation time | LLaMA-7B WikiText2 PPL
SVD-LLM | 7.9 min | 53.74
Dobi-SVD | 19.25 hr | 46.18
ZS-SVD | 15.9 min | 45.17

This is the practical headline. ZS-SVD pays roughly 2× SVD-LLM's time but achieves Dobi-SVD-grade or better accuracy, and skips the 73× time penalty of Dobi-SVD's continuous rank optimization.


4. Discussion

Where does this paper sit in the LLM compression taxonomy?

LLM compression
├── Quantization (PTQ / QAT) → GPTQ, AWQ, SmoothQuant, QuIP, SpinQuant
├── Structured pruning → LLM-Pruner, SliceGPT, Wanda, FLAP
└── Low-rank factorization (SVD-based)
    ├── No activation info: FWSVD (Fisher-weighted)
    ├── Local activation reconstruction: ASVD, SVD-LLM (whitened)
    ├── Loss-aware via heuristic: Dip-SVD (per-matrix Fisher)
    ├── Loss-aware via differentiable optimization: Dobi-SVD
    └── Loss-aware via global greedy + zero-sum drift: ZS-SVD ← here

ZS-SVD's contribution is the substitution of a combinatorial rank-allocation problem with a scalar conservation rule. There is precedent for this in numerical optimization (e.g., dual control of sum constraints, lazy regime in NTK analyses) and even in distributed systems (e.g., balanced communication in PowerSGD), but I am not aware of a direct analog in LLM compression. The fact that this substitution works as well as it does is non-trivial—the ablations show that small variants of the rule (greedy negative-∆L, |σ|-only, etc.) all fail dramatically.

I would highlight three structural properties that make the rule succeed:

  1. The signed nature of ∆Lᵢ. Some singular values are predicted to reduce the loss when pruned. Without this sign information, you cannot define a zero-sum rule at all.
  2. The per-layer σ ordering constraint. This preserves the Eckart–Young optimality that the whitened-space SVD provides locally. Without it, the global rule becomes incoherent.
  3. The reliance on first-order validity. The rule works precisely because s stays near zero, which keeps cumulative drift inside the Taylor expansion's validity region.

If you drop any of these three, the method falls apart. This makes ZS-SVD a tight, brittle construction in a good way: it's not a bag of tricks; it's three locked-together design choices.

4.1 What could break

  • Calibration data distribution shift. All ∆Lᵢ values are computed on WikiText2. If your deployment is on code, dialogue, or non-English text, the signs may flip. The paper does not study cross-domain robustness, which I see as the most important open question.
  • Compounding past the linear regime at very aggressive ratios. At 0.4 maintenance, vanilla ZS-SVD already loses significantly to the HQ variant. The first-order assumption is being stretched, and the zero-sum rule cannot fix that on its own.
  • Interaction with quantization is empirical. HQ and remapping work, but the paper does not solve the joint allocation problem (per-layer rank × per-layer bit-width). This is explicitly flagged as future work.
  • Gradient low-rankness is an assumption. Figure 3 supports it on LLaMA-2-7B, but the spread (0.001 to 0.4 across layers) is wide. For models trained differently—particularly those with low-rank adapters merged in, or aggressive distillation—this could shift.

4.2 Where I would push next

  • Apply the same global-zero-sum machinery to structured pruning. The signed first-order ∆L is well-defined for any parameter group, not just singular values. "Zero-Sum Pruning" could replace per-layer importance scoring with a global drift-controlled greedy.
  • Joint (rank, bit-width) allocation. Both are integer-valued per-layer budgets; both admit first-order sensitivity scoring; the natural extension is a 2D heap with a zero-sum rule across both axes.
  • A multi-task calibration variant: average ∆Lᵢ across several calibration sets to control distribution-shift risk.

5. Limitations and reproducibility

5.1 Limitations

  • Calibration distribution dependence (see 4.1).
  • First-order validity stretched at ratio ≤ 0.4 without quantization stacking.
  • The truncate–correct–re-truncate cycle assumes gradient low-rankness; the assumption can fail in some training regimes.
  • Uniform treatment of attention vs MLP matrices; no head-wise or expert-wise structure exploited.
  • No discussion of interaction with KV-cache compression, which is a closely related downstream concern.

5.2 Reproducibility

  • Code: https://github.com/mint-vu/Zero-Sum-SVD (referenced in the abstract).
  • Calibration: 256 × 2048 WikiText2 sequences, identical to SVD-LLM.
  • Evaluation: lm-eval-harness, standard splits of OpenBookQA, ARC-E/C, WinoGrande, HellaSwag, PIQA, MathQA.
  • Models: LLaMA-7B / 2-7B / 13B / 30B, Vicuna-7B, OPT-6.7B—all open checkpoints.
  • Hardware: a single 96 GB RTX PRO 6000 Blackwell Max-Q. A 7B–13B reproduction should fit on a standard 80 GB A100; 30B may need light offloading.
  • Timing comparison is on the same hardware as SVD-LLM and Dobi-SVD, which is unusual rigor in this subfield.

Overall this is well above average reproducibility for compression papers. My recommended next-step replication: try ZS-SVD on a non-English calibration set (e.g., the multilingual subset of C4) and report the sign-flip rate of ∆Lᵢ relative to the WikiText2 calibration. That single experiment would resolve much of the open distribution-shift question.


References

  • Abbasi et al., Zero Sum SVD: Balancing Loss Sensitivity for Low Rank LLM Compression, arXiv 2602.02848v1 (2026).
  • Wang et al., SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression, ICLR 2025.
  • Qinsi et al., Dobi-SVD: Differentiable SVD for LLM Compression and Some New Perspectives, ICLR 2025.
  • Yuan et al., ASVD: Activation-aware Singular Value Decomposition for Compressing LLMs, arXiv 2312.05821 (2025).
  • Ding et al., Dip-SVD: Dual-importance Protected SVD for Efficient LLM Compression, arXiv 2506.20353 (2025).
  • Hsu et al., Language Model Compression with Weighted Low-Rank Factorization (FWSVD), ICLR 2022.
  • Hu et al., LoRA: Low-Rank Adaptation of Large Language Models, ICLR 2022.
  • Zhao et al., GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection, ICML 2024.
  • Balzano et al., An Overview of Low-Rank Structures in the Training and Adaptation of Large Models, arXiv 2503.19859 (2025).
  • Frantar et al., OPTQ: Accurate Quantization for Generative Pre-trained Transformers (GPTQ), ICLR 2023.
  • Sun et al., A Simple and Effective Pruning Approach for Large Language Models (Wanda), ICLR 2024.