ICLR 2026

CARE: Covariance-Aware and Rank-Enhanced
Decomposition for Enabling Multi-Head Latent Attention

Converting pretrained GQA/MHA models to MLA at KV-parity through activation-aware factorization and adjusted-rank scheduling — no extra KV-cache cost.

1University of Sydney    2KAUST    3Together AI    4UT Austin
* Equal advising

Highlights: 215× perplexity reduction over baselines · +21 mean accuracy points improvement · 93.75% max KV-cache savings · 4 model families validated
Abstract

TL;DR: CARE converts pretrained GQA/MHA to MLA at KV-parity via covariance-aware SVD and adjusted-rank allocation, reducing perplexity up to 215× and improving accuracy up to 21 points over baselines on Llama-3.1-8B/70B and Qwen3-4B/30B-A3B.

Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers — causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged.


CARE in One Comic

A quick visual intuition for why naive MLA transfer can fail, and where CARE changes the story.

This comic compresses the paper's motivation into one quick read: direct SVD can leave an MLA gap, naive joint truncation can amplify instability, and covariance-aware guidance gives the decomposition a much better direction.

Direct SVD leaves a gap: weight-only factorization can miss the activation directions that matter during inference.
Joint truncation compounds error: compressing K and V together without guidance can create unpredictable degradation.
CARE adds covariance guidance: the conversion is steered toward activation-preserving structure rather than raw weight similarity.



Key Contributions
Contribution 1

Activation-Aware Initialization

Covariance-aware factorization that minimizes activation error $\|XW - X\hat{W}\|$ (rather than weight error), via SVD on a whitened operator $\sqrt{C}W$ and subsequent unwhitening. This preserves attention logits more faithfully.

Contribution 2

Rank-Adaptive Scheduling

Singular-value-guided allocation that distributes rank unevenly across layers and K/V matrices under a fixed KV width, matching spectral difficulty and improving zero-shot fidelity vs. uniform ranks.

Contribution 3

KV-Parity Mapping & Pipeline

A practical KV-parity reparameterization for MLA conversion. CARE-converted models exhibit lower activation error and improved task quality over naive SVD baselines at equal KV cost.


Method

CARE addresses two fundamental shortcomings of naive SVD-based MLA transfer through a principled three-step pipeline.

CARE Pipeline Overview

Figure 1. (a) Naive MLA transfer: jointly factorize $W_K$ and $W_V$ by SVD and truncate to a uniform per-layer rank. (b) CARE: estimate activation covariance $C$, factorize $\sqrt{C}W$, unwhiten via $\sqrt{C}^{-1}$ to initialize MLA factors, and use the singular spectrum for global dynamic rank scheduling under KV parity.

1

Activation Covariance Estimation

Collect input activations from a small calibration set (256 samples) and compute the per-layer covariance matrix $C^{(l)} = \frac{1}{N}\sum_{b}(X_b^{(l)})^\top X_b^{(l)}$, which captures how the model actually uses each projection during inference.
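As a concrete illustration, the streaming covariance accumulation above can be sketched in NumPy. The batch count, token counts, and hidden size here are toy assumptions for the example, not the paper's actual calibration setup:

```python
import numpy as np

def estimate_covariance(activation_batches):
    """Accumulate the per-layer input covariance C = (1/N) * sum_b X_b^T X_b
    from a small calibration set, in a single streaming pass."""
    C, n = None, 0
    for X in activation_batches:   # X: (tokens, d_model) activations for one batch
        G = X.T @ X                # Gram matrix of this batch
        C = G if C is None else C + G
        n += X.shape[0]
    return C / n

# Toy calibration run: 4 batches of 64 "token" activations, hidden size 16.
rng = np.random.default_rng(0)
batches = [rng.standard_normal((64, 16)) for _ in range(4)]
C = estimate_covariance(batches)
```

Because $C$ is built from Gram matrices, it is symmetric positive semi-definite by construction, which is what makes the whitening in the next step well defined.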

2

Covariance-Aware SVD

Instead of factoring $W$ directly, apply SVD to the whitened matrix $\sqrt{C}W = U\Sigma V^\top$, then recover $\hat{W} = \sqrt{C}^{-1}U_r\Sigma_r V_r^\top$. This minimizes the activation error rather than weight-space error, preserving dominant activation directions.
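A minimal NumPy sketch of this whiten, truncate, unwhiten recipe follows; the synthetic calibration data, anisotropic feature scaling, and toy matrix sizes are illustrative assumptions. The comparison at the end checks the key property: at the same rank, the covariance-aware factorization attains lower activation error than a plain weight-space SVD:

```python
import numpy as np

def care_factorize(W, C, r, eps=1e-6):
    """Covariance-aware rank-r factorization (sketch of Step 2).
    Whiten W by sqrt(C), truncate its SVD, then unwhiten, so truncation
    minimizes activation error ||XW - X W_hat||_F, not weight error."""
    # Symmetric square root of C (and its inverse) via eigendecomposition.
    evals, evecs = np.linalg.eigh(C)
    evals = np.clip(evals, eps, None)        # guard against tiny eigenvalues
    C_sqrt = (evecs * np.sqrt(evals)) @ evecs.T
    C_isqrt = (evecs / np.sqrt(evals)) @ evecs.T
    U, S, Vt = np.linalg.svd(C_sqrt @ W, full_matrices=False)
    A = C_isqrt @ U[:, :r] * S[:r]           # down-projection factor
    B = Vt[:r]                               # up-projection factor
    return A, B                              # W_hat = A @ B

rng = np.random.default_rng(1)
X = rng.standard_normal((512, 32)) * np.linspace(0.2, 3.0, 32)  # anisotropic inputs
W = rng.standard_normal((32, 24))
C = X.T @ X / len(X)
A, B = care_factorize(W, C, r=8)

# Plain weight-space SVD at the same rank, for comparison.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_svd = (U[:, :8] * S[:8]) @ Vt[:8]

err_care = np.linalg.norm(X @ W - X @ (A @ B))
err_svd = np.linalg.norm(X @ W - X @ W_svd)
```

Since $\sqrt{C}$ is invertible after the eigenvalue floor, truncating the SVD of $\sqrt{C}W$ yields the rank-$r$ matrix minimizing $\|\sqrt{C}(W-\hat{W})\|_F$, which is proportional to the activation error on the calibration data.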

3

Adjusted-Rank Scheduling

Distribute a fixed KV budget across layers using greedy water-filling: score each layer by its normalized residual reduction and allocate more rank to spectrally complex layers while compressing easy layers more aggressively.
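One plausible reading of this greedy water-filling rule can be sketched as follows; the per-layer singular-value spectra (assumed to come from the whitened operators of Step 2), the normalization choice, and the toy spectra are assumptions of this sketch, not the paper's exact scoring function:

```python
import numpy as np

def greedy_rank_schedule(spectra, total_budget, r_min=1):
    """Greedy water-filling sketch of Step 3: spend a global rank budget
    one unit at a time on the layer whose next singular value removes the
    largest *normalized* share of that layer's residual spectral energy."""
    L = len(spectra)
    ranks = [r_min] * L
    totals = [float(np.sum(s**2)) for s in spectra]   # per-layer spectral energy
    for _ in range(total_budget - r_min * L):
        # Normalized residual reduction from granting one more rank to layer l.
        gains = [
            (spectra[l][ranks[l]] ** 2) / totals[l]
            if ranks[l] < len(spectra[l]) else -1.0
            for l in range(L)
        ]
        ranks[int(np.argmax(gains))] += 1
    return ranks

# Toy spectra: layer 0 decays slowly (spectrally hard), layer 1 fast (easy).
hard = 1.0 / np.sqrt(np.arange(1, 65))
easy = np.exp(-0.5 * np.arange(64))
ranks = greedy_rank_schedule([hard, easy], total_budget=32)
```

Under this rule the slowly decaying layer absorbs most of the budget while the fast-decaying layer is compressed aggressively, mirroring the layer-sensitivity pattern in Figure 2.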

Motivation: layer sensitivity and singular value importance

Figure 2. (a) Accuracy under 50% rank reduction applied one layer at a time — sensitivity is strongly layer-dependent. (b) WikiText perplexity under grouped singular-value truncation — the non-monotone degradation shows that singular values alone are an imperfect proxy for MLA conversion quality.


Results

One-shot comparison on Llama-3.1-8B-Instruct and Qwen3-4B-Instruct at various rank budgets (calibration: 256 samples, seq len: 2048, Alpaca).

Llama-3.1-8B-Instruct

| Rank | KV Save | Method | PPL ↓ | ARC ↑ | ARE ↑ | Hella ↑ | PIQA ↑ | MMLU ↑ | OBQA ↑ | AVG ↑ |
|------|---------|--------|-------|-------|-------|---------|--------|--------|--------|-------|
| —    | —   | GQA (Original) | 7.21  | 50.34 | 80.18 | 60.15 | 79.65 | 48.05 | 34.80 | 58.24 |
| 256  | 75% | MHA2MLA        | 1633  | 27.47 | 25.63 | 26.50 | 52.39 | 23.10 | 28.20 | 32.08 |
| 256  | 75% | SVD-LLM V2     | 47.28 | 36.26 | 60.69 | 56.39 | 72.20 | 46.57 | 33.40 | 50.13 |
| 256  | 75% | CARE-E (Ours)  | 49.43 | 38.48 | 60.06 | 60.54 | 72.42 | 54.29 | 33.60 | 52.62 |
| 512  | 50% | MHA2MLA        | 220   | 25.94 | 40.95 | 39.27 | 61.37 | 25.54 | 26.60 | 37.88 |
| 512  | 50% | SVD-LLM V2     | 9.63  | 52.39 | 76.68 | 73.57 | 78.45 | 62.31 | 40.40 | 62.21 |
| 512  | 50% | CARE-U (Ours)  | 9.64  | 52.73 | 76.30 | 73.98 | 78.73 | 62.17 | 40.60 | 62.33 |

Qwen3-4B-Instruct-2507

| Rank | KV Save | Method | PPL ↓ | ARC ↑ | ARE ↑ | Hella ↑ | PIQA ↑ | MMLU ↑ | OBQA ↑ | AVG ↑ |
|------|---------|--------|-------|-------|-------|---------|--------|--------|--------|-------|
| —    | —   | GQA (Original) | 10.04  | 55.89 | 83.12 | 52.65 | 76.01 | 73.37 | 32.00 | 60.30 |
| 256  | 75% | MHA2MLA        | 44509  | 22.18 | 28.49 | 28.92 | 52.45 | 23.00 | 24.60 | 31.94 |
| 256  | 75% | SVD-LLM V2     | 22.88  | 44.54 | 67.17 | 57.77 | 70.78 | 52.81 | 35.60 | 53.12 |
| 256  | 75% | CARE-U (Ours)  | 22.08  | 46.42 | 68.90 | 59.16 | 71.55 | 54.76 | 36.40 | 54.33 |
| 512  | 50% | MHA2MLA        | 100.99 | 27.05 | 41.08 | 37.97 | 59.19 | 29.14 | 27.20 | 38.15 |
| 512  | 50% | SVD-LLM V2     | 11.88  | 54.61 | 77.44 | 68.53 | 75.68 | 67.65 | 39.80 | 61.14 |
| 512  | 50% | CARE-U (Ours)  | 12.03  | 54.95 | 77.23 | 69.24 | 76.22 | 67.46 | 40.00 | 61.62 |
Rank profiles across calibration corpora

Figure 3. Covariance-aware rank profiles for Llama-3.1-8B-Instruct across calibration corpora (Alpaca). Both $W_K$ and $W_V$ show a depth-dependent pattern — larger ranks in early layers, decreasing with depth — suggesting the profile is model-intrinsic.

Accuracy vs calibration samples

Figure 4. One-shot accuracy versus calibration samples and sequence length. Performance saturates beyond 512 samples, and CARE achieves strong results even with minimal calibration data (256 samples, seq len 32).


100% MLA Recovery with Healing

With a brief post-SVD "healing" fine-tune, CARE fully recovers the original model's accuracy.

| Method | Tokens | ARC ↑ | ARE ↑ | Hella ↑ | PIQA ↑ | MMLU ↑ | OBQA ↑ | WG ↑ | AVG ↑ |
|--------|--------|-------|-------|---------|--------|--------|--------|------|-------|
| GQA (Original)              | — | 50.34 | 80.18 | 60.15 | 79.65 | 48.05 | 34.80 | 72.69 | 58.24 |
| CARE One-Shot               | 0B | 52.73 | 76.30 | 73.98 | 78.73 | 62.17 | 40.60 | 72.61 | 62.33 |
| Palu (SVD)                  | 3B | 44.56 | 74.96 | 52.20 | 76.63 | 61.08 | 30.40 | 65.42 | 56.30 |
| TransMLA                    | 3B | 53.77 | 82.34 | 56.44 | 80.70 | 70.23 | 33.30 | 72.47 | 61.86 |
| TransMLA + CARE Init (Ours) | 3B | 51.75 | 80.73 | 64.45 | 83.23 | 71.57 | 34.00 | 74.09 | 63.27 |

Citation

If you find this work useful, please cite our paper.

@article{zhou2025care,
  title={CARE: Covariance-Aware and Rank-Enhanced Decomposition
         for Enabling Multi-Head Latent Attention},
  author={Zhou, Zhongzhu and Bie, Fengxiang and Chen, Ziyan
          and Zhang, Zhenyu and Yang, Yibo and Wang, Junxiong
          and Athiwaratkun, Ben and Wu, Xiaoxia
          and Song, Shuaiwen Leon},
  journal={arXiv preprint arXiv:2507.12738},
  year={2025}
}