Converting pretrained GQA/MHA models to MLA at KV-parity through activation-aware factorization and adjusted-rank scheduling — no extra KV-cache cost.
¹University of Sydney · ²KAUST · ³Together AI · ⁴UT Austin · *Equal advising
TL;DR: CARE converts pretrained GQA/MHA to MLA at KV-parity via covariance-aware SVD and adjusted-rank allocation, reducing perplexity up to 215× and improving accuracy up to 21 points over baselines on Llama-3.1-8B/70B and Qwen3-4B/30B-A3B.
Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) with uniform rank allocation: they minimize the difference between weight matrices rather than the effect of those weights on input activations, ignore the covariance structure of the activations, and enforce the same rank at every layer, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than the weights alone; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers, giving more capacity to the layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V projections into the MLA format while keeping the KV-cache size unchanged.
A quick visual intuition for why naive MLA transfer can fail, and where CARE changes the story.
This comic compresses the paper's motivation into one quick read: direct SVD can leave an MLA gap, naive joint truncation can amplify instability, and covariance-aware guidance gives the decomposition a much better direction.
Covariance-aware factorization that minimizes activation error $\|XW - X\hat{W}\|$ (rather than weight error), via SVD on a whitened operator $\sqrt{C}W$ and subsequent unwhitening. This preserves attention logits more faithfully.
Singular-value-guided allocation that distributes rank unevenly across layers and K/V matrices under a fixed KV width, matching spectral difficulty and improving zero-shot fidelity vs. uniform ranks.
A practical KV-parity reparameterization for MLA conversion. CARE-converted models exhibit lower activation error and improved task quality over naive SVD baselines at equal KV cost.
CARE addresses two fundamental shortcomings of naive SVD-based MLA transfer through a principled three-step pipeline.
Figure 1. (a) Naive MLA transfer: jointly factorize $W_K$ and $W_V$ by SVD and truncate to a uniform per-layer rank. (b) CARE: estimate activation covariance $C$, factorize $\sqrt{C}W$, unwhiten via $\sqrt{C}^{-1}$ to initialize MLA factors, and use the singular spectrum for global dynamic rank scheduling under KV parity.
Collect input activations from a small calibration set (256 samples) and compute the per-layer covariance matrix $C^{(l)} = \frac{1}{N}\sum_{b}(X_b^{(l)})^\top X_b^{(l)}$, which captures how the model actually uses each projection during inference.
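A minimal NumPy sketch of this calibration step; the helper name and the toy batches are illustrative, not the paper's code:

```python
import numpy as np

def estimate_covariance(calib_activations):
    """Accumulate C = (1/N) * sum_b X_b^T X_b over calibration batches.

    `calib_activations` is a list of (tokens, d) activation matrices
    captured at one layer's projection input; N counts all tokens."""
    d = calib_activations[0].shape[1]
    C = np.zeros((d, d))
    n = 0
    for X in calib_activations:
        C += X.T @ X
        n += X.shape[0]
    return C / n

# Toy calibration set standing in for 256 real samples.
rng = np.random.default_rng(0)
batches = [rng.standard_normal((32, 8)) for _ in range(4)]
C = estimate_covariance(batches)  # symmetric positive semi-definite, shape (8, 8)
```

The resulting $C$ is symmetric positive semi-definite by construction, which is what makes the square root and whitening in the next step well defined.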
Instead of factoring $W$ directly, apply SVD to the whitened matrix $\sqrt{C}W = U\Sigma V^\top$, then recover $\hat{W} = \sqrt{C}^{-1}U_r\Sigma_r V_r^\top$. This minimizes the activation error rather than weight-space error, preserving dominant activation directions.
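The whitened factorization above can be sketched as follows, assuming a full-rank $C$; `care_factorize` is an illustrative name, and the comparison against plain truncated SVD simply demonstrates the smaller activation error:

```python
import numpy as np

def care_factorize(W, C, r, eps=1e-8):
    """Rank-r approximation minimizing activation error ||XW - X W_hat||_F:
    SVD the whitened operator sqrt(C) @ W, truncate to rank r, unwhiten."""
    evals, evecs = np.linalg.eigh(C)              # C is symmetric PSD
    evals = np.clip(evals, eps, None)             # guard near-singular directions
    C_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    C_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    U, S, Vt = np.linalg.svd(C_half @ W, full_matrices=False)
    return C_half_inv @ (U[:, :r] * S[:r]) @ Vt[:r]

# Anisotropic inputs make the covariance structure matter.
rng = np.random.default_rng(1)
X = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 8))
C = X.T @ X / X.shape[0]
W = rng.standard_normal((8, 6))

U, S, Vt = np.linalg.svd(W, full_matrices=False)  # plain weight-space SVD
W_svd = (U[:, :3] * S[:3]) @ Vt[:3]
err_plain = np.linalg.norm(X @ W - X @ W_svd)
err_care = np.linalg.norm(X @ W - X @ care_factorize(W, C, 3))
assert err_care <= err_plain + 1e-6  # covariance-aware wins on activation error
```

Because $\|X(W-\hat{W})\|_F^2 = N\,\|\sqrt{C}(W-\hat{W})\|_F^2$ when $C = \frac{1}{N}X^\top X$, truncating the SVD of $\sqrt{C}W$ is exactly the rank-$r$ minimizer of the activation error, so it can never do worse than weight-space truncation.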
Distribute a fixed KV budget across layers using greedy water-filling: score each layer by its normalized residual reduction and allocate more rank to spectrally complex layers while compressing easy layers more aggressively.
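A greedy water-filling loop of this kind can be sketched as below; the normalized-energy scoring rule is an assumption for illustration and may differ from the paper's exact criterion:

```python
import numpy as np

def allocate_ranks(spectra, budget):
    """Greedy water-filling: each step grants one unit of rank to the layer
    whose next singular value removes the most normalized residual energy."""
    ranks = [0] * len(spectra)
    # Marginal gain of component j in layer i: sigma_j^2 / sum_k sigma_k^2.
    gains = [s**2 / np.sum(s**2) for s in spectra]
    for _ in range(budget):
        best = max(range(len(spectra)),
                   key=lambda i: gains[i][ranks[i]] if ranks[i] < len(gains[i]) else -1.0)
        ranks[best] += 1
    return ranks

# A spiky spectrum is satisfied with one rank; a flat one soaks up the rest.
spectra = [np.array([10.0, 1.0, 0.1]), np.array([5.0, 4.0, 3.0])]
print(allocate_ranks(spectra, 3))  # [1, 2]
```

The total budget is conserved by construction, so the allocation stays within the fixed KV width while shifting capacity toward spectrally complex layers.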
One-shot comparison on Llama-3.1-8B-Instruct and Qwen3-4B-Instruct at various rank budgets (calibration: 256 samples, seq len: 2048, Alpaca).
| Rank | KV Save | Method | PPL ↓ | ARC-C ↑ | ARC-E ↑ | Hella ↑ | PIQA ↑ | MMLU ↑ | OBQA ↑ | AVG ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | | | | | | | | | | |
| — | — | GQA (Original) | 7.21 | 50.34 | 80.18 | 60.15 | 79.65 | 48.05 | 34.80 | 58.24 |
| 256 | 75% | MHA2MLA | 1633 | 27.47 | 25.63 | 26.50 | 52.39 | 23.10 | 28.20 | 32.08 |
| | | SVD-LLM V2 | 47.28 | 36.26 | 60.69 | 56.39 | 72.20 | 46.57 | 33.40 | 50.13 |
| | | CARE-E (Ours) | 49.43 | 38.48 | 60.06 | 60.54 | 72.42 | 54.29 | 33.60 | 52.62 |
| 512 | 50% | MHA2MLA | 220 | 25.94 | 40.95 | 39.27 | 61.37 | 25.54 | 26.60 | 37.88 |
| | | SVD-LLM V2 | 9.63 | 52.39 | 76.68 | 73.57 | 78.45 | 62.31 | 40.40 | 62.21 |
| | | CARE-U (Ours) | 9.64 | 52.73 | 76.30 | 73.98 | 78.73 | 62.17 | 40.60 | 62.33 |
| Qwen3-4B-Instruct-2507 | | | | | | | | | | |
| — | — | GQA (Original) | 10.04 | 55.89 | 83.12 | 52.65 | 76.01 | 73.37 | 32.00 | 60.30 |
| 256 | 75% | MHA2MLA | 44509 | 22.18 | 28.49 | 28.92 | 52.45 | 23.00 | 24.60 | 31.94 |
| | | SVD-LLM V2 | 22.88 | 44.54 | 67.17 | 57.77 | 70.78 | 52.81 | 35.60 | 53.12 |
| | | CARE-U (Ours) | 22.08 | 46.42 | 68.90 | 59.16 | 71.55 | 54.76 | 36.40 | 54.33 |
| 512 | 50% | MHA2MLA | 100.99 | 27.05 | 41.08 | 37.97 | 59.19 | 29.14 | 27.20 | 38.15 |
| | | SVD-LLM V2 | 11.88 | 54.61 | 77.44 | 68.53 | 75.68 | 67.65 | 39.80 | 61.14 |
| | | CARE-U (Ours) | 12.03 | 54.95 | 77.23 | 69.24 | 76.22 | 67.46 | 40.00 | 61.62 |
With a brief post-SVD "healing" fine-tune, CARE recovers, and on average exceeds, the original model's accuracy.
| Method | Tokens | ARC-C ↑ | ARC-E ↑ | Hella ↑ | PIQA ↑ | MMLU ↑ | OBQA ↑ | WG ↑ | AVG ↑ |
|---|---|---|---|---|---|---|---|---|---|
| GQA (Original) | — | 50.34 | 80.18 | 60.15 | 79.65 | 48.05 | 34.80 | 72.69 | 58.24 |
| CARE One-Shot | 0B | 52.73 | 76.30 | 73.98 | 78.73 | 62.17 | 40.60 | 72.61 | 62.33 |
| Palu (SVD) | 3B | 44.56 | 74.96 | 52.20 | 76.63 | 61.08 | 30.40 | 65.42 | 56.30 |
| TransMLA | 3B | 53.77 | 82.34 | 56.44 | 80.70 | 70.23 | 33.30 | 72.47 | 61.86 |
| TransMLA + CARE Init (Ours) | 3B | 51.75 | 80.73 | 64.45 | 83.23 | 71.57 | 34.00 | 74.09 | 63.27 |
If you find this work useful, please cite our paper.
```bibtex
@article{zhou2025care,
  title={CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention},
  author={Zhou, Zhongzhu and Bie, Fengxiang and Chen, Ziyan and Zhang, Zhenyu and Yang, Yibo and Wang, Junxiong and Athiwaratkun, Ben and Wu, Xiaoxia and Song, Shuaiwen Leon},
  journal={arXiv preprint arXiv:2507.12738},
  year={2025}
}
```