DeepSeek-V2: Multi-head Latent Attention and DeepSeekMoE — Detailed Technical Review
Paper: DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Authors: DeepSeek-AI
Affiliation: DeepSeek
Published: May 2024 (arXiv: 2405.04434)
Reviewer: Zhongzhu Zhou
Review Date: February 18, 2026
I. Prerequisites: What You Need to Know
This section covers every foundational concept needed to understand DeepSeek-V2's innovations. We will build from basic attention mechanics all the way to the KV cache bottleneck and low-rank compression.
1.1 The Transformer and Multi-Head Attention (MHA)
The Transformer architecture is the backbone of virtually all modern large language models. At its core, each Transformer block has two components: an attention module and a Feed-Forward Network (FFN).
The attention mechanism allows the model to "look at" all previous tokens when generating the next one. Specifically, Multi-Head Attention (MHA) works as follows:
Given the input hidden state $\mathbf{h}_t \in \mathbb{R}^d$ for the $t$-th token (where $d$ is the embedding dimension), MHA computes:

$$\mathbf{q}_t = W^Q \mathbf{h}_t, \quad \mathbf{k}_t = W^K \mathbf{h}_t, \quad \mathbf{v}_t = W^V \mathbf{h}_t$$

where $W^Q, W^K, W^V \in \mathbb{R}^{d_h n_h \times d}$ are projection matrices, $n_h$ is the number of attention heads, and $d_h$ is the dimension per head.
These vectors are split into $n_h$ heads:

$$\mathbf{q}_t = [\mathbf{q}_{t,1}; \ldots; \mathbf{q}_{t,n_h}], \quad \mathbf{k}_t = [\mathbf{k}_{t,1}; \ldots; \mathbf{k}_{t,n_h}], \quad \mathbf{v}_t = [\mathbf{v}_{t,1}; \ldots; \mathbf{v}_{t,n_h}]$$

Each head independently computes attention:

$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^\top \mathbf{k}_{j,i}}{\sqrt{d_h}}\right) \mathbf{v}_{j,i}$$

The head outputs are concatenated and projected:

$$\mathbf{u}_t = W^O [\mathbf{o}_{t,1}; \ldots; \mathbf{o}_{t,n_h}]$$
Why it works: Each head can attend to different aspects of the input (syntactic relationships, semantic similarity, positional patterns). The multi-head structure gives the model a rich, multi-faceted view of the context.
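The decode-time computation above can be sketched in a few lines of NumPy. This is a minimal single-token decode step with hypothetical toy dimensions, not the paper's implementation:

```python
import numpy as np

def mha_decode_step(h_t, K_cache, V_cache, Wq, Wk, Wv, Wo, n_h, d_h):
    """One autoregressive decode step of multi-head attention.

    h_t: (d,) current hidden state; K_cache, V_cache: (T, n_h, d_h)
    keys/values cached from the T previous tokens.
    """
    q = (Wq @ h_t).reshape(n_h, d_h)            # per-head queries
    k = (Wk @ h_t).reshape(n_h, d_h)            # this token's key
    v = (Wv @ h_t).reshape(n_h, d_h)            # this token's value
    K = np.concatenate([K_cache, k[None]], axis=0)   # grow the cache
    V = np.concatenate([V_cache, v[None]], axis=0)
    scores = np.einsum('hd,thd->ht', q, K) / np.sqrt(d_h)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)           # softmax over positions
    o = np.einsum('ht,thd->hd', w, V)           # per-head outputs
    return Wo @ o.reshape(-1), K, V             # concat heads + project

d, n_h, d_h, T = 16, 4, 4, 3
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((n_h * d_h, d)) for _ in range(3))
Wo = rng.standard_normal((d, n_h * d_h))
u_t, K, V = mha_decode_step(rng.standard_normal(d),
                            rng.standard_normal((T, n_h, d_h)),
                            rng.standard_normal((T, n_h, d_h)),
                            Wq, Wk, Wv, Wo, n_h, d_h)
```

Note how the K and V caches grow by one entry per decode step; this is exactly the memory cost analyzed in the next section.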
1.2 The KV Cache Problem
During text generation (inference), the model produces tokens one at a time. At each step, it must compute attention over all previous tokens. Naively, this means recomputing keys and values for every previous token at every step, an $O(T^2)$ cost over a sequence of length $T$.
The solution is the KV cache: store the keys and values from all previous tokens and reuse them. But this creates a memory bottleneck:
$$\text{KV cache per token} = 2 \cdot n_h \cdot d_h \cdot l \text{ elements}$$

where $l$ is the number of layers, and the factor of 2 accounts for both keys and values.
For a model like LLaMA-2 70B with $n_h = 64$, $d_h = 128$, and $l = 80$:

$$2 \times 64 \times 128 \times 80 = 1{,}310{,}720 \text{ elements per token}$$

In float16 (2 bytes per element), that is 2.5 MB per token. For a batch of 32 sequences at length 4096:

$$2.5 \text{ MB} \times 32 \times 4096 \approx 320 \text{ GB}$$
This exceeds the memory of even the largest GPUs! The KV cache is the primary bottleneck limiting batch size (throughput) and sequence length during inference.
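The arithmetic above is easy to verify directly (float16 means 2 bytes per element; the configuration is the MHA-style one used in the example):

```python
def mha_kv_cache_bytes(n_h, d_h, n_layers, bytes_per_elem=2):
    """Per-token KV cache: 2 (K and V) * heads * head_dim * layers elements."""
    return 2 * n_h * d_h * n_layers * bytes_per_elem

# 64 heads, 128 dims per head, 80 layers, as in the example above
per_token = mha_kv_cache_bytes(n_h=64, d_h=128, n_layers=80)
batch_total = per_token * 32 * 4096     # batch 32, sequence length 4096

print(per_token / 2**20)    # 2.5  (MB per token)
print(batch_total / 2**30)  # 320.0  (GB for the whole batch)
```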
1.3 Existing Solutions: GQA and MQA
Two approaches have been proposed to reduce the KV cache:
Multi-Query Attention (MQA): All attention heads share a single set of keys and values. Only the queries differ per head.
- KV cache: $2 \cdot d_h \cdot l$ elements per token (reduced by a factor of $n_h$)
- Drawback: Significant performance loss because all heads see the same key-value representation
Grouped-Query Attention (GQA): Heads are divided into $n_g$ groups, where each group shares keys and values.
- KV cache: $2 \cdot n_g \cdot d_h \cdot l$ elements per token
- Tradeoff: GQA with fewer groups saves more memory but loses more performance
| Mechanism | KV Cache per Token | Performance |
|---|---|---|
| MHA | $2 \, n_h d_h l$ | Strong |
| GQA ($n_g$ groups) | $2 \, n_g d_h l$ | Moderate |
| MQA | $2 \, d_h l$ | Weak |
The fundamental problem: GQA and MQA reduce the KV cache by reducing the number of distinct key-value representations. This necessarily discards information. Can we compress the KV cache without losing information?
1.4 Low-Rank Approximation
A matrix $M \in \mathbb{R}^{m \times n}$ is low-rank if it can be well-approximated by the product of two smaller matrices:

$$M \approx AB, \quad A \in \mathbb{R}^{m \times r}, \; B \in \mathbb{R}^{r \times n}$$

where $r \ll \min(m, n)$ is the rank. Instead of storing $mn$ values, we store $r(m + n)$ values, a massive compression when $r$ is small.
Key insight: If the keys and values across attention heads are correlated (not fully independent), then the joint key-value matrix is approximately low-rank. We can compress it into a small latent vector and decompress it on the fly.
This is exactly what MLA does.
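A quick NumPy check of the storage arithmetic: a genuinely rank-$r$ matrix is recovered exactly by a truncated SVD, and its factors store $r(m+n)$ numbers instead of $mn$. Toy dimensions, illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 64, 4
A = rng.standard_normal((m, r))
B = rng.standard_normal((r, n))
M = A @ B                                  # a genuinely rank-r matrix

# Truncated SVD: keep only the top-r singular components
U, S, Vt = np.linalg.svd(M, full_matrices=False)
M_approx = (U[:, :r] * S[:r]) @ Vt[:r]

err = np.abs(M - M_approx).max()           # exact up to float error
full_storage = m * n                       # 4096 numbers
factored_storage = r * (m + n)             # 512 numbers: 8x smaller
```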
1.5 Rotary Position Embedding (RoPE)
Transformers need a way to encode the position of each token in the sequence. Rotary Position Embedding (RoPE) does this by applying a rotation matrix to the query and key vectors that depends on their position.
For position $t$, RoPE applies:

$$\mathbf{q}_t^R = R_t \mathbf{q}_t, \quad \mathbf{k}_t^R = R_t \mathbf{k}_t$$

where $R_t$ is a block-diagonal rotation matrix determined by position $t$. Because $R_t^\top R_s = R_{s-t}$, the attention score $\mathbf{q}_t^{R\top} \mathbf{k}_s^R$ then naturally depends on the relative position $t - s$.
Why this matters for MLA: RoPE makes keys position-dependent. This creates a compatibility issue with low-rank compression, because the position information gets entangled with the compressed representation. MLA solves this with a clever "decoupled RoPE" strategy.
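The relative-position property can be verified numerically. Below is a minimal RoPE sketch (half-split pairing, a simplification of real implementations): rotating a query at position $t$ and a key at position $s$ gives a dot product that depends only on the offset $t - s$:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each (x1[i], x2[i]) pair of x by an angle pos * freqs[i]."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per pair
    theta = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])

rng = np.random.default_rng(1)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same offset t - s = 4 in both cases, so the scores are identical
s1 = rope(q, 7) @ rope(k, 3)
s2 = rope(q, 14) @ rope(k, 10)
```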
1.6 Mixture of Experts (MoE)
In a standard Transformer, every token passes through the same FFN with the same parameters. A Mixture of Experts (MoE) model has many FFN "experts" but only activates a few for each token:

$$\mathbf{h}'_t = \mathbf{u}_t + \sum_{i=1}^{N} g_{i,t}\,\mathrm{FFN}_i(\mathbf{u}_t)$$

where $g_{i,t}$ is a routing score computed by a small gating network, and $g_{i,t} = 0$ for all but the top-$K$ experts.
Benefits: The model has many parameters (knowledge capacity) but only uses a fraction per token (computational cost). This decouples model size from inference cost.
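A minimal sketch of top-$K$ routing, assuming softmax affinities against per-expert centroids and toy linear maps standing in for the expert FFNs (all names and dimensions here are illustrative):

```python
import numpy as np

def moe_forward(u, centroids, experts, top_k):
    """Route token u to the top_k highest-scoring experts and mix outputs.

    centroids: (N, d) routing vectors; experts: list of N callables.
    """
    logits = centroids @ u
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()                     # softmax affinities
    top = np.argsort(scores)[-top_k:]          # indices of top-k experts
    return sum(scores[i] * experts[i](u) for i in top)

rng = np.random.default_rng(0)
d, N = 8, 16
centroids = rng.standard_normal((N, d))
mats = rng.standard_normal((N, d, d))          # toy "expert" weights
experts = [lambda x, W=W: W @ x for W in mats]
out = moe_forward(rng.standard_normal(d), centroids, experts, top_k=2)
```

Only 2 of the 16 experts run for this token, which is the whole point: parameter count and per-token compute are decoupled.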
II. What Does This Paper Do?
DeepSeek-V2 introduces two architectural innovations for the Transformer:
- Multi-head Latent Attention (MLA): A new attention mechanism that compresses keys and values into a low-rank latent vector, reducing the KV cache by 93.3% while actually improving performance compared to standard MHA.
- DeepSeekMoE: An enhanced MoE architecture with fine-grained expert segmentation and shared expert isolation that achieves strong performance at economical training costs.
The resulting model has 236B total parameters (21B activated per token), supports 128K context length, and achieves:
- 42.5% training cost reduction compared to DeepSeek 67B
- 93.3% KV cache reduction
- 5.76× maximum generation throughput
- Top-tier performance among open-source models
III. Multi-head Latent Attention (MLA): The Core Innovation
MLA is the paper's most significant technical contribution. Let us understand it step by step.
3.1 Low-Rank Key-Value Joint Compression
The core idea: instead of storing separate key and value vectors for each head, compress them into a single latent vector that is much smaller.
Compression (down-projection):

$$\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t$$

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector, $d_c \ll d_h n_h$ is the KV compression dimension, and $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix.
Decompression (up-projection):

$$\mathbf{k}_t^C = W^{UK} \mathbf{c}_t^{KV}, \quad \mathbf{v}_t^C = W^{UV} \mathbf{c}_t^{KV}$$

where $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are up-projection matrices.
The key insight for inference: During inference, only $\mathbf{c}_t^{KV}$ needs to be cached, not the full keys and values! The KV cache shrinks from $2\, n_h d_h l$ elements per token to just $d_c \cdot l$ elements.
But it gets even better: During inference, the up-projection $W^{UK}$ can be absorbed into the query projection $W^Q$. Similarly, $W^{UV}$ can be absorbed into the output projection $W^O$. This means we do not even need to decompress the keys and values explicitly! The attention is computed directly on the compressed representations.
Why absorption works: Consider the attention score computation. Instead of:

$$\mathbf{q}_t^\top \mathbf{k}_j = (W^Q \mathbf{h}_t)^\top (W^{UK} \mathbf{c}_j^{KV})$$

we can precompute $(W^{UK})^\top W^Q$ (absorbing $W^{UK}$ into the query computation), then compute:

$$\mathbf{q}_t^\top \mathbf{k}_j = \left((W^{UK})^\top W^Q \mathbf{h}_t\right)^\top \mathbf{c}_j^{KV}$$

This is a dot product in the compressed $d_c$-dimensional space, which is both faster and uses less memory.
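The absorption identity is easy to check numerically. A single head with toy dimensions: computing logits against explicitly decompressed keys equals a dot product in latent space with a pre-absorbed query projection:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_h, d_c, T = 32, 8, 4, 5
Wq = rng.standard_normal((d_h, d))     # query projection (one head)
Wuk = rng.standard_normal((d_h, d_c))  # key up-projection W^{UK}
h_t = rng.standard_normal(d)           # current hidden state
C = rng.standard_normal((T, d_c))      # cached latents c_j^{KV}

# Naive: decompress every cached key, then dot with the query
naive = (Wq @ h_t) @ (C @ Wuk.T).T     # q_t^T k_j for all j

# Absorbed: fold W^{UK} into the query once, skip decompression entirely
absorbed = (Wuk.T @ Wq @ h_t) @ C.T    # same logits, in d_c-dim space
```

The absorbed path touches only $d_c$-dimensional vectors per cached token, which is why MLA never materializes full keys at decode time.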
3.2 Query Compression (Training Optimization)
MLA also compresses queries (even though this does not reduce the KV cache):

$$\mathbf{c}_t^Q = W^{DQ} \mathbf{h}_t, \quad \mathbf{q}_t^C = W^{UQ} \mathbf{c}_t^Q$$

where $\mathbf{c}_t^Q \in \mathbb{R}^{d_c'}$ is the compressed query latent vector and $d_c' \ll d_h n_h$.
This reduces activation memory during training (the stored activations needed for backpropagation), since only the smaller compressed vector needs to be saved, not the full query.
3.3 The RoPE Incompatibility Problem
Here is where things get interesting. RoPE requires applying position-dependent rotations to keys. But if we apply RoPE to the decompressed keys $\mathbf{k}_t^C = W^{UK} \mathbf{c}_t^{KV}$, then:

$$\mathbf{k}_t^R = R_t W^{UK} \mathbf{c}_t^{KV}$$

The RoPE rotation matrix $R_t$ now sits between $W^Q$ and $W^{UK}$ in the attention score, coupling them. Since $R_t$ depends on position (which changes for each cached token), we can no longer absorb $W^{UK}$ into $W^Q$. We would need to recompute all keys for every generation step, destroying the efficiency gains!
3.4 Decoupled RoPE: The Elegant Solution
MLA solves this with decoupled RoPE: use separate, small vectors dedicated solely to carrying position information:
Additional RoPE queries (per head):

$$\mathbf{q}_{t,i}^R = \mathrm{RoPE}\!\left((W^{QR} \mathbf{c}_t^Q)_i\right)$$

Additional RoPE key (shared across heads):

$$\mathbf{k}_t^R = \mathrm{RoPE}(W^{KR} \mathbf{h}_t)$$

where $W^{QR} \in \mathbb{R}^{d_h^R n_h \times d_c'}$, $W^{KR} \in \mathbb{R}^{d_h^R \times d}$, and $d_h^R$ is a small per-head dimension for the decoupled position signal.
Combined queries and keys:

$$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^C; \mathbf{q}_{t,i}^R], \quad \mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^C; \mathbf{k}_t^R]$$

Attention computation:

$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^\top \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right) \mathbf{v}_{j,i}^C$$

Why this works: The content-based attention ($\mathbf{q}^C$ and $\mathbf{k}^C$) operates on the compressed representations and can absorb the up-projection matrices. The position-based attention ($\mathbf{q}^R$ and $\mathbf{k}^R$) carries RoPE independently in a small, separate space. The two are concatenated, so the attention score is the sum of content similarity and positional relevance.
Total KV cache: During inference, we cache $\mathbf{c}_t^{KV}$ (for content) and $\mathbf{k}_t^R$ (for position). The total cache per token is:

$$(d_c + d_h^R) \cdot l \text{ elements}$$
3.5 Concrete Numbers
For DeepSeek-V2:
- $n_h = 128$ attention heads
- $d_h = 128$ dimension per head
- $d_c = 512$ (KV compression dimension $= 4 d_h$)
- $d_h^R = 64$ (decoupled RoPE dimension $= d_h / 2$)
- $d_c' = 1536$ (query compression dimension)
- $l = 60$ layers
KV cache comparison:
| Mechanism | Cache per Token | Relative Size |
|---|---|---|
| MHA | $2\,n_h d_h \cdot l = 32{,}768 \cdot l$ | 100% |
| GQA (8 groups) | $2 \cdot 8 \cdot d_h \cdot l = 2{,}048 \cdot l$ | 6.25% |
| MQA | $2\,d_h \cdot l = 256 \cdot l$ | 0.78% |
| MLA | $(d_c + d_h^R) \cdot l = 576 \cdot l$ | 1.76% |
MLA's cache is equivalent to GQA with only 2.25 groups, yet its performance is stronger than MHA. This is the paper's central claim: MLA breaks the tradeoff between cache size and attention quality.
The 93.3% reduction: Compared to DeepSeek 67B, the per-token KV cache drops from that model's baseline to MLA's $(d_c + d_h^R) \cdot l$. The paper's reported 93.3% reduction accounts for the specific configurations (layer counts and attention dimensions) of both DeepSeek 67B and DeepSeek-V2.
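The table's relative sizes follow directly from the per-layer dimensions above, and can be reproduced in a few lines:

```python
# DeepSeek-V2 attention dimensions (Section 3.5)
n_h, d_h, d_c, d_r = 128, 128, 512, 64

mha  = 2 * n_h * d_h   # full MHA: 32,768 elements per token per layer
gqa8 = 2 * 8 * d_h     # GQA, 8 groups: 2,048
mqa  = 2 * d_h         # MQA: 256
mla  = d_c + d_r       # MLA latent + shared RoPE key: 576

rel_mla  = round(100 * mla / mha, 2)   # 1.76 (percent of MHA)
rel_gqa8 = round(100 * gqa8 / mha, 2)  # 6.25
rel_mqa  = round(100 * mqa / mha, 2)   # 0.78
eq_groups = mla / (2 * d_h)            # 2.25 "equivalent GQA groups"
```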
IV. DeepSeekMoE: The FFN Architecture
4.1 Fine-Grained Expert Segmentation
Standard MoE (e.g., GShard) uses a small number of large experts. DeepSeekMoE segments experts into finer granularity: 160 routed experts (each with intermediate hidden dimension 1536) plus 2 shared experts.
Why finer is better: Smaller experts can specialize more precisely. Instead of one large expert handling "all of mathematics," multiple small experts can specialize in algebra, geometry, calculus, etc. This enables more accurate knowledge acquisition.
4.2 Shared Expert Isolation
DeepSeekMoE isolates a few shared experts that process every token, regardless of routing decisions:

$$\mathbf{h}'_t = \mathbf{u}_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}_i^{(r)}(\mathbf{u}_t)$$

where there are $N_s = 2$ shared experts and $N_r = 160$ routed experts (of which $K_r = 6$ are activated per token).
Purpose: Shared experts capture common knowledge that all tokens need (e.g., basic syntax, formatting), while routed experts specialize. This reduces knowledge redundancy among routed experts.
4.3 Routing Mechanism
The routing score for token $t$ and expert $i$:

$$s_{i,t} = \mathrm{Softmax}_i(\mathbf{u}_t^\top \mathbf{e}_i)$$

where $\mathbf{e}_i$ is the centroid of the $i$-th routed expert. The top-$K_r$ experts are selected, and their outputs are weighted by their routing scores.
4.4 Device-Limited Routing
With 160 experts distributed across multiple GPUs, a token might need to be sent to many devices, creating expensive communication. DeepSeek-V2 limits each token's routed experts to at most $M$ devices:
- First select the $M$ devices containing the highest-scoring experts
- Then perform the top-$K_r$ selection only among experts on those devices
This bounds communication costs while maintaining nearly the same performance as unrestricted routing (empirically, $M \geq 3$ is sufficient; DeepSeek-V2 uses $M = 3$).
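The two-stage selection can be sketched as follows. This is a simplified interpretation (devices ranked by their best-scoring expert), not the paper's exact implementation:

```python
import numpy as np

def device_limited_topk(scores, device_of, top_m, top_k):
    """Top-k expert selection restricted to the top_m best devices.

    scores: (N,) affinity per expert; device_of: (N,) device id per expert.
    """
    n_dev = int(device_of.max()) + 1
    # Stage 1: rank devices by the best expert score they host
    best = np.array([scores[device_of == d].max() for d in range(n_dev)])
    allowed = set(np.argsort(best)[-top_m:].tolist())
    # Stage 2: top-k only among experts on the allowed devices
    mask = np.array([int(device_of[i]) in allowed for i in range(len(scores))])
    masked = np.where(mask, scores, -np.inf)
    return np.argsort(masked)[-top_k:]

rng = np.random.default_rng(0)
scores = rng.random(16)
device_of = np.repeat(np.arange(4), 4)      # 16 experts on 4 devices
chosen = device_limited_topk(scores, device_of, top_m=2, top_k=3)
devices_used = {int(device_of[i]) for i in chosen}
```

Whatever the scores, the chosen experts span at most `top_m` devices, which is the communication bound.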
4.5 Load Balancing: Three Auxiliary Losses
Unbalanced routing can cause "routing collapse" (some experts are never used) and computation waste. DeepSeek-V2 uses three complementary loss terms:
Expert-level balance loss (prevents routing collapse):

$$\mathcal{L}_{\mathrm{ExpBal}} = \alpha_1 \sum_{i=1}^{N_r} f_i P_i$$

where $f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}(\text{token } t \text{ selects expert } i)$ is the normalized fraction of tokens routed to expert $i$, and $P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}$ is the average routing probability for expert $i$.
Device-level balance loss (ensures balanced computation across GPUs):

$$\mathcal{L}_{\mathrm{DevBal}} = \alpha_2 \sum_{i=1}^{D} f_i' P_i'$$

where the sums defining $f_i'$ and $P_i'$ run over devices rather than individual experts.
Communication balance loss (ensures balanced inter-device data transfer):

$$\mathcal{L}_{\mathrm{CommBal}} = \alpha_3 \sum_{i=1}^{D} f_i'' P_i''$$

where $f_i''$ measures the fraction of tokens sent to device $i$.
The hyperparameters are set to $\alpha_1 = 0.003$, $\alpha_2 = 0.05$, $\alpha_3 = 0.02$.
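A sketch of the expert-level loss (the device-level and communication variants aggregate the same quantities per device). With the $N_r / (K_r T)$ normalization, perfectly uniform routing gives $f_i = 1$ for every expert, so the balanced loss equals $\alpha_1$ exactly, and any skew pushes it higher:

```python
import numpy as np

def expert_balance_loss(probs, top_k, alpha=0.003):
    """Expert-level balance loss: alpha * sum_i f_i * P_i.

    probs: (T, N) routing probabilities s_{i,t} for T tokens over N experts.
    """
    T, N = probs.shape
    selected = np.argsort(probs, axis=1)[:, -top_k:]   # top-k per token
    counts = np.bincount(selected.ravel(), minlength=N)
    f = counts * N / (top_k * T)      # normalized selection frequency
    P = probs.mean(axis=0)            # mean routing probability per expert
    return alpha * float((f * P).sum())

# Perfectly balanced routing: loss equals alpha exactly
uniform = np.full((100, 8), 1 / 8)
balanced = expert_balance_loss(uniform, top_k=2)

# Collapsed routing (all mass on one expert): loss grows
skew = np.zeros((100, 8))
skew[:, 0] = 1.0
collapsed = expert_balance_loss(skew, top_k=2)
```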
4.6 Token-Dropping Strategy
As a complementary mechanism, DeepSeek-V2 drops tokens with the lowest affinity scores on each device when the load exceeds a capacity factor of 1.0. Approximately 10% of training sequences are protected from dropping to maintain consistency. During inference, no tokens are dropped.
V. Training and Infrastructure
5.1 Pre-Training Data
- 8.1T tokens from a high-quality, multi-source corpus
- The corpus contains approximately 12% more Chinese tokens than English tokens
- Tokenizer: Byte-level BPE with 100K vocabulary (same as DeepSeek 67B)
- Data quality improvements: recovered mistakenly deleted data, enriched with more Chinese data, improved quality-based filtering
5.2 Training Configuration
- 60 Transformer layers, hidden dimension 5120
- First layer uses a dense FFN; all subsequent layers use MoE
- Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay = 0.1)
- Learning rate: Warmup-and-step-decay; max LR $2.4 \times 10^{-4}$
- Batch size: Gradually increased from 2304 to 9216 over the first 225B tokens
- Sequence length: 4096 during pre-training
5.3 Infrastructure
Built on HAI-LLM, an internal training framework featuring:
- 16-way zero-bubble pipeline parallelism
- 8-way expert parallelism (each MoE layer's routed experts spread across 8 devices)
- ZeRO-1 data parallelism
- No tensor parallelism needed (due to small activated parameter count)
- Shared expert computation overlapped with expert parallel all-to-all communication
5.4 Long Context Extension
After pre-training at 4K context, the model is extended to 128K using YaRN (Yet another RoPE extensioN method):
- Additional training on 1000 steps with sequence length 32K
- RoPE base frequency increased from 10,000 to 160,000
- YaRN scale factor $s = 40$, targeting 128K context length
VI. Evaluation Results
6.1 Pre-Training Performance
DeepSeek-V2 (21B activated parameters) vs. other open-source models:
| Benchmark | DeepSeek-V2 | DeepSeek 67B | LLaMA 3 70B | Mixtral 8×22B |
|---|---|---|---|---|
| MMLU | 78.5 | 71.3 | 79.5 | 77.8 |
| BBH | 78.9 | 68.7 | 81.0 | 78.9 |
| HumanEval | 48.8 | 45.1 | 48.2 | 46.3 |
| MATH | 43.6 | 18.7 | 42.5 | 41.7 |
| GSM8K | 79.2 | 63.4 | 83.0 | 78.6 |
Despite having only 21B activated parameters (vs. 70B for LLaMA 3 70B), DeepSeek-V2 achieves competitive performance, especially excelling in math and code.
6.2 Efficiency Gains
Compared to DeepSeek 67B:
| Metric | Improvement |
|---|---|
| Training cost | 42.5% reduction |
| KV cache | 93.3% reduction |
| Maximum generation throughput | 5.76× increase |
6.3 Alignment Performance
After SFT (1.5M conversations) and RL (GRPO — Group Relative Policy Optimization):
| Benchmark | DeepSeek-V2 Chat (RL) |
|---|---|
| AlpacaEval 2.0 (LC win rate) | 38.9% |
| MT-Bench | 8.97/10 |
| AlignBench (Chinese) | 7.91/10 |
DeepSeek-V2 Chat (RL) achieves top-tier performance among open-source chat models, and in Chinese, even outperforms most closed-source models.
6.4 Ablation: MLA vs. MHA
The paper provides an ablation comparing MLA directly to MHA (Appendix D.2). In a 16B model:
- MLA achieves lower validation loss (better performance) than MHA with the same training compute
- MLA also outperforms GQA and MQA across all group configurations
This confirms that MLA is not just about efficiency — it actually improves model quality, likely because the low-rank constraint acts as a beneficial inductive bias that encourages the model to learn more structured, generalizable key-value representations.
VII. Limitations and Discussion
7.1 Architectural Limitations
- Fixed compression dimension: The choice of $d_c$ is a hyperparameter. Too small loses information; too large wastes memory. Adaptive compression could be explored.
- RoPE workaround: The decoupled RoPE strategy adds complexity and a small amount of extra cache ($d_h^R \cdot l$ elements per token). Alternative position encoding methods might integrate more naturally with low-rank compression.
- Expert parallelism overhead: Despite device-limited routing, the communication costs of MoE models remain non-trivial at scale. The three auxiliary losses add tuning complexity.
7.2 Training Limitations
- Closed training data: The 8.1T token corpus details are vague. Reproducibility is limited by the undisclosed data pipeline.
- Evaluation scope: While broad, some evaluations may favor the model's training distribution (especially Chinese benchmarks).
- Alignment methodology: The paper uses GRPO for RL alignment but provides limited details on the reward model or preference data.
7.3 Scaling Questions
- How does MLA scale to even larger models (e.g., the subsequent DeepSeek-V3)?
- Is the optimal compression ratio (the size of $d_c$ relative to $n_h d_h$) consistent across model sizes?
- Can MLA be combined with other efficiency techniques (quantization, speculative decoding)?
VIII. Impact and Significance
8.1 Why MLA Matters
MLA represents a paradigm shift in attention mechanism design:
- Breaks the KV cache vs. quality tradeoff: Previous methods (GQA, MQA) sacrificed quality for efficiency. MLA achieves both simultaneously by exploiting the low-rank structure of key-value representations.
- Practical impact on deployment: A 93.3% KV cache reduction means dramatically higher throughput, larger batch sizes, and longer context support, directly reducing inference costs.
- Conceptual contribution: The idea that keys and values across heads are highly correlated (and thus compressible) is a fundamental insight about how attention works in practice.
8.2 Influence on Subsequent Work
MLA has been adopted and extended in:
- DeepSeek-V3: The successor model, which scales MLA to 671B total parameters
- DeepSeek-R1: The reasoning model, confirming MLA's robustness across tasks
- Community implementations: MLA has inspired research into efficient attention mechanisms, with several groups exploring variants
8.3 MLA in the Broader Efficiency Landscape
MLA complements other efficiency techniques:
| Technique | What It Optimizes | Compatible with MLA? |
|---|---|---|
| Quantization (INT8/INT4) | Memory per element | ✅ Yes — compress the latent vector further |
| Flash Attention | Compute (IO-aware) | ✅ Yes — applies to the attention computation |
| Speculative decoding | Latency | ✅ Yes — smaller KV cache speeds up verification |
| GQA/MQA | KV cache (by head sharing) | ❌ Replaced by MLA |
| KV cache eviction | Memory (by dropping old KV) | ✅ Yes — but MLA reduces the need |
IX. Reproducibility
| Criterion | Assessment |
|---|---|
| Code availability | ✅ Model checkpoints at github.com/deepseek-ai/DeepSeek-V2 |
| Architecture details | ✅ Complete formulas in paper and appendix |
| Hyperparameters | ✅ All model and training hyperparameters specified |
| Training data | ❌ 8.1T token corpus not publicly released; composition not detailed |
| Training infrastructure | ⚠️ Internal framework (HAI-LLM); general approach described |
| Ablation studies | ✅ MHA vs. GQA vs. MQA vs. MLA comparisons provided |
| Evaluation details | ✅ Benchmark settings and prompts specified |
| Reproducibility risk | High — training data and infrastructure are proprietary |
The architecture is well-documented and reproducible. The training process is not, due to proprietary data and infrastructure. However, the model checkpoints are publicly available, enabling downstream use and evaluation.
X. Summary: Key Takeaways
- MLA's core trick: Compress keys and values into a shared low-rank latent vector $\mathbf{c}_t^{KV}$. Cache only this small vector instead of full keys and values. Absorb the decompression matrices into the query/output projections to avoid even computing full keys/values during inference.
- Decoupled RoPE solves the position encoding problem: By carrying position information in a separate, small vector ($\mathbf{k}_t^R$, shared across heads), MLA avoids the incompatibility between RoPE and low-rank compression. This is a clever engineering solution to a real mathematical obstacle.
- 93.3% KV cache reduction with better performance: MLA caches only $d_c + d_h^R = 576$ elements per token per layer, equivalent to GQA with 2.25 groups, yet outperforms full MHA. The low-rank constraint likely acts as a beneficial regularizer.
- DeepSeekMoE enables economical scaling: Fine-grained expert segmentation (160 routed experts, 6 activated) with shared experts achieves better performance than coarse-grained MoE at equivalent cost. Device-limited routing and three-tier load balancing ensure practical efficiency.
- The numbers are impressive: 236B parameters, 21B activated, 8.1T training tokens, 128K context, 42.5% training cost reduction, 5.76× throughput improvement, all while matching or exceeding 70B-class dense models.
- MLA is the ancestor of DeepSeek-V3 and R1: This architecture proved so effective that it became the foundation for DeepSeek's subsequent models, which have pushed the frontier of open-source LLMs.
- The key lesson: In attention mechanisms, the assumption that each head needs independent keys and values is wasteful. The real information content of key-value representations is much lower-dimensional than the raw vector space suggests, and exploiting this structure yields dramatic efficiency gains with no quality penalty.