DeepSeek-V2: Multi-head Latent Attention and DeepSeekMoE — Detailed Technical Review

Paper: DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Authors: DeepSeek-AI
Affiliation: DeepSeek
Published: May 2024 (arXiv: 2405.04434)
Reviewer: Zhongzhu Zhou
Review Date: February 18, 2026


I. Prerequisites: What You Need to Know

This section covers every foundational concept needed to understand DeepSeek-V2's innovations. We will build from basic attention mechanics all the way to the KV cache bottleneck and low-rank compression.

1.1 The Transformer and Multi-Head Attention (MHA)

The Transformer architecture is the backbone of virtually all modern large language models. At its core, each Transformer block has two components: an attention module and a Feed-Forward Network (FFN).

The attention mechanism allows the model to "look at" all previous tokens when generating the next one. Specifically, Multi-Head Attention (MHA) works as follows:

Given the input hidden state $\mathbf{h}_t \in \mathbb{R}^d$ for the $t$-th token (where $d$ is the embedding dimension), MHA computes:

$$\mathbf{q}_t = W^Q \mathbf{h}_t, \quad \mathbf{k}_t = W^K \mathbf{h}_t, \quad \mathbf{v}_t = W^V \mathbf{h}_t$$

where $W^Q, W^K, W^V \in \mathbb{R}^{n_h d_h \times d}$ are projection matrices, $n_h$ is the number of attention heads, and $d_h$ is the dimension per head.

These vectors are split into $n_h$ heads:

$$[\mathbf{q}_{t,1}; \mathbf{q}_{t,2}; \ldots; \mathbf{q}_{t,n_h}] = \mathbf{q}_t$$

Each head independently computes attention:

$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j\left(\frac{\mathbf{q}_{t,i}^T \mathbf{k}_{j,i}}{\sqrt{d_h}}\right) \mathbf{v}_{j,i}$$

The outputs are concatenated and projected:

$$\mathbf{u}_t = W^O [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}]$$

Why it works: Each head can attend to different aspects of the input (syntactic relationships, semantic similarity, positional patterns). The multi-head structure gives the model a rich, multi-faceted view of the context.
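The equations above can be condensed into a short NumPy sketch. This is a toy illustration, not any production implementation; the names (`mha`, `Wq`, …) and sizes are ours. It processes a whole sequence at once with a causal mask:

```python
import numpy as np

def mha(h, Wq, Wk, Wv, Wo, n_h, d_h):
    """Toy causal multi-head attention over a whole sequence. h: (T, d)."""
    T = h.shape[0]
    # project, then split into n_h heads -> (n_h, T, d_h)
    q = (h @ Wq.T).reshape(T, n_h, d_h).transpose(1, 0, 2)
    k = (h @ Wk.T).reshape(T, n_h, d_h).transpose(1, 0, 2)
    v = (h @ Wv.T).reshape(T, n_h, d_h).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)       # (n_h, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), 1)       # mask future tokens
    scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                          # per-head softmax
    o = (w @ v).transpose(1, 0, 2).reshape(T, n_h * d_h)   # concatenate heads
    return o @ Wo.T                                        # project back to (T, d)

rng = np.random.default_rng(0)
d, n_h, d_h, T = 16, 4, 4, 5
h = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((n_h * d_h, d)) for _ in range(3))
Wo = rng.standard_normal((d, n_h * d_h))
u = mha(h, Wq, Wk, Wv, Wo, n_h, d_h)
assert u.shape == (T, d)
h2 = h.copy(); h2[-1] = 0.0
assert np.allclose(u[:-1], mha(h2, Wq, Wk, Wv, Wo, n_h, d_h)[:-1])  # causality
```

The final assertion checks the causal property that matters for the KV cache discussion below: earlier outputs never depend on later tokens.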

1.2 The KV Cache Problem

During text generation (inference), the model produces tokens one at a time. At each step, it must compute attention over all previous tokens. Naively, this means recomputing keys and values for every previous token at every step—an $O(n^2)$ operation in the sequence length $n$.

The solution is the KV cache: store the keys and values from all previous tokens and reuse them. But this creates a memory bottleneck:

$$\text{KV cache per token} = 2 \times n_h \times d_h \times l \text{ elements}$$

where $l$ is the number of layers, and the factor of 2 accounts for both keys and values.

For a 70B-scale model using full MHA with $n_h = 64$, $d_h = 128$, and $l = 80$:

$$\text{KV cache per token} = 2 \times 64 \times 128 \times 80 = 1{,}310{,}720 \text{ elements}$$

In float16, that is 2.5 MB per token. For a batch of 32 sequences at length 4096:

$$32 \times 4096 \times 2.5 \text{ MB} \approx 320 \text{ GB}$$

This exceeds the memory of even the largest GPUs! The KV cache is the primary bottleneck limiting batch size (throughput) and sequence length during inference.
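The arithmetic above can be checked directly (binary units assumed, i.e. the MB/GB figures read as MiB/GiB):

```python
# Back-of-envelope KV-cache sizing for the numbers above (fp16 = 2 bytes/element).
n_h, d_h, layers = 64, 128, 80
elems_per_token = 2 * n_h * d_h * layers        # keys + values, all layers
bytes_per_token = 2 * elems_per_token           # float16
assert elems_per_token == 1_310_720
assert bytes_per_token == 2.5 * 2**20           # 2.5 MiB per token

batch, seq_len = 32, 4096
total_gib = batch * seq_len * bytes_per_token / 2**30
assert total_gib == 320.0                       # more than any single GPU holds
```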

1.3 Existing Solutions: GQA and MQA

Two approaches have been proposed to reduce the KV cache:

Multi-Query Attention (MQA): All attention heads share a single set of keys and values. Only the queries differ per head.

  • KV cache: $2 d_h l$ elements per token (reduced by a factor of $n_h$)
  • Drawback: Significant performance loss, because all heads see the same key-value representation

Grouped-Query Attention (GQA): Heads are divided into $n_g$ groups, where each group shares keys and values.

  • KV cache: $2 n_g d_h l$ elements per token
  • Tradeoff: GQA with fewer groups saves more memory but loses more performance

| Mechanism | KV Cache per Token | Performance |
|---|---|---|
| MHA | $2 n_h d_h l$ | Strong |
| GQA ($n_g$ groups) | $2 n_g d_h l$ | Moderate |
| MQA | $2 d_h l$ | Weak |

The fundamental problem: GQA and MQA reduce the KV cache by reducing the number of distinct key-value representations. This necessarily discards information. Can we compress the KV cache without losing information?

1.4 Low-Rank Approximation

A matrix $A \in \mathbb{R}^{m \times n}$ is low-rank if it can be well-approximated by the product of two smaller matrices:

$$A \approx U V^T, \quad U \in \mathbb{R}^{m \times r}, \; V \in \mathbb{R}^{n \times r}$$

where $r \ll \min(m, n)$ is the rank. Instead of storing $mn$ values, we store $(m + n) \times r$ values—a massive compression when $r$ is small.

Key insight: If the keys and values across attention heads are correlated (not fully independent), then the joint key-value matrix is approximately low-rank. We can compress it into a small latent vector and decompress it on the fly.

This is exactly what MLA does.
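A tiny NumPy demonstration of the storage argument: when a matrix is (approximately) low-rank, a truncated SVD recovers it from far fewer numbers. The sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 256, 8
# A matrix whose rows are correlated by construction: exactly rank r.
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_r = (U[:, :r] * S[:r]) @ Vt[:r]               # best rank-r reconstruction

rel_err = np.linalg.norm(A - A_r) / np.linalg.norm(A)
assert rel_err < 1e-10                          # essentially exact
assert (m + n) * r < m * n                      # 2,560 stored numbers vs 16,384
```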

1.5 Rotary Position Embedding (RoPE)

Transformers need a way to encode the position of each token in the sequence. Rotary Position Embedding (RoPE) does this by applying a rotation matrix to the query and key vectors that depends on their position.

For position $t$, RoPE applies:

$$\mathbf{q}_t \leftarrow R_t \mathbf{q}_t, \quad \mathbf{k}_t \leftarrow R_t \mathbf{k}_t$$

where $R_t$ is a block-diagonal rotation matrix determined by position $t$. The attention score $\mathbf{q}_t^T \mathbf{k}_j$ then naturally depends on the relative position $t - j$.

Why this matters for MLA: RoPE makes keys position-dependent. This creates a compatibility issue with low-rank compression, because the position information gets entangled with the compressed representation. MLA solves this with a clever "decoupled RoPE" strategy.
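A minimal RoPE sketch (GPT-NeoX-style pairing of dimension $i$ with $i + d/2$; a simplification, not DeepSeek's exact implementation) makes the relative-position property concrete:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each 2-D pair of an even-dimensional vector x by pos-dependent angles."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per 2-D pair
    theta = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# The score depends only on the relative offset t - j:
s1 = rope(q, 10) @ rope(k, 7)      # positions 10 and 7, offset 3
s2 = rope(q, 103) @ rope(k, 100)   # positions 103 and 100, offset 3
assert np.isclose(s1, s2)
```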

1.6 Mixture of Experts (MoE)

In a standard Transformer, every token passes through the same FFN with the same parameters. A Mixture of Experts (MoE) model has many FFN "experts" but only activates a few for each token:

$$\text{output} = \sum_{i=1}^{N} g_i \cdot \text{FFN}_i(\mathbf{x}), \quad g_i = \begin{cases} s_i & \text{if } i \in \text{Top-}K \\ 0 & \text{otherwise}\end{cases}$$

where $s_i$ is a routing score computed by a small gating network, and only the top-$K$ experts are activated.

Benefits: The model has many parameters (knowledge capacity) but only uses a fraction per token (computational cost). This decouples model size from inference cost.
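The gating rule can be written literally as the piecewise definition above; everything here (names, sizes, the softmax router) is illustrative:

```python
import numpy as np

def gate(x, centroids, top_k):
    """g_i = s_i if i in Top-K else 0, with s = softmax over expert affinities."""
    logits = centroids @ x                 # one affinity score per expert
    s = np.exp(logits - logits.max())
    s /= s.sum()
    g = np.zeros_like(s)
    top = np.argsort(s)[-top_k:]           # indices of the Top-K experts
    g[top] = s[top]
    return g

rng = np.random.default_rng(0)
N, d, K = 16, 8, 2
g = gate(rng.standard_normal(d), rng.standard_normal((N, d)), K)
assert np.count_nonzero(g) == K            # only K of the N experts are activated
```

Since $g$ has only $K$ nonzero entries, only $K$ of the $N$ expert FFNs need to run for this token, which is exactly the compute saving described above.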


II. What Does This Paper Do?

DeepSeek-V2 introduces two architectural innovations for the Transformer:

  1. Multi-head Latent Attention (MLA): A new attention mechanism that compresses keys and values into a low-rank latent vector, reducing the KV cache by 93.3% while actually improving performance compared to standard MHA.

  2. DeepSeekMoE: An enhanced MoE architecture with fine-grained expert segmentation and shared expert isolation that achieves strong performance at economical training costs.

The resulting model has 236B total parameters (21B activated per token), supports 128K context length, and achieves:

  • 42.5% training cost reduction compared to DeepSeek 67B
  • 93.3% KV cache reduction
  • 5.76× maximum generation throughput
  • Top-tier performance among open-source models

III. Multi-head Latent Attention (MLA): The Core Innovation

MLA is the paper's most significant technical contribution. Let us understand it step by step.

3.1 Low-Rank Key-Value Joint Compression

The core idea: instead of storing separate key and value vectors for each head, compress them into a single latent vector $\mathbf{c}_t^{KV}$ that is much smaller.

Compression (down-projection):

$$\mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t$$

where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector, $d_c \ll n_h d_h$ is the KV compression dimension, and $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix.

Decompression (up-projection):

$$\mathbf{k}_t^C = W^{UK} \mathbf{c}_t^{KV}, \quad \mathbf{v}_t^C = W^{UV} \mathbf{c}_t^{KV}$$

where $W^{UK}, W^{UV} \in \mathbb{R}^{n_h d_h \times d_c}$ are up-projection matrices.

The key insight for inference: During inference, only $\mathbf{c}_t^{KV}$ needs to be cached — not the full keys and values! The KV cache shrinks from $2 n_h d_h l$ elements per token to just $d_c l$ elements.

But it gets even better: During inference, the up-projection $W^{UK}$ can be absorbed into the query projection $W^Q$. Similarly, $W^{UV}$ can be absorbed into the output projection $W^O$. This means we do not even need to decompress the keys and values explicitly! The attention is computed directly on the compressed representations.

Why absorption works: Consider the attention score computation. Instead of:

$$\mathbf{q}_{t,i}^T \mathbf{k}_{j,i}^C = \mathbf{q}_{t,i}^T [W^{UK}]_i \mathbf{c}_j^{KV}$$

we can precompute $\tilde{\mathbf{q}}_{t,i} = [W^{UK}]_i^T \mathbf{q}_{t,i}$ (absorbing $W^{UK}$ into the query computation), then compute:

$$\tilde{\mathbf{q}}_{t,i}^T \mathbf{c}_j^{KV}$$

This is a dot product in the compressed $d_c$-dimensional space, which is both faster and uses less memory.
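The absorption argument is just associativity of matrix multiplication, which is easy to verify numerically (the shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_c = 128, 32
W_uk = rng.standard_normal((d_h, d_c))   # per-head slice of the up-projection
q = rng.standard_normal(d_h)             # one head's query
c = rng.standard_normal(d_c)             # cached latent for one past token

naive = q @ (W_uk @ c)                   # decompress the key, then dot product
absorbed = (W_uk.T @ q) @ c              # fold W_uk into the query once, reuse
assert np.isclose(naive, absorbed)
```

In practice the absorbed form wins because $W_{uk}^T q$ is computed once per query, while the naive form would decompress a key for every cached token.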

3.2 Query Compression (Training Optimization)

MLA also compresses queries (even though this does not reduce the KV cache):

$$\mathbf{c}_t^Q = W^{DQ} \mathbf{h}_t, \quad \mathbf{q}_t^C = W^{UQ} \mathbf{c}_t^Q$$

where $\mathbf{c}_t^Q \in \mathbb{R}^{d_c'}$ is the compressed query latent vector and $d_c' \ll n_h d_h$.

This reduces activation memory during training (the stored activations needed for backpropagation), since only the smaller compressed vector needs to be saved, not the full query.

3.3 The RoPE Incompatibility Problem

Here is where things get interesting. RoPE requires applying position-dependent rotations to keys. But if we apply RoPE to the compressed keys $\mathbf{k}_t^C$, then:

$$\text{Attention score} = \mathbf{q}_{t,i}^T \cdot \text{RoPE}(W^{UK} \mathbf{c}_j^{KV})$$

The RoPE rotation matrix sits between $W^{UK}$ and $\mathbf{c}_j^{KV}$, coupling them. Since RoPE depends on position $j$ (which changes for each cached token), we can no longer absorb $W^{UK}$ into $W^Q$. We would need to recompute all keys for every generation step — destroying the efficiency gains!

3.4 Decoupled RoPE: The Elegant Solution

MLA solves this with decoupled RoPE: use separate, small vectors dedicated solely to carrying position information:

Additional RoPE queries (per head):

$$[\mathbf{q}_{t,1}^R; \mathbf{q}_{t,2}^R; \ldots; \mathbf{q}_{t,n_h}^R] = \mathbf{q}_t^R = \text{RoPE}(W^{QR} \mathbf{c}_t^Q)$$

Additional RoPE key (shared across heads):

$$\mathbf{k}_t^R = \text{RoPE}(W^{KR} \mathbf{h}_t)$$

where $\mathbf{q}_{t,i}^R, \mathbf{k}_t^R \in \mathbb{R}^{d_h^R}$ and $d_h^R$ is a small per-head dimension for the decoupled position signal.

Combined queries and keys:

$$\mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^C;\; \mathbf{q}_{t,i}^R], \quad \mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^C;\; \mathbf{k}_t^R]$$

Attention computation:

$$\mathbf{o}_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j\left(\frac{\mathbf{q}_{t,i}^T \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^R}}\right) \mathbf{v}_{j,i}^C$$

Why this works: The content-based attention ($\mathbf{q}^C$ and $\mathbf{k}^C$) operates on the compressed representations and can absorb the up-projection matrices. The position-based attention ($\mathbf{q}^R$ and $\mathbf{k}^R$) carries RoPE independently in a small, separate space. The two are concatenated, so the attention score is the sum of content similarity and positional relevance.
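Because the dot product of concatenated vectors splits into blockwise dot products, the "content score plus position score" decomposition is exact (dimensions below match DeepSeek-V2's $d_h = 128$ per-head content part and $d_h^R = 64$, but any sizes work):

```python
import numpy as np

rng = np.random.default_rng(0)
q_c, k_c = rng.standard_normal(128), rng.standard_normal(128)  # content parts
q_r, k_r = rng.standard_normal(64), rng.standard_normal(64)    # RoPE parts
q = np.concatenate([q_c, q_r])
k = np.concatenate([k_c, k_r])
# score of concatenated vectors = content similarity + positional term
assert np.isclose(q @ k, q_c @ k_c + q_r @ k_r)
```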

Total KV cache: During inference, we cache $\mathbf{c}_t^{KV}$ (for content) and $\mathbf{k}_t^R$ (for position). The total cache per token is:

$$(d_c + d_h^R) \times l \text{ elements}$$

3.5 Concrete Numbers

For DeepSeek-V2:

  • $n_h = 128$ attention heads
  • $d_h = 128$ dimension per head
  • $d_c = 512$ (KV compression dimension $= 4 d_h$)
  • $d_h^R = 64$ (decoupled RoPE dimension $= d_h / 2$)
  • $d_c' = 1536$ (query compression dimension)
  • $l = 60$ layers

KV cache comparison:

| Mechanism | Cache per Token | Relative Size |
|---|---|---|
| MHA | $2 \times 128 \times 128 \times 60 = 1{,}966{,}080$ | 100% |
| GQA (8 groups) | $2 \times 8 \times 128 \times 60 = 122{,}880$ | 6.25% |
| MQA | $2 \times 128 \times 60 = 15{,}360$ | 0.78% |
| MLA | $(512 + 64) \times 60 = 34{,}560$ | 1.76% |

MLA's cache is equivalent to GQA with only 2.25 groups, yet its performance is stronger than MHA. This is the paper's central claim: MLA breaks the tradeoff between cache size and attention quality.

The 93.3% reduction: Compared against an MHA baseline with DeepSeek-V2's own dimensions, the KV cache per token drops by $\left(1 - \frac{34{,}560}{1{,}966{,}080}\right) \times 100\% \approx 98.2\%$. The paper's reported 93.3% reduction accounts for the specific configurations of both DeepSeek 67B and DeepSeek-V2.
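A quick sanity check of the table's arithmetic, using the Section 3.5 configuration:

```python
# Cache-per-token elements for each mechanism, DeepSeek-V2 dimensions.
n_h, d_h, l = 128, 128, 60
d_c, d_h_r = 512, 64

mha = 2 * n_h * d_h * l
gqa8 = 2 * 8 * d_h * l
mqa = 2 * d_h * l
mla = (d_c + d_h_r) * l

assert (mha, gqa8, mqa, mla) == (1_966_080, 122_880, 15_360, 34_560)
assert round(100 * mla / mha, 2) == 1.76          # MLA's relative size
assert (d_c + d_h_r) / (2 * d_h) == 2.25          # "GQA with 2.25 groups"
```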


IV. DeepSeekMoE: The FFN Architecture

4.1 Fine-Grained Expert Segmentation

Standard MoE (e.g., GShard) uses a small number of large experts. DeepSeekMoE segments experts into finer granularity: 160 routed experts (each with intermediate hidden dimension 1536) plus 2 shared experts.

Why finer is better: Smaller experts can specialize more precisely. Instead of one large expert handling "all of mathematics," multiple small experts can specialize in algebra, geometry, calculus, etc. This enables more accurate knowledge acquisition.

4.2 Shared Expert Isolation

DeepSeekMoE isolates a few shared experts that process every token, regardless of routing decisions:

$$\mathbf{h}_t' = \mathbf{u}_t + \sum_{i=1}^{N_s} \text{FFN}_i^{(s)}(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t} \, \text{FFN}_i^{(r)}(\mathbf{u}_t)$$

where $N_s = 2$ shared experts and $N_r = 160$ routed experts (of which $K_r = 6$ are activated per token).

Purpose: Shared experts capture common knowledge that all tokens need (e.g., basic syntax, formatting), while routed experts specialize. This reduces knowledge redundancy among routed experts.
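The shared-plus-routed forward pass above can be sketched as follows; the function name, the tiny `tanh` "experts", and all sizes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def deepseekmoe_ffn(u, shared, routed, centroids, k_r):
    """h'_t = u_t + sum of shared FFNs + weighted sum of top-k_r routed FFNs."""
    out = u.copy()
    for ffn in shared:                       # shared experts see every token
        out = out + ffn(u)
    logits = centroids @ u                   # affinity u^T e_i per routed expert
    s = np.exp(logits - logits.max())
    s /= s.sum()
    for i in np.argsort(s)[-k_r:]:           # top-k_r routed experts only
        out = out + s[i] * routed[i](u)
    return out

rng = np.random.default_rng(0)
d, n_shared, n_routed, k_r = 8, 2, 16, 6
mk = lambda: (lambda x, W=rng.standard_normal((d, d)) * 0.1: np.tanh(W @ x))
shared = [mk() for _ in range(n_shared)]
routed = [mk() for _ in range(n_routed)]
centroids = rng.standard_normal((n_routed, d))
h = deepseekmoe_ffn(rng.standard_normal(d), shared, routed, centroids, k_r)
assert h.shape == (d,)
```

Note the asymmetry: shared experts are applied unconditionally (no gate), while routed experts are both selected and weighted by the router.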

4.3 Routing Mechanism

The routing score for token tt and expert ii:

$$s_{i,t} = \text{Softmax}_i(\mathbf{u}_t^T \mathbf{e}_i)$$

where $\mathbf{e}_i$ is the centroid of the $i$-th expert. The top-$K_r$ experts are selected, and their outputs are weighted by their routing scores.

4.4 Device-Limited Routing

With 160 experts distributed across multiple GPUs, a token might need to be sent to many devices, creating expensive communication. DeepSeek-V2 limits each token to experts on at most $M = 3$ devices:

  1. First select the $M$ devices with the highest-scoring experts
  2. Then perform top-$K_r$ selection only among experts on those devices

This bounds communication costs while maintaining nearly the same performance as unrestricted routing (empirically, $M \geq 3$ is sufficient).
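The two-step selection can be sketched as below; `device_limited_topk`, the striped expert-to-device layout, and all sizes are illustrative assumptions:

```python
import numpy as np

def device_limited_topk(scores, expert_device, m_devices, k):
    """Top-k expert selection restricted to experts on the M best devices."""
    n_dev = expert_device.max() + 1
    # Step 1: score each device by its single best expert, keep the best M devices.
    dev_best = np.array([scores[expert_device == d].max() for d in range(n_dev)])
    allowed = np.argsort(dev_best)[-m_devices:]
    # Step 2: ordinary top-k, but only among experts living on allowed devices.
    masked = np.where(np.isin(expert_device, allowed), scores, -np.inf)
    return np.sort(np.argsort(masked)[-k:])

rng = np.random.default_rng(0)
n_experts, n_dev = 16, 4
expert_device = np.arange(n_experts) % n_dev     # experts striped over devices
scores = rng.standard_normal(n_experts)
chosen = device_limited_topk(scores, expert_device, m_devices=2, k=4)
assert len(chosen) == 4
assert len(np.unique(expert_device[chosen])) <= 2  # token touches <= M devices
```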

4.5 Load Balancing: Three Auxiliary Losses

Unbalanced routing can cause "routing collapse" (some experts are never used) and computation waste. DeepSeek-V2 uses three complementary loss terms:

Expert-level balance loss (prevents routing collapse):

$$\mathcal{L}_{\text{ExpBal}} = \alpha_1 \sum_{i=1}^{N_r} f_i P_i$$

where $f_i$ is the fraction of tokens routed to expert $i$, and $P_i$ is the average routing probability for expert $i$.

Device-level balance loss (ensures balanced computation across GPUs):

$$\mathcal{L}_{\text{DevBal}} = \alpha_2 \sum_{i=1}^{D} f_i' P_i'$$

where the sum is over devices rather than individual experts.

Communication balance loss (ensures balanced inter-device data transfer):

$$\mathcal{L}_{\text{CommBal}} = \alpha_3 \sum_{i=1}^{D} f_i'' P_i''$$

where $f_i''$ measures the fraction of tokens sent to device $i$.

The hyperparameters are set to $\alpha_1 = 0.003$, $\alpha_2 = 0.05$, $\alpha_3 = 0.02$.
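One plausible formulation of the expert-level loss, with $f_i$ normalized so that perfectly uniform routing gives $\sum_i f_i P_i = 1$ (i.e. $f_i = \frac{N_r}{K_r T}\sum_t \mathbb{1}[\text{token } t \text{ selects } i]$); the shapes and the toy round-robin router are assumptions for illustration:

```python
import numpy as np

def expert_balance_loss(probs, mask, alpha):
    """alpha * sum_i f_i * P_i.  probs: (T, N) routing probabilities;
    mask: (T, N) with 1 where an expert is in a token's top-K."""
    T, N = probs.shape
    K = mask.sum(axis=1).mean()              # experts activated per token
    f = (N / (K * T)) * mask.sum(axis=0)     # normalized load fraction per expert
    P = probs.mean(axis=0)                   # mean routing probability per expert
    return alpha * (f * P).sum()

T, N, alpha = 1000, 8, 0.003
probs = np.full((T, N), 1.0 / N)             # perfectly uniform router
mask = np.zeros((T, N))
idx = np.arange(T)
mask[idx, idx % N] = 1                       # round-robin top-2 assignment
mask[idx, (idx + 1) % N] = 1
loss = expert_balance_loss(probs, mask, alpha)
assert np.isclose(loss, alpha)               # uniform routing gives loss = alpha
```

Skewed routing concentrates both $f$ and $P$ on the same experts, so the product sum (and the loss) grows, pushing the router back toward balance.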

4.6 Token-Dropping Strategy

As a complementary mechanism, DeepSeek-V2 drops tokens with the lowest affinity scores on each device when the load exceeds a capacity factor of 1.0. Approximately 10% of training sequences are protected from dropping to maintain consistency. During inference, no tokens are dropped.


V. Training and Infrastructure

5.1 Pre-Training Data

  • 8.1T tokens from a high-quality, multi-source corpus
  • The corpus contains approximately 12% more Chinese tokens than English tokens
  • Tokenizer: Byte-level BPE with 100K vocabulary (same as DeepSeek 67B)
  • Data quality improvements: recovered mistakenly deleted data, enriched with more Chinese data, improved quality-based filtering

5.2 Training Configuration

  • 60 Transformer layers, hidden dimension 5120
  • First layer uses a dense FFN; all subsequent layers use MoE
  • Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay = 0.1)
  • Learning rate: warmup-and-step-decay schedule; max LR $= 2.4 \times 10^{-4}$
  • Batch size: Gradually increased from 2304 to 9216 over the first 225B tokens
  • Sequence length: 4096 during pre-training

5.3 Infrastructure

Built on HAI-LLM, an internal training framework featuring:

  • 16-way zero-bubble pipeline parallelism
  • 8-way expert parallelism (each MoE layer's routed experts spread across 8 devices)
  • ZeRO-1 data parallelism
  • No tensor parallelism needed (due to small activated parameter count)
  • Shared expert computation overlapped with expert parallel all-to-all communication

5.4 Long Context Extension

After pre-training at 4K context, the model is extended to 128K using YaRN (Yet another RoPE extensioN method):

  • Additional training on 1000 steps with sequence length 32K
  • RoPE base frequency increased from 10,000 to 160,000
  • Scaling factor $s = 40$, targeting 128K context length

VI. Evaluation Results

6.1 Pre-Training Performance

DeepSeek-V2 (21B activated parameters) vs. other open-source models:

| Benchmark | DeepSeek-V2 | DeepSeek 67B | LLaMA 3 70B | Mixtral 8×22B |
|---|---|---|---|---|
| MMLU | 78.5 | 71.3 | 79.5 | 77.8 |
| BBH | 78.9 | 68.7 | 81.0 | 78.9 |
| HumanEval | 48.8 | 45.1 | 48.2 | 46.3 |
| MATH | 43.6 | 18.7 | 42.5 | 41.7 |
| GSM8K | 79.2 | 63.4 | 83.0 | 78.6 |

Despite having only 21B activated parameters (vs. 70B for LLaMA 3 70B), DeepSeek-V2 achieves competitive performance, especially excelling in math and code.

6.2 Efficiency Gains

Compared to DeepSeek 67B:

| Metric | Improvement |
|---|---|
| Training cost | 42.5% reduction |
| KV cache | 93.3% reduction |
| Maximum generation throughput | 5.76× increase |

6.3 Alignment Performance

After SFT (1.5M conversations) and RL (GRPO — Group Relative Policy Optimization):

| Benchmark | DeepSeek-V2 Chat (RL) |
|---|---|
| AlpacaEval 2.0 (LC win rate) | 38.9% |
| MT-Bench | 8.97 / 10 |
| AlignBench (Chinese) | 7.91 / 10 |

DeepSeek-V2 Chat (RL) achieves top-tier performance among open-source chat models, and in Chinese, even outperforms most closed-source models.

6.4 Ablation: MLA vs. MHA

The paper provides an ablation comparing MLA directly to MHA (Appendix D.2). In a 16B model:

  • MLA achieves lower validation loss (better performance) than MHA with the same training compute
  • MLA is also better than GQA and MQA at all group configurations

This confirms that MLA is not just about efficiency — it actually improves model quality, likely because the low-rank constraint acts as a beneficial inductive bias that encourages the model to learn more structured, generalizable key-value representations.


VII. Limitations and Discussion

7.1 Architectural Limitations

  1. Fixed compression dimension: The choice of $d_c = 512$ is a hyperparameter. Too small loses information; too large wastes memory. Adaptive compression could be explored.
  2. RoPE workaround: The decoupled RoPE strategy adds complexity and a small amount of extra cache ($d_h^R l$ per token). Alternative position encoding methods might integrate more naturally with low-rank compression.
  3. Expert parallelism overhead: Despite device-limited routing, the communication costs of MoE models remain non-trivial at scale. The three auxiliary losses add tuning complexity.

7.2 Training Limitations

  1. Closed training data: The 8.1T token corpus details are vague. Reproducibility is limited by the undisclosed data pipeline.
  2. Evaluation scope: While broad, some evaluations may favor the model's training distribution (especially Chinese benchmarks).
  3. Alignment methodology: The paper uses GRPO for RL alignment but provides limited details on the reward model or preference data.

7.3 Scaling Questions

  • How does MLA scale to even larger models (e.g., the subsequent DeepSeek-V3)?
  • Is the optimal $d_c / (n_h d_h)$ ratio consistent across model sizes?
  • Can MLA be combined with other efficiency techniques (quantization, speculative decoding)?

VIII. Impact and Significance

8.1 Why MLA Matters

MLA represents a paradigm shift in attention mechanism design:

  1. Breaks the KV cache vs. quality tradeoff: Previous methods (GQA, MQA) sacrificed quality for efficiency. MLA achieves both simultaneously by exploiting the low-rank structure of key-value representations.

  2. Practical impact on deployment: A 93.3% KV cache reduction means dramatically higher throughput, larger batch sizes, and longer context support—directly reducing inference costs.

  3. Conceptual contribution: The idea that keys and values across heads are highly correlated (and thus compressible) is a fundamental insight about how attention works in practice.

8.2 Influence on Subsequent Work

MLA has been adopted and extended in:

  • DeepSeek-V3: The successor model, which scales MLA to 671B total parameters
  • DeepSeek-R1: The reasoning model, confirming MLA's robustness across tasks
  • Community implementations: MLA has inspired research into efficient attention mechanisms, with several groups exploring variants

8.3 MLA in the Broader Efficiency Landscape

MLA complements other efficiency techniques:

| Technique | What It Optimizes | Compatible with MLA? |
|---|---|---|
| Quantization (INT8/INT4) | Memory per element | ✅ Yes — compress the latent vector further |
| Flash Attention | Compute (IO-aware) | ✅ Yes — applies to the attention computation |
| Speculative decoding | Latency | ✅ Yes — smaller KV cache speeds up verification |
| GQA/MQA | KV cache (by head sharing) | ❌ Replaced by MLA |
| KV cache eviction | Memory (by dropping old KV) | ✅ Yes — but MLA reduces the need |

IX. Reproducibility

| Criterion | Assessment |
|---|---|
| Code availability | ✅ Model checkpoints at github.com/deepseek-ai/DeepSeek-V2 |
| Architecture details | ✅ Complete formulas in paper and appendix |
| Hyperparameters | ✅ All model and training hyperparameters specified |
| Training data | ❌ 8.1T token corpus not publicly released; composition not detailed |
| Training infrastructure | ⚠️ Internal framework (HAI-LLM); general approach described |
| Ablation studies | ✅ MHA vs. GQA vs. MQA vs. MLA comparisons provided |
| Evaluation details | ✅ Benchmark settings and prompts specified |
| Reproducibility risk | High — training data and infrastructure are proprietary |

The architecture is well-documented and reproducible. The training process is not, due to proprietary data and infrastructure. However, the model checkpoints are publicly available, enabling downstream use and evaluation.


X. Summary: Key Takeaways

  1. MLA's core trick: Compress keys and values into a shared low-rank latent vector $\mathbf{c}_t^{KV}$. Cache only this small vector instead of full keys and values. Absorb the decompression matrices into query/output projections to avoid even computing full keys/values during inference.

  2. Decoupled RoPE solves the position encoding problem: By carrying position information in a separate, small vector ($\mathbf{k}_t^R$), MLA avoids the incompatibility between RoPE and low-rank compression. This is a clever engineering solution to a real mathematical obstacle.

  3. 93.3% KV cache reduction with better performance: MLA caches only $(d_c + d_h^R) l$ elements per token, equivalent to GQA with 2.25 groups, yet outperforms full MHA. The low-rank constraint likely acts as a beneficial regularizer.

  4. DeepSeekMoE enables economical scaling: Fine-grained expert segmentation (160 experts, 6 activated) with shared experts achieves better performance than coarse-grained MoE at equivalent cost. Device-limited routing and three-tier load balancing ensure practical efficiency.

  5. The numbers are impressive: 236B parameters, 21B activated, 8.1T training tokens, 128K context, 42.5% training cost reduction, 5.76× throughput improvement — all while matching or exceeding 70B-class dense models.

  6. MLA is the ancestor of DeepSeek-V3 and R1: This architecture proved so effective that it became the foundation for DeepSeek's subsequent models, which have pushed the frontier of open-source LLMs.

  7. The key lesson: In attention mechanisms, the assumption that each head needs independent keys and values is wasteful. The real information content of key-value representations is much lower-dimensional than the raw vector space suggests, and exploiting this structure yields dramatic efficiency gains with no quality penalty.