
Mamba: Linear-Time Sequence Modeling with Selective State Spaces — In-Depth Technical Review

1. What This Paper Does

Mamba introduces a selective state space model (selective SSM) that, for the first time, achieves Transformer-quality performance on language modeling while scaling linearly in sequence length — not quadratically like standard attention. The key insight is deceptively simple: make the parameters of a state space model depend on the input, so the model can choose what to remember and what to forget at each timestep. This seemingly small change has profound consequences: it breaks the mathematical equivalence between SSMs and convolutions that prior work relied on for efficiency, forcing the authors to invent a new hardware-aware parallel algorithm. The result is Mamba, a clean architecture with no attention and no MLP blocks, that runs at 5× the inference throughput of Transformers and matches or exceeds their quality across language, audio, and genomics.

This is not an incremental improvement. Mamba represents a genuine paradigm shift: the first credible alternative to the Transformer as a general-purpose sequence model backbone.


2. Prerequisites: Everything You Need to Understand This Paper

Before diving into Mamba's innovations, we need to build up several layers of background knowledge. If you already know these topics, feel free to skip ahead; if not, this section will give you everything you need.

2.1 Sequence Modeling: The Core Problem

At the heart of modern deep learning is the problem of sequence modeling: given a sequence of inputs (words, audio samples, DNA bases, pixels), produce a sequence of outputs. The dominant approach for the past 7+ years has been the Transformer (Vaswani et al., 2017), which uses self-attention to allow every element in the sequence to "look at" every other element.

Self-attention is powerful because it can route information freely between any two positions. But it has a fundamental cost: for a sequence of length L, attention requires O(L²) time and memory, because every pair of positions must interact. For a 1,000-token sequence, that's 1 million pairs. For 100,000 tokens, it's 10 billion pairs. This quadratic scaling is the central bottleneck of Transformers.

2.2 Recurrent Neural Networks (RNNs)

Before Transformers, the dominant sequence models were recurrent neural networks (RNNs). An RNN processes a sequence one element at a time, maintaining a hidden state h_t that summarizes everything it has seen so far:

h_t = f(h_{t−1}, x_t)

The advantage: inference is O(1) per step (constant time), and training is O(L) (linear in sequence length). The disadvantage: the hidden state is a fixed-size vector, so the model must compress all past information into this finite bottleneck. In practice, vanilla RNNs struggle to remember information over long distances — the famous "vanishing gradient" problem.

LSTM and GRU cells partially addressed this with gating mechanisms: learned gates that control what information flows into, stays in, or leaves the hidden state. A gate g_t ∈ [0, 1] acts as a soft switch:

h_t = (1 − g_t) · h_{t−1} + g_t · x_t

When g_t ≈ 0, the state is preserved (ignoring the input). When g_t ≈ 1, the state is reset (focusing on the input). This gating concept will reappear centrally in Mamba.
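As a concrete illustration, here is a minimal NumPy sketch of this gated update (the scalar gate weight and bias are illustrative, not from any trained model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_recurrence(x, w, b):
    """Scalar gated RNN: the gate g_t decides at each step whether to
    keep the old state or overwrite it with the current input."""
    h = 0.0
    states = []
    for x_t in x:
        g_t = sigmoid(w * x_t + b)       # g_t in (0, 1)
        h = (1.0 - g_t) * h + g_t * x_t  # soft switch between h and x_t
        states.append(h)
    return np.array(states)

x = np.array([1.0, 2.0, 3.0])
# Strongly negative bias -> gate near 0: the state is preserved (input ignored).
near_frozen = gated_recurrence(x, w=0.0, b=-10.0)
# Strongly positive bias -> gate near 1: the state tracks the input.
near_reset = gated_recurrence(x, w=0.0, b=+10.0)
```

With the gate pinned near 1, the state simply follows the input sequence; with it pinned near 0, the state barely moves.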

2.3 State Space Models (SSMs): The Mathematical Foundation

A state space model is a classical mathematical framework from control theory (Kalman, 1960). It describes a system that evolves through time with a hidden state:

Continuous form:

h′(t) = A h(t) + B x(t)

y(t) = C h(t)

Here, x(t) ∈ ℝ is the input signal, h(t) ∈ ℝ^N is the hidden state (with N dimensions), and y(t) ∈ ℝ is the output. The matrices A ∈ ℝ^{N×N}, B ∈ ℝ^{N×1}, and C ∈ ℝ^{1×N} define the system dynamics.

Think of this like a physical system: AA governs how the internal state evolves on its own, BB controls how external input enters the state, and CC reads out the output from the state.

2.4 Discretization: From Continuous to Discrete

Real data comes in discrete chunks (tokens, samples), so we need to convert the continuous SSM to discrete time. This is done via discretization, controlled by a step size parameter Δ:

Using the zero-order hold (ZOH) rule:

Ā = exp(ΔA)

B̄ = (ΔA)^{−1}(exp(ΔA) − I) · ΔB

The discretized system then operates as a standard recurrence:

h_t = Ā h_{t−1} + B̄ x_t

y_t = C h_t

The parameter Δ is crucial: it controls the "resolution" of discretization. A large Δ means the system takes big steps (focusing on the current input), while a small Δ means tiny steps (the input barely affects the state). This will become Mamba's primary selection lever.
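To make the discretization concrete, here is a small NumPy sketch using a diagonal A (as in S4D and Mamba), for which the ZOH formulas reduce to elementwise operations; the parameter values are illustrative:

```python
import numpy as np

def zoh_discretize(A_diag, B, delta):
    """Zero-order-hold discretization for a diagonal A:
    A_bar = exp(delta*A);  B_bar = (delta*A)^{-1}(exp(delta*A) - 1) * delta*B,
    which simplifies elementwise to (A_bar - 1)/A * B (the deltas cancel)."""
    A_bar = np.exp(delta * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

def run_ssm(A_bar, B_bar, C, x):
    """Discrete recurrence h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C h_t."""
    h = np.zeros_like(A_bar)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

A = np.array([-1.0, -2.0])   # diagonal of A (negative => stable dynamics)
B = np.array([1.0, 1.0])
C = np.array([0.5, 0.5])
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
y = run_ssm(A_bar, B_bar, C, x=np.ones(50))
```

Driven by a constant input, the state settles to −B/A per dimension, so the output converges toward C · (−B/A) = 0.75 here, a quick consistency check on the discretization.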

2.5 The Dual View: Recurrence and Convolution

A remarkable property of discrete SSMs is that the recurrence can be "unrolled" into a global convolution:

K̄ = (C B̄, C Ā B̄, C Ā² B̄, …)

y = x ∗ K̄

This convolution kernel K̄ is entirely determined by the fixed parameters (Ā, B̄, C). This dual view is powerful:

  • Training: Use the convolution form (parallelizable on GPUs, efficient FFT-based computation)
  • Inference: Use the recurrence form (process one token at a time, constant memory)

This is the foundation of the S4 model (Gu et al., 2022) and all its variants, which achieved remarkable results on long-range benchmarks. But there's a catch...
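The dual view can be checked numerically: running the recurrence and convolving with the unrolled kernel produce identical outputs (the diagonal Ā and random parameters below are illustrative):

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, x):
    """Step-by-step recurrence view (what inference uses)."""
    h = np.zeros(len(A_bar))
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_convolution(A_bar, B_bar, C, x):
    """Global convolution view (what LTI training uses). Valid only
    because the parameters are fixed across timesteps."""
    L = len(x)
    K = np.array([C @ (A_bar**k * B_bar) for k in range(L)])  # K[k] = C A^k B
    # Causal convolution: y_t = sum_k K[k] * x_{t-k}
    return np.array([np.dot(K[:t + 1][::-1], x[:t + 1]) for t in range(L)])

rng = np.random.default_rng(0)
A_bar = np.array([0.9, 0.5])
B_bar = np.array([1.0, 0.3])
C = np.array([0.2, 0.7])
x = rng.standard_normal(16)
y_rec = ssm_recurrent(A_bar, B_bar, C, x)
y_conv = ssm_convolution(A_bar, B_bar, C, x)
# The two views agree to floating-point precision.
```

(The production version computes the convolution via FFT rather than the O(L²) loop used here for clarity.)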

2.6 Linear Time Invariance (LTI): The Critical Limitation

The convolution trick works only when the parameters (A, B, C, Δ) are constant across all timesteps. This property is called Linear Time Invariance (LTI). An LTI system treats every input the same way — it cannot adapt its behavior based on what it's currently seeing.

This is fine for continuous signals like audio waveforms, where the dynamics are smooth and uniform. But for discrete, information-dense data like text, LTI is crippling. Consider: when reading "The capital of France is ___", the model needs to pay special attention to "France" and recall "Paris" — but an LTI system processes "France" with exactly the same dynamics as "the" or "of" or "is". It cannot selectively focus.

2.7 GPU Memory Hierarchy

To understand Mamba's hardware-aware algorithm, we need to know about GPU memory:

  • HBM (High Bandwidth Memory): The main GPU memory (e.g., 80 GB on A100). Large but relatively slow to access.
  • SRAM (Static RAM): On-chip memory (e.g., 20 MB on A100). Tiny but extremely fast — roughly 10-100× faster than HBM.
  • Registers: Even smaller and faster, within each streaming multiprocessor.

Most operations in deep learning (except matrix multiplication) are memory-bound: their speed is limited by how fast data can be moved between HBM and SRAM, not by actual computation. This insight, pioneered in FlashAttention (Dao et al., 2022), is central to Mamba's design.

2.8 Parallel Scan

A scan (also called prefix sum) is an operation that computes cumulative results over a sequence. For addition: given [a_1, a_2, a_3, a_4], the scan produces [a_1, a_1+a_2, a_1+a_2+a_3, a_1+a_2+a_3+a_4].

Scans appear sequential, but there's a classical parallel algorithm (Blelloch, 1990) that computes them in O(log L) parallel steps while doing O(L) total work. This is called a work-efficient parallel scan. It works for any associative operation — including the matrix recurrence in SSMs. This algorithm is how Mamba avoids the sequential bottleneck of recurrence.
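Here is a toy version of this idea applied to the affine recurrence h_t = a_t·h_{t−1} + b_t (a simple recursive formulation assuming a power-of-two length, not the in-place GPU variant):

```python
import numpy as np

def combine(f, g):
    """Compose affine maps h -> a*h + b: apply f first, then g.
    This operation is associative, which is all a scan needs."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)

def parallel_scan(pairs):
    """Inclusive scan with O(log L) levels of parallelizable work.
    Assumes len(pairs) is a power of two."""
    n = len(pairs)
    if n == 1:
        return pairs
    # Combine adjacent pairs, scan the half-length problem, then expand.
    evens = [combine(pairs[2 * i], pairs[2 * i + 1]) for i in range(n // 2)]
    partial = parallel_scan(evens)
    out = []
    for i, p in enumerate(pairs):
        if i == 0:
            out.append(p)
        elif i % 2 == 1:
            out.append(partial[i // 2])
        else:
            out.append(combine(partial[i // 2 - 1], p))
    return out

rng = np.random.default_rng(1)
a = rng.uniform(0.5, 1.0, 8)
b = rng.standard_normal(8)
# Sequential reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0
h, seq = 0.0, []
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    seq.append(h)
# Parallel scan gives the same states: applying the cumulative affine
# map to h_{-1} = 0 leaves just its additive part.
h_parallel = [bb for (aa, bb) in parallel_scan(list(zip(a, b)))]
```

The only property the algorithm uses is associativity of `combine`, which is why the same trick applies to the (diagonal) matrix recurrence in selective SSMs.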

2.9 Prior SSM Architectures

Several architectures have been built on top of SSMs:

  • H3 (Fu et al., 2023): Sandwiches an S4 layer between two gating connections, plus a local convolution. Interleaved with MLP blocks.
  • Hyena (Poli et al., 2023): Same architecture as H3 but replaces S4 with an MLP-parameterized global convolution.
  • RetNet (Sun et al., 2023): Adds gates and uses a simpler SSM with a multi-head attention variant for parallel computation.
  • RWKV (Peng et al., 2023): An RNN for language modeling based on linear attention approximation.

All of these are LTI — none can perform content-based selection. This is the gap Mamba fills.


3. The Selection Mechanism: Mamba's Core Innovation

3.1 Motivation: Selection as Compression

Gu and Dao frame the fundamental problem of sequence modeling as compression. Every model must compress context into some representation:

  • Attention: No compression at all. Stores the entire context in the KV cache. Maximally expressive but maximally expensive (O(L²) time over the sequence, and a cache that grows as O(L) during inference).
  • Recurrent models: Compress everything into a fixed-size state. Maximally efficient (O(1) time and memory per step) but limited by how well the compression works.

The key insight: the quality of compression depends on the ability to select what to keep and what to discard. A model that cannot distinguish important tokens from noise will waste its finite state capacity on irrelevant information.

3.2 Two Diagnostic Tasks

The paper uses two synthetic tasks to illustrate why LTI models fail:

Selective Copying: A modification of the classic copying task where tokens to memorize appear at random positions, interspersed with noise tokens. An LTI model (global convolution) can solve the standard copying task by learning a fixed-delay kernel, but fails at selective copying because the delays are variable — the model needs to look at the content to decide what to copy.

Induction Heads: Given a sequence like "...A B ... A ?", the model must predict "B". This requires associative recall: recognizing that the current context matches a previous pattern and retrieving the associated value. This is a core mechanism behind in-context learning in LLMs.

Both tasks require content-aware reasoning — exactly what LTI models cannot do.

3.3 Making SSM Parameters Input-Dependent

The solution is remarkably simple: make the SSM parameters functions of the input. Specifically, instead of having fixed B, C, and Δ across all timesteps, Mamba computes them from the current input at each step:

S4 (time-invariant):

  • A ∈ ℝ^{D×N}: fixed parameter
  • B ∈ ℝ^{D×N}: fixed parameter
  • C ∈ ℝ^{D×N}: fixed parameter
  • Δ ∈ ℝ^D: fixed (after applying τ_Δ to a learned parameter)

S6 / Selective SSM (time-varying):

  • A ∈ ℝ^{D×N}: still a fixed parameter (selectivity through Δ is sufficient)
  • B ∈ ℝ^{B×L×N}: computed as s_B(x) = Linear_N(x) — a learned projection of the input
  • C ∈ ℝ^{B×L×N}: computed as s_C(x) = Linear_N(x) — a learned projection of the input
  • Δ ∈ ℝ^{B×L×D}: computed as τ_Δ(Parameter + s_Δ(x)), where s_Δ(x) = Broadcast_D(Linear_1(x)) and τ_Δ = softplus

The crucial dimensions: B and C now have a sequence length dimension L, meaning they vary at every timestep. The model has shifted from time-invariant to time-varying.
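A shape-level NumPy sketch of these projections (single sequence, no batching; the random weights and the Euler-style simplification of B̄ are illustrative, not the authors' CUDA kernel):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x, A, W_B, W_C, W_dt, dt_bias):
    """Minimal S6 sketch. x: (L, D) inputs; A: (D, N) fixed;
    W_B, W_C: (D, N) projections producing per-timestep B_t, C_t;
    W_dt: (D, D) projection producing per-channel step sizes."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    ys = np.zeros((L, D))
    for t in range(L):
        B_t = x[t] @ W_B                       # (N,) input-dependent B
        C_t = x[t] @ W_C                       # (N,) input-dependent C
        dt = softplus(x[t] @ W_dt + dt_bias)   # (D,) input-dependent step
        A_bar = np.exp(dt[:, None] * A)        # (D, N) ZOH for A
        B_bar = dt[:, None] * B_t[None, :]     # (D, N) simplified B-bar
        h = A_bar * h + B_bar * x[t][:, None]  # time-varying recurrence
        ys[t] = h @ C_t
    return ys

rng = np.random.default_rng(2)
L, D, N = 6, 4, 8
x = rng.standard_normal((L, D))
A = -np.arange(1, N + 1, dtype=float) * np.ones((D, N))  # S4D-Real-style
y = selective_scan(x, A,
                   rng.standard_normal((D, N)) * 0.1,
                   rng.standard_normal((D, N)) * 0.1,
                   rng.standard_normal((D, D)) * 0.1,
                   np.zeros(D))
```

Note that because B_t, C_t, and dt depend on x[t], the effective transition changes at every step — this loop cannot be replaced by a fixed convolution kernel.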

3.4 What Each Selective Parameter Does

Δ (the most important): Controls the balance between remembering (small Δ, persist the state) and resetting (large Δ, focus on the current input). When Δ is input-dependent, the model can decide token-by-token whether to attend to or ignore the current input. This directly generalizes RNN gating (Theorem 1 in the paper).

Theorem 1 (Gating Connection): When N = 1, A = −1, B = 1, and Δ is input-dependent with softplus activation, the selective SSM reduces exactly to a gated RNN:

g_t = σ(Linear(x_t))

h_t = (1 − g_t) h_{t−1} + g_t x_t

This shows that classical RNN gating is a special case of selective SSMs.
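The reduction can be verified numerically: with ZOH discretization, Ā = exp(−Δ) and B̄ = 1 − exp(−Δ), and the identity 1 − exp(−softplus(z)) = σ(z) makes the two updates coincide exactly (the projection weights below are illustrative):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
x = rng.standard_normal(32)
w, b = 0.7, -0.2  # an illustrative Linear(x_t) for computing delta

h_ssm, h_gate = 0.0, 0.0
for x_t in x:
    z = w * x_t + b
    # Selective SSM with N=1, A=-1, B=1 and ZOH discretization:
    #   A_bar = exp(-dt), B_bar = 1 - exp(-dt), dt = softplus(z)
    dt = softplus(z)
    h_ssm = np.exp(-dt) * h_ssm + (1.0 - np.exp(-dt)) * x_t
    # Gated RNN: g = sigmoid(z). Since 1 - exp(-softplus(z)) = sigmoid(z),
    # this is the identical update.
    g = sigmoid(z)
    h_gate = (1.0 - g) * h_gate + g * x_t
# h_ssm and h_gate agree to floating-point precision.
```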

B (input gate): Controls whether the current input x_t is admitted into the hidden state h_t. Making B input-dependent allows fine-grained filtering: the model can close the gate on noise tokens.

C (output gate): Controls whether the hidden state contributes to the current output. Making C input-dependent allows the model to modulate its output based on what it currently sees.

A (deliberately not made selective): While A could also be input-dependent, the authors note that A only affects the model through its interaction with Δ via Ā = exp(ΔA). Since Δ is already selective, making A selective too would be redundant.

3.5 Three Practical Effects of Selection

  1. Variable Spacing: The model can skip irrelevant tokens (noise, filler words like "um") by setting g_t → 0, effectively compressing the sequence to only its informative parts.

  2. Filtering Context: Unlike LTI models that accumulate everything indiscriminately, selective models can ignore irrelevant context, explaining why they actually improve with longer sequences (while LTI models often plateau or degrade).

  3. Boundary Resetting: When processing concatenated documents (common in training), the model can reset its state at document boundaries by setting Δ_t → ∞ (equivalently, g_t → 1), preventing information leakage between unrelated sequences.


4. The Hardware-Aware Algorithm: Making Selection Efficient

4.1 The Efficiency Problem

The selection mechanism creates a fundamental computational challenge. Prior SSMs could use the convolution form for efficient parallel training. But the convolution trick requires LTI parameters — once the parameters vary with time, the convolution kernel changes at every step, and the trick breaks.

The naive approach would be to compute the recurrence sequentially: process each timestep one at a time, updating the state. This would work but be extremely slow on GPUs, which are designed for parallel computation.

Furthermore, the time-varying parameters create a memory problem. The full state tensor has shape (B, L, D, N) — batch × sequence length × model dimension × state dimension. With N = 16, this is 16× larger than the input, and materializing it in GPU memory would be prohibitive for long sequences.

4.2 Three Classical Techniques

Mamba's solution combines three well-known techniques in a novel way:

1. Kernel Fusion: Instead of computing each operation separately (discretization, then recurrence, then output), which requires writing intermediate results to slow HBM, Mamba fuses everything into a single GPU kernel. The SSM parameters (Δ, A, B, C) are loaded from HBM to fast SRAM once, all computation (discretization → recurrence → output) happens in SRAM, and only the final output of shape (B, L, D) is written back to HBM.

2. Parallel Scan: The recurrence h_t = Ā_t h_{t−1} + B̄_t x_t is technically sequential, but because each step is an affine map (and affine maps compose associatively), it can be parallelized using Blelloch's (1990) work-efficient parallel scan algorithm. This reduces the sequential depth from O(L) to O(log L).

3. Recomputation: During the forward pass, intermediate states are not saved to memory. During the backward pass, they are recomputed on-the-fly by reloading the inputs from HBM to SRAM and re-running the scan. This trades a small amount of extra computation for a large memory savings — the same trick used in FlashAttention and gradient checkpointing.
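The recomputation idea can be illustrated on a scalar linear recurrence: the forward pass keeps only the final state, and the backward pass regenerates the intermediate states it needs with a second sweep (a CPU toy under those assumptions, not the fused kernel):

```python
import numpy as np

def forward(a, x):
    """Run h_t = a_t * h_{t-1} + x_t, keeping only the final state.
    Intermediate states are deliberately NOT saved."""
    h = 0.0
    for a_t, x_t in zip(a, x):
        h = a_t * h + x_t
    return h

def backward_recompute(a, x, grad_out):
    """Gradient of the final state w.r.t. each a_t. The chain rule needs
    every h_{t-1}, which was never stored, so we regenerate the states
    with an extra forward sweep (more compute, no forward-pass storage)."""
    L = len(a)
    h_prev, states_before = 0.0, np.zeros(L)
    for t in range(L):                 # recomputation sweep
        states_before[t] = h_prev
        h_prev = a[t] * h_prev + x[t]
    grads_a, acc = np.zeros(L), grad_out
    for t in range(L - 1, -1, -1):     # reverse-mode sweep
        grads_a[t] = acc * states_before[t]  # d h_final / d a_t
        acc *= a[t]                          # chain through h_t = a_t h_{t-1} + x_t
    return grads_a

a = np.array([0.9, 0.8, 0.7])
x = np.array([1.0, 2.0, 3.0])
g = backward_recompute(a, x, grad_out=1.0)  # -> [0.0, 0.7, 2.8]
```

For example, d h_final / d a_2 = h_1 = 0.8·1 + 2 = 2.8, which the reverse sweep recovers without the forward pass having stored h_1.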

4.3 Computational Complexity

The resulting algorithm achieves:

  • Time: O(BLDN) — linear in sequence length L, proportional to state dimension N
  • Memory: O(BLD) — same as the input/output, not the expanded state
  • I/O: Dominated by a single read/write of the input/output from HBM

This is actually faster than the convolution-based approach of prior SSMs, which required O(BLD log L) time due to the FFT. The fused scan is also up to 3× faster than prior methods on A100 GPUs.

4.4 Speed Benchmarks

Concrete numbers from the paper:

  • The fused selective scan is 20-40× faster than a standard PyTorch scan implementation
  • It is faster than FlashAttention-2 for sequences longer than 2K tokens
  • End-to-end, Mamba achieves 4-5× higher inference throughput than a Transformer of similar size
  • A Mamba-6.9B model has higher inference throughput than a Transformer-1.3B (5× smaller!)

The inference advantage comes from Mamba's recurrent nature: it doesn't need a KV cache, so it can use much larger batch sizes for the same GPU memory.


5. The Mamba Architecture

5.1 Block Design

Prior SSM architectures (like H3) interleaved two types of blocks: an SSM block (inspired by linear attention) and an MLP block. Mamba simplifies this by combining both into a single block:

The Mamba Block:

  1. Input x ∈ ℝ^{B×L×D}
  2. Two parallel linear projections expand the dimension by a factor E = 2:
    • Branch A: Linear → Conv1d → SiLU activation → Selective SSM → output
    • Branch B: Linear → SiLU activation → (used as a multiplicative gate)
  3. Element-wise multiply of Branch A and Branch B (gating)
  4. Linear projection back to dimension D
  5. Add residual connection and LayerNorm

This is inspired by the Gated Attention Unit (GAU) which similarly merged attention with MLP. The SiLU/Swish activation makes the gated linear unit equivalent to the popular SwiGLU variant used in LLaMA and PaLM.
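A shape-level NumPy sketch of this wiring, with the selective SSM stubbed out so the block structure stays visible (all weights are illustrative random values):

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def mamba_block(x, W_in_a, W_in_b, W_conv, W_out):
    """One Mamba block (E=2 expansion), single sequence, no batching.
    The selective SSM step is omitted here to highlight the block wiring."""
    L, D = x.shape
    a = x @ W_in_a                 # (L, 2D) branch with conv (+ SSM)
    b = x @ W_in_b                 # (L, 2D) gating branch
    # Depthwise causal conv1d (kernel size k), per channel:
    k = W_conv.shape[0]
    a_pad = np.vstack([np.zeros((k - 1, a.shape[1])), a])
    a = np.stack([(a_pad[t:t + k] * W_conv).sum(axis=0) for t in range(L)])
    a = silu(a)
    # (the selective SSM would transform `a` here; stubbed out in this sketch)
    y = a * silu(b)                # multiplicative gate (SwiGLU-style)
    return x + y @ W_out           # project back to D, residual add

rng = np.random.default_rng(4)
L, D, E = 8, 16, 2
out = mamba_block(rng.standard_normal((L, D)),
                  rng.standard_normal((D, E * D)) * 0.05,
                  rng.standard_normal((D, E * D)) * 0.05,
                  rng.standard_normal((4, E * D)) * 0.05,  # depthwise conv kernel
                  rng.standard_normal((E * D, D)) * 0.05)
```

The output keeps the input shape (L, D), so the block can be stacked homogeneously, exactly as the architecture prescribes.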

5.2 Parameter Count

For each Mamba block with expansion factor E = 2:

  • Input projections: 2ED² parameters (the two parallel branches)
  • Output projection: ED² parameters
  • Total for linear projections: 3ED² = 6D²
  • SSM parameters (projections for Δ, B, C, and the matrix A): negligible in comparison

Two stacked Mamba blocks have roughly 12D² parameters, matching the 12D² of a Transformer's interleaved MHA (4D² for the Q/K/V/O projections) and MLP (8D² for the two-layer FFN).
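The accounting can be checked directly (D is chosen arbitrarily here):

```python
# Parameter accounting for one Mamba block with expansion E = 2:
D, E = 1024, 2
in_proj = 2 * E * D**2   # two parallel input projections D -> E*D
out_proj = E * D**2      # output projection E*D -> D
block = in_proj + out_proj
assert block == 6 * D**2
# Two Mamba blocks match one Transformer layer: MHA (4D^2) + MLP (8D^2):
assert 2 * block == (4 + 8) * D**2
```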

5.3 No Attention, No MLP

A striking aspect of Mamba is what it does not have:

  • No attention mechanism — no Q/K/V projections, no softmax, no KV cache
  • No separate MLP blocks — the gating within the Mamba block subsumes the MLP's role
  • No positional encodings — the recurrent state implicitly encodes position

The architecture is homogeneous: just repeat the same Mamba block with residual connections and normalization. This simplicity is both aesthetically pleasing and practically beneficial for implementation and scaling.

5.4 Additional Design Choices

  • Real vs. Complex: Mamba uses real-valued states by default (not complex as in S4). Real values work better for discrete data (text, DNA) while complex may help for continuous data (audio).
  • Initialization: Uses the S4D-Real initialization, A_n = −(n + 1) for the n-th diagonal element, based on HiPPO theory. But the authors note that random initialization also works well.
  • Δ initialization: τ_Δ^{−1}(Uniform([0.001, 0.1])), i.e., the inverse softplus of small uniform values.
  • Optional LayerNorm: An additional normalization layer after the SSM, inspired by RetNet.
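The Δ initialization is a straightforward inverse-softplus trick, sketched here (the sampling range is from the paper; the rest is just the math above):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def inv_softplus(y):
    """Inverse of softplus: log(exp(y) - 1)."""
    return np.log(np.expm1(y))

rng = np.random.default_rng(5)
# Sample the desired initial step sizes, then store their pre-activation
# values, so softplus(parameter) lands back in [0.001, 0.1] at init.
dt = rng.uniform(0.001, 0.1, size=1024)
dt_param = inv_softplus(dt)
recovered = softplus(dt_param)   # equals dt up to floating-point error
```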

6. Experimental Results: A Comprehensive Analysis

6.1 Synthetic Tasks

Selective Copying (Table 1):

| Architecture | SSM Layer | Accuracy |
|---|---|---|
| No gate | S4 (LTI) | 18.3% |
| No gate | S6 (Selective) | 97.0% |
| H3 | S4 | 57.0% |
| H3 | Hyena | 30.1% |
| H3 | S6 | 99.7% |
| Mamba | S4 | 56.4% |
| Mamba | S6 | 99.8% |

Key finding: Architecture gating (H3, Mamba) helps somewhat, but the selection mechanism (S6) is the decisive factor. Without selection, even the sophisticated architectures fall well short of solving the task.

Induction Heads (Table 2): Mamba solves the induction heads task perfectly and extrapolates to 1,048,576 tokens — over 4,000× the training length of 256. No other model extrapolates beyond 2× its training length:

  • Multi-Head Attention (all positional encodings tested): fails beyond 16,384 tokens, the longest length that fits in memory
  • H3, Hyena: accuracy drops rapidly beyond training length
  • Mamba: 100% accuracy at 1M tokens

6.2 Language Modeling: Scaling Laws

Models from 125M to 1.3B parameters, trained on the Pile dataset following GPT-3 training recipe:

Key result: Mamba is the first attention-free model to match the performance of Transformer++ (the strong recipe with rotary embeddings, SwiGLU, RMSNorm from LLaMA/PaLM). The gap widens in Mamba's favor at longer sequences.

Other subquadratic models tested (Hyena, RWKV, RetNet, H3) all fall measurably behind both Transformer++ and Mamba.

6.3 Language Modeling: Zero-Shot Downstream (Table 3)

Mamba vs. Pythia (same training data, tokenizer, and 300B tokens) on standard benchmarks:

Mamba-370M vs Pythia-410M:

| Benchmark | Pythia-410M | Mamba-370M |
|---|---|---|
| LAMBADA (acc) | 51.4% | 55.6% |
| HellaSwag | 40.6% | 46.5% |
| PIQA | 66.9% | 69.5% |
| Arc-E | 52.1% | 55.1% |
| WinoGrande | 53.8% | 55.3% |
| Average | 48.2% | 50.0% |

Mamba-1.4B vs Pythia-1.4B (and larger):

| Benchmark | Pythia-1.4B | RWKV-1.5B | Mamba-1.4B |
|---|---|---|---|
| LAMBADA (acc) | 61.7% | 56.4% | 64.9% |
| HellaSwag | 52.1% | 52.5% | 59.1% |
| PIQA | 71.0% | 72.4% | 74.2% |
| Arc-E | 60.5% | 60.5% | 65.5% |
| Arc-C | 28.5% | 29.4% | 32.8% |
| WinoGrande | 57.2% | 54.6% | 61.5% |
| Average | 55.2% | 54.3% | 59.7% |

Mamba-2.8B vs Pythia-2.8B:

| Benchmark | Pythia-2.8B | Mamba-2.8B |
|---|---|---|
| LAMBADA (acc) | 64.7% | 69.2% |
| HellaSwag | 59.3% | 66.1% |
| Average | 59.1% | 63.3% |

Mamba-2.8B's average performance (63.3%) even exceeds Pythia-6.9B (61.7%) — a model more than twice its size. This is the paper's most striking result for language modeling.

6.4 DNA Modeling

Model size scaling (Figure 5, Left): On the HG38 human genome dataset with sequence length 1024, Mamba scales better than both HyenaDNA and Transformer++. At the largest model size (~40M parameters), Mamba matches the baselines with roughly 3-4× fewer parameters.

Context length scaling (Figure 5, Right): This is where selection shines most dramatically. Mamba's perplexity monotonically improves as context increases from 1K to 1M tokens. HyenaDNA (an LTI model) actually gets worse with longer context — exactly as predicted by the theory in Section 3.5. LTI models accumulate noise indiscriminately, while selective models filter it out.

Great Apes Classification (Figure 6): A challenging task of distinguishing five great ape species (sharing 99% of their DNA) from DNA segments. Mamba with 1M context achieves ~90% accuracy, far exceeding HyenaDNA (~70%) and the random baseline (20%). With short context, performance is limited — the discriminative signals are sparse in the genome.

6.5 Audio Modeling

Autoregressive Pretraining (Figure 7): On the YouTubeMix piano dataset, Mamba improves with longer context up to 960K samples (~60 seconds). At all lengths, Mamba outperforms SaShiMi (S4+MLP), with the gap widening at longer sequences.

Note: This is the only experiment using complex-valued SSMs, validating the continuous-discrete spectrum insight.

Speech Generation on SC09 (Table 4):

| Model | Params | FID ↓ | IS ↑ |
|---|---|---|---|
| WaveNet | 4.2M | 5.08 | 2.27 |
| SaShiMi | 5.8M | 1.99 | 5.13 |
| DiffWave + SaShiMi | 23.0M | 1.42 | 5.94 |
| Mamba (small) | 6.1M | 0.94 | 6.26 |
| Mamba (large) | 24.3M | 0.67 | 7.33 |

The small Mamba model (6.1M params) already beats the larger baselines above, including the diffusion-based DiffWave + SaShiMi. The large model achieves an FID of 0.67 — roughly half of the next best.

Architecture Ablation (Table 5): In the SaShiMi U-Net backbone, replacing S4+MLP with Mamba blocks improves every metric. The ranking is: Mamba > S4+MLP > MHA+MLP in center blocks, and Mamba > S4+MLP consistently in outer blocks.

6.6 Efficiency Benchmarks (Figure 8)

Training speed:

  • Mamba's fused scan: 20-40× faster than standard PyTorch scan
  • Faster than FlashAttention-2 beyond 2K sequence length
  • Faster than convolution-based SSMs (which require O(L log L) FFTs)

Inference throughput:

  • Mamba: 4-5× higher throughput than similar-size Transformers
  • Mamba-6.9B achieves higher throughput than Transformer-1.3B
  • The advantage comes from no KV cache → can use much larger batch sizes

6.7 Ablation Studies

Architecture vs. SSM Layer (Table 6):

  • Among LTI layers (S4, Hyena), performance is virtually identical (PPL ~10.2-10.5)
  • Switching any LTI layer to selective (S6) drops perplexity dramatically to ~8.7-9.0
  • Real-valued SSMs perform as well as complex-valued for language modeling
  • Mamba architecture is slightly better than H3 architecture (8.69 vs 8.95 with S6)

Selective Parameters (Table 7):

| Selective Δ | Selective B | Selective C | Perplexity |
|---|---|---|---|
| No | No | No | 10.93 |
| No | Yes | No | 10.15 |
| No | No | Yes | 9.98 |
| Yes | No | No | 9.81 |
| Yes | Yes | Yes | 8.71 |

Δ is the most impactful single parameter (consistent with its connection to gating in Theorem 1), but all three synergize for the best result.

State Dimension N (Table 10): With selective B and C, increasing N from 1 to 16 improves perplexity from 9.73 to 8.71 — over a full point — for only 1% more parameters. Without selection, increasing N barely helps (9.88 → 9.81). This validates the core thesis: large state dimensions are only useful if the model can selectively populate them.

Δ Projection Dimension (Table 9): Even a Δ projection of dimension 1 (versus a non-selective Δ) provides a large improvement (10.93 → 8.97). Increasing the projection dimension further provides modest gains (8.97 → 8.71 at dimension 64).

A Initialization (Table 8): S4D-Real and random initializations perform equally well (PPL 8.71), while the complex S4D-Lin initialization is slightly worse (9.16). This suggests that with selection, the specific initialization of A matters less.


7. Limitations and Open Questions

7.1 Scale

The paper evaluates up to 2.8B parameters. While the scaling laws look favorable, it remains unproven whether Mamba maintains its advantage at 7B, 70B, or larger scales. The Transformer has been validated at these sizes; Mamba hasn't yet.

7.2 Ecosystem and Affordances

Transformers have a rich ecosystem: fine-tuning recipes, RLHF pipelines, quantization toolkits, prompt engineering techniques, LoRA/adapters, etc. It's unknown whether these techniques transfer smoothly to Mamba or whether new techniques are needed.

7.3 Continuous-Discrete Tradeoff

The selection mechanism helps on discrete data (text, DNA) but may slightly hurt on continuous data (audio, video) where LTI models excel. The paper's audio experiments use complex-valued SSMs to partially compensate, but the tradeoff is inherent.

7.4 In-Context Learning

While Mamba performs well on the synthetic induction heads task, the full landscape of in-context learning in large Mamba models is unexplored. Transformers' in-context learning emerges at scale; it's unclear if Mamba exhibits the same phenomena.

7.5 Attention-Free Means No Explicit Retrieval

Attention provides an explicit mechanism for looking up specific past tokens. Mamba's recurrent state must compress all past information, which may be a disadvantage for tasks requiring precise retrieval from long contexts (e.g., "what was the 7th word?").


8. Reproducibility

Open source: Full code and pretrained checkpoints are available at https://github.com/state-spaces/mamba.

Hardware: Experiments ran on NVIDIA A100 GPUs. The custom CUDA kernels for the selective scan are a core part of the contribution and are included in the release.

Training details: The paper follows standard recipes (Chinchilla scaling, GPT-3 hyperparameters for language, HyenaDNA setup for genomics). All hyperparameters are detailed in Appendix E.

Reproducibility concerns:

  • The custom CUDA kernels are specific to NVIDIA hardware and would need porting for other accelerators (AMD, TPUs)
  • The scaling law comparison uses models up to 1.3B; larger-scale validation is left to future work
  • Some baselines (RWKV, RetNet) were missing results at the longest context lengths due to efficiency issues, making the comparison incomplete at those settings

9. Impact and Legacy

Mamba's impact has been enormous. Within months of release, it spawned:

  • Mamba-2 (Dao & Gu, 2024): Reformulated the selective SSM as structured matrix multiplication, further improving speed
  • Vision Mamba / Vim: Application to vision tasks, competing with Vision Transformers
  • Jamba: A hybrid Mamba-Transformer architecture from AI21
  • MambaByte: Character-level language modeling with Mamba
  • State Space Duality: Theoretical connections between SSMs and attention

The paper's contribution goes beyond a single model: it demonstrates that the Transformer's dominance is not inevitable, and that carefully designed alternatives can compete while being fundamentally more efficient. The selection mechanism — making model parameters input-dependent — is a general principle that has already been applied to other architectures.

Whether Mamba (or its successors) ultimately replaces the Transformer remains to be seen. But it has permanently expanded our vocabulary of what's possible in sequence modeling.


10. Summary

Mamba makes three tightly coupled contributions:

  1. The Selection Mechanism: Making SSM parameters (B, C, Δ) input-dependent, enabling content-aware filtering and context compression. This is the conceptual breakthrough.

  2. The Hardware-Aware Algorithm: A fused selective scan that computes in SRAM, uses parallel scan for parallelism, and recomputes states in the backward pass. This is the engineering breakthrough that makes selection practical.

  3. The Simplified Architecture: Merging the SSM block and MLP block into a single homogeneous Mamba block with no attention. This is the design insight that makes the model clean and scalable.

Together, these yield a model that is simultaneously:

  • More expressive than prior SSMs (matches Transformer quality on language)
  • More efficient than Transformers (5× inference throughput, linear-time training)
  • More scalable in context length (improves up to 1M tokens on real data)
  • Simpler in architecture (one repeated block, no attention, no positional encoding)

The Mamba paper is a masterclass in identifying a fundamental limitation (LTI), proposing a minimal and principled fix (selection), solving the engineering challenges that fix creates (hardware-aware scan), and validating the result across diverse domains. It represents one of the most significant architectural innovations in deep learning since the Transformer itself.


Review completed: March 28, 2026