
State Space Models and Mamba


Updated 2026-04-13

Introduction

Standard Attention has $O(L^2)$ computational complexity ($L$ = sequence length), and the KV cache grows linearly with sequence length during inference. A 100K-token context means the Attention matrix has $10^{10}$ elements — memory and compute costs become an unavoidable bottleneck.

RNNs seem like a solution: $O(1)$ inference complexity, requiring only a fixed-size hidden state per step. But traditional RNNs have two fatal weaknesses: (1) vanishing/exploding gradients prevent learning long-range dependencies; (2) the inherently sequential nature of step-by-step recursion makes training unparallelizable.

Is there an approach that solves both problems simultaneously — parallelizable training (like Attention) + $O(1)$ inference (like RNNs) + the theoretical ability to learn arbitrarily long-range dependencies? State Space Models (SSMs) are exactly such an approach. Originating from control theory and signal processing, the core idea is to compress the entire history of a sequence into a fixed-size state vector, updating the state for each new token rather than looking back over the full history. The key SSM evolution: HiPPO (2020) → S4 (2021) → H3 (2022) → Mamba (2023) → Mamba-2 (2024).


1. Continuous State Space Models

The mathematical foundation of SSMs is a continuous-time linear dynamical system:

$$\dot{x}(t) = Ax(t) + Bu(t), \qquad y(t) = Cx(t) + Du(t)$$

Here $u(t)$ is the input signal, $x(t) \in \mathbb{R}^N$ is the hidden state, and $y(t)$ is the output. Matrix $A \in \mathbb{R}^{N \times N}$ governs how the state evolves, $B \in \mathbb{R}^{N \times 1}$ controls how the input is written into the state, and $C \in \mathbb{R}^{1 \times N}$ controls how the output is read from the state.

The $D$ term is a direct feedforward connection (skip connection). In S4 and Mamba implementations, it is typically set to 0 or identity, and does not affect the core SSM dynamics — we omit it hereafter.

Intuition: The state $x$ is a “compressed history summary.” Unlike Attention — which looks back over the entire token sequence at every step — an SSM maintains only a fixed-size state, with all historical information compressed into these $N$ dimensions. Think of it as a “filter with memory”: the input signal passes through the system, gets remembered and mixed by the state, then produces the output.

The above is the simplest single-input single-output (SISO) form. In practice, the SSM runs independently on each of the model's $D$ feature dimensions (similar to depth-wise convolution).
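
To make the shapes concrete, here is a minimal NumPy sketch of one such SISO system; the matrix values are arbitrary placeholders (a well-chosen $A$ is the subject of the next section), not a trained model:

```python
import numpy as np

N = 4                       # state dimension
A = np.random.randn(N, N)   # state evolution (placeholder; HiPPO gives a principled choice)
B = np.random.randn(N, 1)   # input -> state
C = np.random.randn(1, N)   # state -> output

def continuous_dynamics(x, u):
    """Continuous-time SSM: x is the (N, 1) state, u a scalar input sample.
    Returns (dx/dt, y); the D skip term is omitted as in the text."""
    dx = A @ x + B * u
    y = (C @ x).item()
    return dx, y
```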

Signal processing analogy: An SSM is like a “memoryful IIR filter.” The input signal $u(t)$ passes through the system; the state $x(t)$ acts as the filter’s internal registers; $A$ controls feedback between registers (pole locations); $B$ controls the path from input to registers; $C$ controls the path from registers to output.

Key problem: The initialization of the $A$ matrix determines the “mathematical structure of memory.” A randomly initialized $A$ causes exponential information decay, preventing the model from learning long-range dependencies. This is precisely the problem that HiPPO, covered in the next section, aims to solve.

(Figure: SSM initialization with $x_0 = 0$. The state vector $x \in \mathbb{R}^N$ is the SSM’s “memory”: it has a fixed size and does not grow with the sequence; $N$ is typically 16–64, much smaller than the sequence length.)

1.5 HiPPO: The Mathematical Foundation of SSM Memory

How can an $N$-dimensional state vector “optimally” compress the history of a continuous signal? A randomly initialized $A$ matrix causes exponential information decay — the state only remembers the most recent steps, and distant information is rapidly lost.

HiPPO (High-order Polynomial Projection Operators, Gu et al., NeurIPS 2020) provides an elegant solution: approximate the signal history using an orthogonal polynomial basis (Legendre polynomials). The $n$-th component of the state vector is the projection coefficient of the signal onto the $n$-th Legendre polynomial. Thus, the $N$-dimensional state is an $N$-th order polynomial approximation of the signal history, rather than a simple exponential decay.

The HiPPO-LegS matrix (HiPPO paper Section 3.2) has a special lower-triangular + diagonal structure. This specific $A$ matrix makes the state update exactly equivalent to “online computation of Legendre projection coefficients” — with each new token processed, the state vector automatically maintains an optimal polynomial approximation of the entire signal history.
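
The construction below follows the LegS formula as it appears in public S4 reference implementations; treat it as an illustrative sketch of the lower-triangular + diagonal structure (sign conventions vary between papers):

```python
import numpy as np

def make_hippo_legs(N: int) -> np.ndarray:
    """HiPPO-LegS matrix: A[n, k] = -sqrt(2n+1)*sqrt(2k+1) for n > k,
    -(n+1) on the diagonal, and 0 above the diagonal."""
    p = np.sqrt(1 + 2 * np.arange(N))         # sqrt(2n + 1)
    A = p[:, None] * p[None, :]                # outer product sqrt(2n+1)*sqrt(2k+1)
    A = np.tril(A) - np.diag(np.arange(N))     # keep the lower triangle, set diagonal to n+1
    return -A                                  # negative sign: the state decays (stable)

print(make_hippo_legs(4))
```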

(Figure: with a randomly initialized $A$, the state decays exponentially — only the most recent tokens survive in the state, while earlier tokens are forgotten.)

Why it matters: S4 experiments demonstrated that S4 with HiPPO initialization achieved SoTA on Path-X (16384-step sequence classification), while random initialization completely failed. HiPPO provides the theoretical foundation for SSMs to learn arbitrarily long-range dependencies.

Mamba’s simplification: Mamba does not use the full HiPPO matrix, but instead uses the simpler S4D-Real initialization: $A = -\text{diag}(1, 2, ..., N)$. In code, A_log = log([1,...,N]) is stored as a parameter, and at computation time A = -exp(A_log). The negative sign ensures state decay (stability), and log-space storage ensures numerical stability. Mamba’s selectivity mechanism (input-dependent $\Delta$, $B$, $C$) compensates for the reduced expressiveness of the simplified initialization.
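
A minimal sketch of this parameterization (one diagonal vector shown; real implementations keep one such parameter set per channel):

```python
import numpy as np

N = 16
# S4D-Real initialization: A = -diag(1, 2, ..., N), stored in log space
A_log = np.log(np.arange(1, N + 1, dtype=np.float64))   # the learnable parameter

def get_A(A_log: np.ndarray) -> np.ndarray:
    """Recover the diagonal of A at computation time."""
    return -np.exp(A_log)   # always negative => decaying, stable state

print(get_A(A_log)[:4])     # approximately [-1, -2, -3, -4]
```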


2. Discretization: From Continuous to Sequential

Why discretize? Language models process discrete token sequences, but the SSM’s mathematics is in continuous time. Analogy: continuous signal → digital sampling, where the step size $\Delta$ is the sampling interval.

Starting from $\dot{x}(t) = Ax(t) + Bu(t)$, integrating over the interval $[k\Delta, (k+1)\Delta]$ and assuming $u(t)$ is constant within the interval (Zero-Order Hold assumption), we get:

$$\bar{A} = e^{\Delta A}, \qquad \bar{B} = A^{-1}(e^{\Delta A} - I)\, B$$

The discretized recurrence formula:

$$x_k = \bar{A} x_{k-1} + \bar{B} u_k, \qquad y_k = C x_k$$

The simplest discretization is the first-order Euler approximation: $\bar{A} \approx I + \Delta A$, $\bar{B} \approx \Delta B$ — intuitive but low precision. Mamba and S4 actually use ZOH (higher precision). Different discretization methods produce different $\bar{A}$, $\bar{B}$, but all map the same continuous system to a discrete recurrence.
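
For a diagonal $A$ (as in S4D and Mamba) both discretizations reduce to elementwise formulas; a small sketch for comparison (variable names are illustrative):

```python
import numpy as np

def discretize_zoh(A_diag, B, dt):
    """Zero-Order Hold for diagonal A: A_bar = exp(dt*A), B_bar = A^{-1}(exp(dt*A) - 1) B."""
    A_bar = np.exp(dt * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

def discretize_euler(A_diag, B, dt):
    """First-order Euler: A_bar = 1 + dt*A, B_bar = dt*B (lower precision)."""
    return 1.0 + dt * A_diag, dt * B

A_diag = -np.arange(1.0, 5.0)     # S4D-Real style diagonal
B = np.ones(4)
print(discretize_zoh(A_diag, B, dt=0.1))
print(discretize_euler(A_diag, B, dt=0.1))
```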

The step size $\Delta$ in the LTI (Linear Time-Invariant) context controls the system’s “temporal resolution”: analogous to audio sample rate — 44.1kHz (small $\Delta$, high resolution) vs 8kHz (large $\Delta$, low resolution). Note: in Section 4 on Mamba’s selective $\Delta$, the meaning differs — large $\Delta$ means “reset & focus,” small $\Delta$ means “persist & ignore” (details in Section 4).


3. Duality of Recurrence and Convolution

Discrete SSMs possess an elegant property: the same model can be computed in two entirely different ways.

Recurrence mode: Step-by-step recursion $x_k = \bar{A}x_{k-1} + \bar{B}u_k$, with $O(1)$ computation per step. Ideal for inference — each new token requires only a single state update.

Convolution mode: Unrolling the recursion reveals the convolution structure:

$$y_0 = C\bar{B}u_0, \quad y_1 = C\bar{A}\bar{B}u_0 + C\bar{B}u_1, \quad \dots, \quad y_k = \sum_{j=0}^{k} C\bar{A}^{k-j}\bar{B}\, u_j$$

This is exactly a convolution $y = \bar{K} * u$ with kernel $\bar{K}_i = C\bar{A}^i\bar{B}$. Using FFT acceleration (convolution in the time domain = multiplication in the frequency domain), $y = \text{IFFT}(\text{FFT}(\bar{K}) \cdot \text{FFT}(u))$, the entire sequence is computed in parallel in $O(L \log L)$ time ($L$ = sequence length). Ideal for training — all tokens are processed simultaneously.
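
The equivalence of the two modes can be checked numerically on a toy diagonal SSM: run the recurrence, then build the kernel $\bar{K}$ and convolve, and the outputs match. A sketch (direct convolution here for clarity; FFT would give the same result faster):

```python
import numpy as np

L, N = 8, 4
rng = np.random.default_rng(0)
A_bar = np.exp(-0.1 * np.arange(1, N + 1))   # discretized diagonal A (decaying)
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
u = rng.normal(size=L)

# Recurrence mode: one O(1) state update per token
x, y_rec = np.zeros(N), []
for k in range(L):
    x = A_bar * x + B_bar * u[k]
    y_rec.append(C @ x)

# Convolution mode: K_i = C A_bar^i B_bar, then y_k = sum_j K_{k-j} u_j
K = np.array([C @ (A_bar ** i * B_bar) for i in range(L)])
y_conv = [np.sum(K[:k + 1][::-1] * u[:k + 1]) for k in range(L)]

print(np.allclose(y_rec, y_conv))   # True: same model, two computation modes
```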

(Figure: the same SSM computed in recurrence mode (sequential, $O(1)$ per step, suited to inference) and in convolution mode (parallel over the whole sequence, suited to training).)

This duality is the core contribution of S4 (“Efficiently Modeling Long Sequences with Structured State Spaces,” Gu et al., ICLR 2022): use convolution mode during training to fully exploit GPU parallelism, then switch to recurrence mode at inference for $O(1)$ incremental generation.

S4’s technical key: Computing the convolution kernel $\bar{K}_i = C\bar{A}^i\bar{B}$ requires high powers of $\bar{A}$ — expensive for general matrices. S4 uses an NPLR/DPLR parameterization (decomposing $A$ as a normal/diagonal matrix plus a low-rank correction), leveraging the Cauchy kernel formula for efficient computation. The simplified S4D diagonalizes $A$ directly, making $\bar{A}^i$ simply the elementwise $i$-th powers of the diagonal entries.

S4’s landmark results: SoTA on all Long Range Arena tasks, including Path-X (16384-step sequence classification) where all prior methods had failed; Sequential CIFAR-10 at 91% (no data augmentation); generation speed 60x faster than same-parameter Transformers.


3.5 From S4 to Mamba: Why Selectivity Is Needed

Despite S4’s excellent performance on long-sequence tasks, it has a fundamental limitation — Linear Time-Invariance (LTI): all parameters ($A$, $B$, $C$, $\Delta$) remain fixed. This means the SSM processes “the” and “cat” in exactly the same way. From the convolution perspective: a fixed convolution kernel = a fixed filter, unable to select based on content.

The Selective Copying task (Mamba paper Section 4.1) clearly exposes this problem: the task requires selecting colored tokens from noisy tokens and copying them in order, with variable spacing. S4 (LTI) achieves only 18.3% accuracy, while Mamba (selective) reaches 99.8%. The reason: a static convolution kernel cannot handle selective copying with variable spacing.

H3 (Hungry Hungry Hippos, Fu et al., 2022) attempted a transitional approach: adding some Attention layers outside the SSM layers to supplement retrieval capability. A 125M H3-Attention hybrid model outperformed a pure Transformer. But this raised a more fundamental question: rather than adding Attention externally to compensate for SSM limitations, why not make the SSM itself selective? This line of thinking directly led to Mamba.


4. Mamba’s Selectivity Mechanism

Mamba (Gu & Dao, 2023) introduces the core innovation of Selectivity: making key parameters dependent on the current input, fundamentally solving the LTI limitation described in Section 3.5.

  • Selective $B$: $B_k = \text{Linear}(u_k)$ — controls “what to write into the state”
  • Selective $C$: $C_k = \text{Linear}(u_k)$ — controls “what to read from the state”
  • Selective $\Delta$: $\Delta_k = \text{softplus}(\text{Linear}(u_k))$ — controls “state update intensity”

The mechanism of $\Delta$: since $A$ is initialized to be negative, the behavior of $\bar{A} = e^{\Delta A}$ depends on the magnitude of $\Delta$. Large $\Delta$ → $\Delta A$ is a large negative number → $\bar{A} \to 0$, the old state is heavily decayed, while $\bar{B}$ increases and the current input is strongly written in — the effect is “reset & focus” (clear old state, focus on the current token). Small $\Delta$ → $\bar{A} \approx I$, the old state is almost fully preserved, while $\bar{B} \to 0$ and the current input is nearly ignored — the effect is “persist & ignore” (keep old state, ignore the current token).
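
A minimal sketch of one selective step for a single channel; the projection weights are random placeholders and the discretization is simplified, so this illustrates the mechanism rather than reproducing Mamba's exact kernel:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

D, N = 8, 4                                   # token embedding dim, state dim per channel
rng = np.random.default_rng(0)
W_B, W_C = rng.normal(size=(N, D)), rng.normal(size=(N, D))
w_dt = rng.normal(size=D)
A = -np.arange(1.0, N + 1)                    # fixed diagonal A (not input-dependent)

def selective_step(x, u_tok, u_d):
    """x: (N,) state of this channel, u_tok: (D,) token embedding, u_d: this channel's scalar input."""
    B_k = W_B @ u_tok                         # what to write into the state
    C_k = W_C @ u_tok                         # what to read from the state
    dt = softplus(w_dt @ u_tok)               # update intensity, always > 0
    A_bar = np.exp(dt * A)                    # large dt -> A_bar ~ 0 (reset), small dt -> A_bar ~ 1 (persist)
    B_bar = (A_bar - 1.0) / A * B_k           # discretized selective B
    x = A_bar * x + B_bar * u_d
    return x, C_k @ x                         # new state, scalar output for this channel
```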

For content words (“cat”, “mat”), the model outputs a large $\Delta$ to strongly write the current token into the state; for function words (“the”, “on”), it outputs a small $\Delta$ to leave the state nearly unchanged.

RNN gate equivalence: Mamba paper Theorem 1 proves that when $N=1$, $A=-1$, $B=1$, the selective SSM degenerates exactly to a gated RNN: $g_t = \sigma(\text{Linear}(x_t))$, $h_t = (1-g_t)h_{t-1} + g_t x_t$. Here $(1-g_t)$ controls forgetting and $g_t$ controls writing — both coupled through a single scalar (unlike LSTM’s independent forget and input gates). In other words, the SSM’s discretization step $\Delta$ is the theoretical foundation for RNNs’ heuristic gating mechanisms.
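
A sketch of the calculation behind this correspondence, using the ZOH formulas from Section 2 with $N=1$, $A=-1$, $B=1$ and $\Delta_t = \mathrm{softplus}(z_t)$, $z_t = \mathrm{Linear}(x_t)$ (the paper's proof covers the general statement):

$$\bar{A}_t = e^{\Delta_t A} = e^{-\Delta_t}, \qquad \bar{B}_t = A^{-1}\!\left(e^{\Delta_t A} - 1\right)B = 1 - e^{-\Delta_t}$$

$$1 - e^{-\mathrm{softplus}(z_t)} = 1 - \frac{1}{1 + e^{z_t}} = \sigma(z_t) \equiv g_t$$

$$h_t = e^{-\Delta_t}\, h_{t-1} + \left(1 - e^{-\Delta_t}\right) x_t = (1 - g_t)\, h_{t-1} + g_t\, x_t$$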

(Interactive figure: per-token $\Delta$ values for the sentence “The cat sat on the mat.” Content words get large $\Delta$ (reset & focus, small $\bar{A}$), function words get small $\Delta$ (persist & ignore, $\bar{A}$ near 1). Selective $\Delta$ allows Mamba to adaptively focus on important tokens and ignore noise — similar to Attention’s “soft selection.”)

Parallel Scan: Parallel Training for Selective SSMs

The cost of selectivity: once parameters depend on the input, the SSM is no longer time-invariant (LTI), the convolution kernel $\bar{K}$ changes with the input, and FFT acceleration is no longer possible. The recurrence $x_k = \bar{A}_k x_{k-1} + \bar{B}_k u_k$ has parameters varying with $k$ — how to parallelize?

Key observation: the recurrence $x_k = a_k x_{k-1} + b_k$ can be expressed through an associative combine on pairs: two elements $(a_1, b_1)$ and $(a_2, b_2)$ compose as $(a_2 a_1,\; a_2 b_1 + b_2)$. Associativity means we can use the parallel prefix-sum (scan) algorithm — similar to parallel reduction on GPUs (a small numerical check follows the figure below):

  • Work: $O(L)$ (total computation, same as sequential execution)
  • Span: $O(\log L)$ (critical path length, which determines the parallel time)
(Figure: a naive sequential scan of $x_k = \bar{A}_k x_{k-1} + \bar{B}_k u_k$ needs $L-1$ dependent steps, since each step must wait for the previous one; a parallel scan shortens the critical path to $O(\log L)$.)
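
A small NumPy check of the combine operator: composing pairs in different groupings gives the same result (associativity), and folding the pairs left-to-right reproduces the sequential recurrence. This property is what lets a tree-structured (Blelloch-style) scan compute the exact same states in $O(\log L)$ depth; the tree itself is omitted here for brevity:

```python
import numpy as np

def combine(e1, e2):
    """Associative combine for x_k = a_k * x_{k-1} + b_k."""
    a1, b1 = e1
    a2, b2 = e2
    return (a2 * a1, a2 * b1 + b2)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=8)    # per-step A_bar_k (input-dependent)
b = rng.normal(size=8)               # per-step B_bar_k * u_k

# Associativity: grouping does not matter
e1, e2, e3 = (a[0], b[0]), (a[1], b[1]), (a[2], b[2])
print(np.allclose(combine(combine(e1, e2), e3), combine(e1, combine(e2, e3))))   # True

# Folding the combine reproduces the sequential recurrence exactly
x, xs_seq = 0.0, []
acc, xs_scan = (1.0, 0.0), []        # identity element: (a, b) = (1, 0)
for k in range(8):
    x = a[k] * x + b[k]
    xs_seq.append(x)
    acc = combine(acc, (a[k], b[k]))
    xs_scan.append(acc[1])
print(np.allclose(xs_seq, xs_scan))  # True
```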

Hardware-aware optimization (Mamba paper Section 3.3.2): A naive implementation would store $(\bar{A}, \bar{B})$ tensors of size $(B, L, D, N)$ in HBM — a memory explosion. Mamba’s approach: load the SSM parameters $(\Delta, A, B, C)$ directly from HBM to SRAM, perform discretization + scan in SRAM, and write only the final output of size $(B, L, D)$ back to HBM. During backpropagation, intermediate states are not stored but recomputed (similar to gradient checkpointing). Result: the same memory footprint as an optimized Transformer with FlashAttention.

Mamba Block Architecture

Mamba merges the functionality of Attention and MLP from Transformers into a more streamlined block:

(Figure: Mamba block. Input $(B, L, D)$ → Linear↑ ($D \to ED$, expand) → Conv1d (kernel 4) → SiLU → selective SSM ($\Delta$, $B$, $C$ = f(input)), multiplied elementwise by a SiLU gate branch of size $(B, L, ED)$ → Linear↓ ($ED \to D$, contract) → residual add → output $(B, L, D)$. There is no separate Attention or MLP sub-layer; compared with a Transformer, the block is simpler.)

Architecture highlights:

  • Role of Conv1d: Provides local context (kernel size = 4), allowing the SSM to “see” neighboring tokens when making selective decisions. Without Conv1d, the SSM’s $B$, $C$, $\Delta$ would be decided from individual token embeddings alone.
  • Gating mechanism: The SiLU gate on the right branch is similar to a GLU (Gated Linear Unit). Output = SSM_output $\times$ $\sigma$(gate_input), controlling information flow and preventing over-activation.
  • Dimension changes: with expand factor $E=2$, Linear↑ maps $D \to 2D$ and Linear↓ maps $2D \to D$.
  • Comparison with Transformer: Transformer block = LayerNorm → Multi-Head Attention → Residual → LayerNorm → MLP → Residual; Mamba block = LayerNorm → Linear↑ → (Conv1d → SiLU → SSM) $\times$ Gate → Linear↓ → Residual (see the sketch below).
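
A compressed PyTorch-style sketch of this block, following the description above; the selective scan itself is stubbed out (selective_ssm) since the fused kernel is out of scope, so treat this as an architectural outline rather than the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    def __init__(self, d_model: int, expand: int = 2, d_conv: int = 4):
        super().__init__()
        d_inner = expand * d_model
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_inner)   # Linear↑: produces SSM branch + gate branch
        self.conv1d = nn.Conv1d(d_inner, d_inner, d_conv,
                                groups=d_inner, padding=d_conv - 1)  # depthwise, causal via pad + trim
        self.out_proj = nn.Linear(d_inner, d_model)      # Linear↓

    def selective_ssm(self, x):
        # Placeholder: the real layer computes Δ, B, C from x and runs the selective scan.
        return x

    def forward(self, x):                                 # x: (B, L, D)
        residual = x
        x, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)        # each (B, L, ED)
        L = x.shape[1]
        x = self.conv1d(x.transpose(1, 2))[..., :L].transpose(1, 2)  # local causal context
        x = self.selective_ssm(F.silu(x))                 # SiLU then selective SSM
        x = x * F.silu(gate)                              # GLU-style gating
        return residual + self.out_proj(x)                # Linear↓ + residual
```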

5. Mamba-2: State Space Duality

Mamba-2 (Dao & Gu, ICML 2024) establishes a deep mathematical connection between SSMs and Attention — State Space Duality (SSD).

Core finding: unrolling the SSM recurrence into matrix form $Y = M \cdot U$, the matrix $M$ is a semiseparable matrix: any submatrix taken from its lower-triangular part has rank $\leq N$ (the state dimension). Intuition: since each step carries information only through an $N$-dimensional state, the matrix’s “information bandwidth” is bounded by $N$. In contrast, for Attention the rank of $QK^T$ can be as large as the head dimension — much higher than the SSM’s $N$, making it more expressive but also slower.

(Figure: the SSM recurrence $x_k = \bar{A}x_{k-1} + \bar{B}u_k$ unrolled into a structured matrix multiplication $Y = M \cdot U$, where $M$ is lower-triangular and semiseparable: $M[i,j] = C\bar{A}^{\,i-j}\bar{B}$ for $j \leq i$ and $0$ for $j > i$, a causal and exponentially decaying structure.)
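
This matrix form can be verified on a toy example: build $M$ entry by entry from $C\bar{A}^{\,i-j}\bar{B}$ and compare $M \cdot U$ with the recurrence output. The sketch below uses time-invariant $\bar{A}$, $\bar{B}$, $C$ for brevity; SSD works with the input-dependent versions:

```python
import numpy as np

L, N = 6, 4
rng = np.random.default_rng(1)
A_bar = np.exp(-0.2 * np.arange(1, N + 1))    # diagonal, decaying
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
u = rng.normal(size=L)

# Semiseparable matrix: M[i, j] = C A_bar^(i-j) B_bar for j <= i, else 0
M = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        M[i, j] = C @ (A_bar ** (i - j) * B_bar)
y_matrix = M @ u

# Same output from the recurrence
x, y_rec = np.zeros(N), []
for k in range(L):
    x = A_bar * x + B_bar * u[k]
    y_rec.append(C @ x)

print(np.allclose(y_matrix, y_rec))   # True: Y = M · U reproduces the scan
```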

Chunk-wise algorithm: Divide a length-$L$ sequence into $L/Q$ chunks, each of size $Q$. Within each chunk, use matrix multiplication ($Q \times Q$ matrices, leveraging tensor cores); between chunks, use the SSM scan (passing only the $N$-dimensional state). Total cost: $O(LN^2)$ FLOPs and $O(LN)$ memory — similar to Flash Attention’s blocking strategy.

Multi-head SSM: Mamba-1 has head dimension $P=1$ (each feature dimension runs an independent SSM). Mamba-2 introduces head dimension $P=64$ or $128$ (multiple dimensions share $A$, analogous to multi-head attention). A larger head dimension makes the matrix multiplications more efficient (GPUs prefer large matrices) while enhancing model expressiveness.

Structured masked attention perspective: In SSD’s dual form, $\Lambda_{ij} = \prod_{t=j+1}^{i} a_t$ is a product of input-dependent scalars. This can be seen as replacing the Transformer’s heuristic positional encoding with a data-dependent positional mask. The main source of speedup is that the semiseparable structure allows sub-quadratic chunk-wise algorithms, not merely the removal of softmax.

Performance: Mamba-2’s core layer is 2-8x faster than Mamba-1’s fused scan; the crossover with FlashAttention-2 occurs at sequence length ~2K, and at 16K sequence length, it is ~6x faster than FlashAttention-2.


6. Benchmarks and Practical Comparison

All data below are from the Mamba paper (Gu & Dao, arXiv:2312.00752).

Language Model Quality

On the Pile dataset, Mamba outperforms same-scale Pythia (Transformer baseline) across all parameter scales:

| Model | Parameters | Pile Perplexity |
| --- | --- | --- |
| Pythia-1.4B | 1.4B | 7.51 |
| RWKV-1.5B | 1.5B | 7.70 |
| Mamba-1.4B | 1.4B | 6.80 |
| Pythia-2.8B | 2.8B | 6.73 |
| Mamba-2.8B | 2.8B | 6.22 |

Downstream task average accuracy (Mamba paper Table 1): Mamba-1.4B achieves 59.7%, surpassing the 2x-larger Pythia-2.8B (59.1%); Mamba-2.8B achieves 63.3%, surpassing the 2.5x-larger Pythia-6.9B (61.7%). In other words, Mamba roughly matches Transformer quality at half the parameters.

Inference Throughput

Mamba is 5x faster than same-parameter Transformers (A100 80GB, prompt 2048 + gen 128).

(Figure: Pile perplexity (lower is better), Mamba vs Pythia: 8.14 vs 8.55 at 370M, 6.80 vs 7.51 at 1.4B, 6.22 vs 6.73 at 2.8B; and inference throughput on an A100 80GB, where Mamba is about 5× faster than a same-size Transformer with prompt 2048 + gen 128.)

SSM Weaknesses

An SSM’s fixed $N$-dimensional state is inherently lossy compression. When precise recall of a specific token from a long sequence is needed (e.g., “what was the 3000th token?”), an $N$-dimensional vector cannot losslessly store all information from $L \gg N$ tokens. This is not a bug but a design trade-off: SSMs excel at summarization and pattern recognition, not exact retrieval.

Specific weaknesses include:

  • Multi-Query Associative Recall (MQAR): Mamba-1 struggles on this task (Mamba-2 paper Figure 8)
  • In-context learning: Pure SSMs remain weaker than Attention; the Mamba-2 paper recommends hybrid architectures
  • Long-sequence exact copying: Fixed state dimensions limit exact copying capability

Summary

SSMs provide a fundamentally different sequence modeling paradigm from Attention:

| | Attention | SSM (LTI, e.g. S4) | SSM (Selective, e.g. Mamba) |
| --- | --- | --- | --- |
| Training complexity | $O(L^2)$ | $O(L \log L)$ (FFT) | $O(L)$ (parallel scan) |
| Inference cache | $O(L)$ (KV cache) | $O(1)$ (fixed state) | $O(1)$ (fixed state) |
| History access | Exact (any token pair) | Compressed + fixed pattern | Compressed + adaptive pattern |
| Content awareness | Full ($Q$, $K$, $V$ all input-dependent) | None (parameters fixed) | Partial ($B$, $C$, $\Delta$ input-dependent) |
| Long-sequence ICL | Strong | Weak | Medium (still weaker than Attention) |
| Core strength | Exact retrieval | Parallel training + efficient inference | Selectivity + linear complexity |

The SSM development roadmap clearly shows the key problem solved at each step: HiPPO (2020) solved “how to initialize A to remember long history” → S4 (2021) solved “how to train efficiently” (convolution mode) → H3 (2022) discovered “SSMs need selectivity” → Mamba (2023) solved “how to make SSMs selectively process input” → Mamba-2 (2024) unified the mathematical frameworks of SSMs and Attention.

The exact-retrieval limitation from Section 6 leads to the topic of the next article: how Hybrid architectures combine Attention’s precise retrieval with the SSM’s efficient summarization, leveraging the strengths of both.

Further Reading