State Space Models and Mamba
Updated 2026-04-13
Introduction
Standard Attention has $O(L^2)$ computational complexity ($L$ = sequence length), and the KV cache grows linearly with sequence length during inference. A 100K-token context means the Attention matrix has $10^{10}$ elements — memory and compute costs become an unavoidable bottleneck.
RNNs seem like a solution: $O(1)$ inference complexity per step, requiring only a fixed-size hidden state. But traditional RNNs have two fatal weaknesses: (1) vanishing/exploding gradients prevent learning long-range dependencies; (2) the inherently sequential nature of step-by-step recursion makes training unparallelizable.
Is there an approach that solves both problems simultaneously — parallelizable training (like Attention) + $O(1)$ per-step inference (like RNNs) + theoretically learnable arbitrary long-range dependencies? State Space Models (SSMs) are exactly such an approach. Originating from control theory and signal processing, the core idea is to compress the entire history of a sequence into a fixed-size state vector, updating the state for each new token rather than looking back over the full history. The key SSM evolution: HiPPO (2020) → S4 (2021) → H3 (2022) → Mamba (2023) → Mamba-2 (2024).
1. Continuous State Space Models
The mathematical foundation of SSMs is a continuous-time linear dynamical system:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t)$$

Here $x(t)$ is the input signal, $h(t) \in \mathbb{R}^N$ is the hidden state, and $y(t)$ is the output. Matrix $A \in \mathbb{R}^{N \times N}$ governs how the state evolves, $B \in \mathbb{R}^{N \times 1}$ controls how input is written into the state, and $C \in \mathbb{R}^{1 \times N}$ controls how the output is read from the state.
The $D\,x(t)$ term is a direct feedforward connection (skip connection). In S4 and Mamba implementations, $D$ is typically set to 0 or the identity, and it does not affect the core SSM dynamics — we omit it hereafter.
Intuition: The state $h(t)$ is a “compressed history summary.” Unlike Attention — which looks back over the entire token sequence at every step — an SSM maintains only a fixed-size state, with all historical information compressed into these $N$ dimensions. Think of it as a “filter with memory”: the input signal passes through the system, gets remembered and mixed by the state, then produces the output.
The above is the simplest single-input single-output (SISO) form. In practice, the SSM runs independently on each feature dimension of the model (similar to depth-wise convolution).
Signal processing analogy: An SSM is like a “memoryful IIR filter.” The input signal passes through the system; the state acts as the filter’s internal registers; $A$ controls feedback between registers (pole locations); $B$ controls the path from input to registers; $C$ controls the path from registers to output.
Key question: The initialization of the $A$ matrix determines the “mathematical structure of memory.” A randomly initialized $A$ causes exponential information decay, preventing the model from learning long-range dependencies. This is precisely the problem that HiPPO, covered in the next section, aims to solve.
1.5 HiPPO: The Mathematical Foundation of SSM Memory
How can an $N$-dimensional state vector “optimally” compress the history of a continuous signal? A randomly initialized $A$ matrix causes exponential information decay — the state only remembers the most recent steps, and distant information is rapidly lost.
HiPPO (High-order Polynomial Projection Operators, Gu et al., NeurIPS 2020) provides an elegant solution: approximate the signal history using an orthogonal polynomial basis (Legendre polynomials). The $n$-th component of the state vector = the projection coefficient of the signal onto the $n$-th Legendre polynomial. Thus, the $N$-dimensional state is an order-$N$ polynomial approximation of the signal history, rather than simple exponential decay.
The HiPPO-LegS matrix $A$ (HiPPO paper Section 3.2) has a special lower-triangular + diagonal structure. This specific matrix makes the state update exactly equivalent to “online computation of Legendre projection coefficients” — with each new token processed, the state vector automatically maintains an optimal polynomial approximation of the entire signal history.
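To make the structure concrete, here is a minimal NumPy sketch of the HiPPO-LegS matrix construction (entries as given in the HiPPO paper: $-\sqrt{(2n+1)(2k+1)}$ below the diagonal, $-(n+1)$ on it, zero above); the function name is ours:

```python
import numpy as np

def make_hippo_legs(N: int) -> np.ndarray:
    """HiPPO-LegS transition matrix: lower-triangular plus a special diagonal.

    A[n, k] = -sqrt((2n+1)(2k+1)) for n > k,  -(n+1) for n == k,  0 for n < k.
    """
    p = np.sqrt(1 + 2 * np.arange(N))           # sqrt(2n + 1)
    A = p[:, None] * p[None, :]                  # sqrt((2n+1)(2k+1)) everywhere
    A = np.tril(A) - np.diag(np.arange(N))       # keep lower triangle; diagonal becomes n + 1
    return -A                                    # negative sign: decaying, stable dynamics

print(make_hippo_legs(4))
```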
Why it matters: the S4 experiments demonstrated that HiPPO initialization achieves SoTA on Path-X (16,384-step sequence classification), while random initialization fails completely. HiPPO provides the theoretical foundation for SSMs to learn arbitrarily long-range dependencies.
Mamba’s simplification: Mamba does not use the full HiPPO matrix, but instead uses the simpler S4D-Real initialization: $A_n = -(n+1)$, i.e. $A = -\mathrm{diag}(1, 2, \ldots, N)$. In code, A_log = log([1,...,N]) is stored as a parameter, and at computation time A = -exp(A_log). The negative sign ensures state decay (stability), and log-space storage ensures numerical stability. Mamba’s selectivity mechanism (input-dependent $B$, $C$, $\Delta$) compensates for the reduced expressiveness of the simplified initialization.
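A minimal sketch of this initialization, close to how it appears in the Mamba reference implementation (tensor sizes here are illustrative):

```python
import torch
import torch.nn as nn

d_inner, d_state = 8, 16   # illustrative sizes: number of channels and state dimension N

# S4D-Real initialization: A_n = -(n+1), i.e. A = -[1, 2, ..., N], repeated per channel.
A_init = torch.arange(1, d_state + 1, dtype=torch.float32).repeat(d_inner, 1)  # (d_inner, N)
A_log = nn.Parameter(torch.log(A_init))    # stored in log space for numerical stability

# At compute time: the negative sign guarantees the state decays (stability).
A = -torch.exp(A_log.float())              # (d_inner, N), all entries negative
print(A[0])                                # -1, -2, ..., -16
```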
2. Discretization: From Continuous to Sequential
Why discretize? Language models process discrete token sequences, but the SSM’s mathematics is in continuous time. Analogy: continuous signal → digital sampling, where the step size $\Delta$ is the sampling interval.
Starting from $h'(t) = A\,h(t) + B\,x(t)$, integrating over the interval $[t, t+\Delta]$ and assuming $x$ is constant within the interval (Zero-Order Hold assumption), we get:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\cdot \Delta B$$

The discretized recurrence formula:

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t$$
The simplest discretization is the first-order Euler approximation: $\bar{A} = I + \Delta A$, $\bar{B} = \Delta B$. Intuitive but low precision. Mamba and S4 actually use ZOH (higher precision). Different discretization methods produce different $\bar{A}$, $\bar{B}$, but all map the same continuous system to a discrete recurrence.
The step size $\Delta$ in the LTI (Linear Time-Invariant) context controls the system’s “temporal resolution”: analogous to audio sample rate — 44.1kHz (small $\Delta$, high resolution) vs 8kHz (large $\Delta$, low resolution). Note: in Section 4 on Mamba’s selective $\Delta_t$, the meaning differs — large $\Delta_t$ means “reset & focus,” small $\Delta_t$ means “persist & ignore” (details in Section 4).
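A small NumPy sketch comparing the two discretizations for a diagonal $A$ (values are illustrative); for small $\Delta$ they agree closely, for large $\Delta$ Euler drifts:

```python
import numpy as np

def discretize_zoh(A, B, dt):
    """Zero-Order Hold: A_bar = exp(dt*A); B_bar = (dt*A)^{-1}(exp(dt*A) - I) * dt*B.
    A is diagonal here, so everything is element-wise."""
    A_bar = np.exp(dt * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def discretize_euler(A, B, dt):
    """First-order Euler: A_bar = I + dt*A, B_bar = dt*B. Simple but less accurate."""
    return 1.0 + dt * A, dt * B

A = np.array([-1.0, -2.0, -4.0])   # diagonal of a stable continuous A (illustrative)
B = np.ones(3)
for dt in (0.01, 0.5):
    print(dt, discretize_zoh(A, B, dt)[0], discretize_euler(A, B, dt)[0])
    # small dt: the two nearly agree; large dt: Euler drifts (here it even goes negative)
```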
3. Duality of Recurrence and Convolution
Discrete SSMs possess an elegant property: the same model can be computed in two entirely different ways.
Recurrence mode: Step-by-step recursion $h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t$, $y_t = C\,h_t$, with constant cost per step (independent of sequence length). Ideal for inference — each new token requires only a single state update.
Convolution mode: Unrolling the recursion reveals the convolution structure:

$$y_t = C\bar{B}\,x_t + C\bar{A}\bar{B}\,x_{t-1} + C\bar{A}^2\bar{B}\,x_{t-2} + \cdots + C\bar{A}^{t}\bar{B}\,x_0$$

This is exactly the convolution $y = \bar{K} * x$, with kernel $\bar{K} = (C\bar{B},\, C\bar{A}\bar{B},\, C\bar{A}^2\bar{B},\, \ldots,\, C\bar{A}^{L-1}\bar{B})$. Using FFT acceleration (convolution in time domain = multiplication in frequency domain), the entire sequence is computed in parallel in $O(L \log L)$ time ($L$ = sequence length). Ideal for training — all tokens are processed simultaneously.
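The duality can be verified in a few lines. A minimal NumPy sketch (diagonal $\bar{A}$, single channel, illustrative sizes); the real S4 uses an FFT instead of the direct convolution used here:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 64
A_bar = rng.uniform(0.7, 0.99, N)    # diagonal discrete A (stable)
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
x = rng.normal(size=L)

# Recurrence mode: one small state update per step (ideal for inference).
h, y_rec = np.zeros(N), np.zeros(L)
for t in range(L):
    h = A_bar * h + B_bar * x[t]
    y_rec[t] = C @ h

# Convolution mode: kernel K_k = C A_bar^k B_bar, whole sequence at once (ideal for training).
K = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])
y_conv = np.convolve(x, K)[:L]       # causal convolution; S4 would use an FFT here

print(np.allclose(y_rec, y_conv))    # True: same model, two computation modes
```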
This duality is the core contribution of S4 (“Efficiently Modeling Long Sequences with Structured State Spaces,” Gu et al., ICLR 2022): use convolution mode during training to fully exploit GPU parallelism, then switch to recurrence mode at inference for incremental generation.
S4’s technical key: Computing the convolution kernel $\bar{K}$ requires high powers of $\bar{A}$ — expensive for general matrices. S4 uses the NPLR/DPLR parameterization (decomposing $A$ into Diagonal + Low-Rank), leveraging the Cauchy kernel formula for efficient computation. The simplified S4D uses a diagonal $A$ directly, making $\bar{A}^k$ simply the element-wise $k$-th power of the diagonal.
S4’s landmark results: SoTA on all Long Range Arena tasks, including Path-X (16384-step sequence classification) where all prior methods had failed; Sequential CIFAR-10 at 91% (no data augmentation); generation speed 60x faster than same-parameter Transformers.
3.5 From S4 to Mamba: Why Selectivity Is Needed
Despite S4’s excellent performance on long-sequence tasks, it has a fundamental limitation — Linear Time-Invariance (LTI): all parameters ($A$, $B$, $C$, $\Delta$) remain fixed regardless of the input. This means the SSM processes “the” and “cat” in exactly the same way. From the convolution perspective: a fixed convolution kernel = a fixed filter, unable to select based on content.
The Selective Copying task (Mamba paper Section 4.1) clearly exposes this problem: the task requires selecting colored tokens from noisy tokens and copying them in order, with variable spacing. S4 (LTI) achieves only 18.3% accuracy, while Mamba (selective) reaches 99.8%. The reason: a static convolution kernel cannot handle selective copying with variable spacing.
H3 (Hungry Hungry Hippos, Fu et al., 2022) attempted a transitional approach: adding some Attention layers outside the SSM layers to supplement retrieval capability. A 125M H3-Attention hybrid model outperformed a pure Transformer. But this raised a more fundamental question: rather than adding Attention externally to compensate for SSM limitations, why not make the SSM itself selective? This line of thinking directly led to Mamba.
4. Mamba’s Selectivity Mechanism
Mamba (Gu & Dao, 2023) introduces the core innovation of Selectivity: making key parameters dependent on the current input, fundamentally solving the LTI limitation described in Section 3.5.
- Selective $B$: $B_t = \mathrm{Linear}_B(x_t)$ — controls “what to write into the state”
- Selective $C$: $C_t = \mathrm{Linear}_C(x_t)$ — controls “what to read from the state”
- Selective $\Delta$: $\Delta_t = \mathrm{softplus}(\mathrm{Linear}_\Delta(x_t))$ — controls “state update intensity”
The mechanism of $\Delta_t$: since $A$ is initialized as negative, the behavior of $\bar{A}_t = \exp(\Delta_t A)$ depends on the magnitude of $\Delta_t$. Large $\Delta_t$ → $\Delta_t A$ is a large negative number → $\bar{A}_t \to 0$, the old state is heavily decayed, while $\bar{B}_t$ increases and the current input is strongly written in — the effect is “reset & focus” (clear old state, focus on current token). Small $\Delta_t$ → $\bar{A}_t \to I$, the old state is almost fully preserved, while $\bar{B}_t \to 0$ and the current input is nearly ignored — the effect is “persist & ignore” (keep old state, ignore current token).
For content words (“cat”, “mat”), the model outputs a large $\Delta_t$ to strongly write the current token into the state; for function words (“the”, “on”), it outputs a small $\Delta_t$ to leave the state nearly unchanged.
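Putting the three selective parameters together, here is a minimal sequential reference of the selective scan (diagonal $A$, ZOH on $A$, and the simplified $\bar{B}_t = \Delta_t B_t$ as in the Mamba paper); shapes and projection names are illustrative, and this is the plain math rather than the fused CUDA kernel:

```python
import torch
import torch.nn.functional as F

def selective_scan_ref(x, A, W_dt, W_B, W_C):
    """Sequential reference of the selective scan: input-dependent Delta, B, C.
    x: (L, D) inputs; A: (D, N) negative diagonal; W_*: projection matrices."""
    L, D = x.shape
    N = A.shape[1]
    h = torch.zeros(D, N)
    ys = []
    for t in range(L):
        dt = F.softplus(x[t] @ W_dt)            # (D,)  Delta_t: per-channel step size
        B_t = x[t] @ W_B                         # (N,)  what to write into the state
        C_t = x[t] @ W_C                         # (N,)  what to read from the state
        A_bar = torch.exp(dt[:, None] * A)       # (D, N) ZOH on A
        B_bar = dt[:, None] * B_t[None, :]       # (D, N) simplified discretization of B
        h = A_bar * h + B_bar * x[t][:, None]    # selective state update
        ys.append(h @ C_t)                       # (D,)  selective read-out
    return torch.stack(ys)                       # (L, D)

L, D, N = 16, 4, 8
x = torch.randn(L, D)
A = -torch.rand(D, N) - 0.5                      # negative entries: stable decay
y = selective_scan_ref(x, A, torch.randn(D, D), torch.randn(D, N), torch.randn(D, N))
print(y.shape)                                   # torch.Size([16, 4])
```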
RNN gate equivalence: Mamba paper Theorem 1 proves that when $N = 1$, $A = -1$, $B = 1$, and $\Delta_t = \mathrm{softplus}(\mathrm{Linear}(x_t))$, the selective SSM degenerates exactly to a gated RNN: $g_t = \sigma(\mathrm{Linear}(x_t))$, $h_t = (1 - g_t)\,h_{t-1} + g_t\,x_t$. $1 - g_t$ controls forgetting, $g_t$ controls writing — both coupled through a single scalar (unlike LSTM’s independent forget and input gates). In other words, the SSM’s discretization step is the theoretical foundation for RNN’s heuristic gating mechanisms.
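The reduction can be checked numerically in a couple of lines, assuming the ZOH discretization given above:

```python
import numpy as np

# With N = 1, A = -1, B = 1 and Delta_t = softplus(z_t), ZOH gives
#   A_bar = exp(-Delta_t) = 1 - sigmoid(z_t),  B_bar = 1 - exp(-Delta_t) = sigmoid(z_t),
# i.e. exactly h_t = (1 - g_t) h_{t-1} + g_t x_t with g_t = sigmoid(z_t).
z = np.linspace(-4.0, 4.0, 9)
delta = np.log1p(np.exp(z))                 # softplus
g = 1.0 / (1.0 + np.exp(-z))                # sigmoid
print(np.allclose(np.exp(-delta), 1 - g))   # True
print(np.allclose(1 - np.exp(-delta), g))   # True
```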
Parallel Scan: Parallel Training for Selective SSMs
The cost of selectivity: once parameters depend on the input, the SSM is no longer time-invariant (LTI), the convolution kernel changes with the input, and FFT acceleration is no longer possible. The recurrence $h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t$ has parameters varying with $t$ — how to parallelize?
Key observation: the recurrence is an associative operation. Representing step $t$ as the affine map $h \mapsto a_t h + b_t$ (with $a_t = \bar{A}_t$, $b_t = \bar{B}_t x_t$), two consecutive steps $(a_1, b_1)$ and $(a_2, b_2)$ compose into $(a_2 a_1,\ a_2 b_1 + b_2)$. Associativity means we can use the parallel prefix sum (scan) algorithm — similar to parallel reduction on GPUs (see the sketch after this list):
- Work: $O(L)$ (total computation same as sequential execution)
- Span: $O(\log L)$ (critical path length, determines parallel time)
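A minimal sketch of the scan, using the simple doubling (Hillis-Steele) formulation for clarity; note this variant does $O(L \log L)$ work, while the work-efficient Blelloch scan achieves the $O(L)$ work and $O(\log L)$ span quoted above:

```python
import numpy as np

def combine(e1, e2):
    """Associative operator: compose h -> a1*h + b1 (earlier) with h -> a2*h + b2 (later)."""
    (a1, b1), (a2, b2) = e1, e2
    return a2 * a1, a2 * b1 + b2

def parallel_scan(a, b):
    """Doubling (Hillis-Steele) inclusive scan: O(log L) rounds of element-wise work.
    Position t ends up holding the composition of steps 0..t applied to h = 0."""
    A, B = a.copy(), b.copy()
    L, d = len(a), 1
    while d < L:
        A_prev = np.concatenate([np.ones(d), A[:-d]])    # pad with the identity element (1, 0)
        B_prev = np.concatenate([np.zeros(d), B[:-d]])
        A, B = combine((A_prev, B_prev), (A, B))
        d *= 2
    return B

def sequential_scan(a, b):
    h, out = 0.0, np.zeros_like(b)
    for t in range(len(b)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, 16)    # input-dependent decay, like exp(Delta_t * A)
b = rng.normal(size=16)          # input-dependent write, like B_bar_t * x_t
print(np.allclose(parallel_scan(a, b), sequential_scan(a, b)))   # True
```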
Hardware-aware optimization (Mamba paper Section 3.3.2): A naive implementation requires materializing the discretized parameters as tensors of size $(B, L, D, N)$ in HBM — memory explosion. Mamba’s approach: load the SSM parameters directly from HBM to SRAM, perform discretization + scan in SRAM, and write only the final output back to HBM. During backpropagation, intermediate states are not stored but recomputed (similar to gradient checkpointing). Result: same memory footprint as an optimized Transformer with FlashAttention.
Mamba Block Architecture
Mamba merges the functionality of Attention and MLP from Transformers into a more streamlined block:
Architecture highlights:
- Role of Conv1d: Provides local context (kernel size = 4), allowing the SSM to “see” neighboring tokens when making selective decisions. Without Conv1d, the SSM’s $B_t$, $C_t$, $\Delta_t$ would make decisions based solely on individual token embeddings.
- Gating mechanism: The SiLU gate on the right branch is similar to GLU (Gated Linear Unit). Output = SSM_output ⊗ SiLU(gate_input), controlling information flow and preventing over-activation.
- Dimension changes: expand factor $E = 2$ ($d_{\text{inner}} = E \cdot D$), the input Linear maps $D \to 2\,d_{\text{inner}}$ (SSM branch + gate branch), the output Linear maps $d_{\text{inner}} \to D$.
- Comparison with Transformer: Transformer block = LayerNorm → Multi-Head Attention → Residual → LayerNorm → MLP → Residual; Mamba block = LayerNorm → Linear → (Conv1d → SiLU → SSM) ⊗ Gate → Linear → Residual. A structural code sketch follows below.
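A structural PyTorch sketch of the block wiring (the SSM itself is stubbed out, and dimensions and defaults are illustrative, not the official implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Structural sketch: in_proj -> (conv1d -> SiLU -> SSM) * SiLU(gate) -> out_proj.
    The selective SSM is a placeholder here; see the selective-scan sketch above."""
    def __init__(self, d_model: int, d_state: int = 16, d_conv: int = 4, expand: int = 2):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)            # D -> 2*E*D (SSM + gate branches)
        self.conv1d = nn.Conv1d(d_inner, d_inner, d_conv,
                                groups=d_inner, padding=d_conv - 1)  # depth-wise, causal local context
        self.out_proj = nn.Linear(d_inner, d_model)                # E*D -> D
        self.d_inner, self.d_state = d_inner, d_state

    def ssm(self, x):                                              # placeholder for the selective scan
        return x                                                   # (B, L, d_inner)

    def forward(self, x):                                          # x: (B, L, D)
        B, L, _ = x.shape
        x_branch, z = self.in_proj(x).chunk(2, dim=-1)             # SSM branch and gate branch
        x_branch = self.conv1d(x_branch.transpose(1, 2))[..., :L].transpose(1, 2)
        x_branch = F.silu(x_branch)
        y = self.ssm(x_branch)                                     # selective SSM over the sequence
        y = y * F.silu(z)                                          # GLU-style gating
        return self.out_proj(y)                                    # residual + LayerNorm live outside

block = MambaBlockSketch(d_model=64)
print(block(torch.randn(2, 32, 64)).shape)                         # torch.Size([2, 32, 64])
```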
5. Mamba-2: State Space Duality
Mamba-2 (Dao & Gu, ICML 2024) establishes a deep mathematical connection between SSMs and Attention — State Space Duality (SSD).
Core finding: unrolling the SSM recurrence into matrix form $y = M x$, the matrix $M$ (with $M_{ij} = C_i \bar{A}_i \cdots \bar{A}_{j+1} \bar{B}_j$ for $i \ge j$) is a semiseparable matrix. Any submatrix of the lower-triangular part of $M$ has rank at most $N$ (the state dimension). Intuition: since each step has only an $N$-dimensional state “carrying” information, the matrix’s “information bandwidth” is bounded by $N$. In contrast, for Attention the rank of $QK^\top$ is bounded by the head dim, much higher than SSM’s $N$, making it more expressive but also slower.
Chunk-wise algorithm: Divide a length-$L$ sequence into chunks, each of size $Q$. Within each chunk, use matrix multiplication (a $Q \times Q$ matrix, leveraging tensor cores); between chunks, use the SSM scan (passing only the $N$-dimensional state). Total computation and memory both stay linear in $L$ — similar to FlashAttention’s blocking strategy.
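The rank bound is easy to see numerically. A small sketch that builds the unrolled matrix $M$ (with a fixed diagonal $\bar{A}$ for simplicity, i.e. the LTI case) and checks the rank of an off-diagonal block:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 64, 4                                   # sequence length, state dimension (illustrative)
A_bar = np.diag(rng.uniform(0.8, 0.99, N))     # fixed diagonal discrete A (LTI case, for simplicity)
B_bar = rng.normal(size=(L, N))                # per-step input projection
C = rng.normal(size=(L, N))                    # per-step output projection

# Unrolled recurrence as a matrix: y = M x with M[i, j] = C_i A_bar^{i-j} B_bar_j for i >= j.
M = np.zeros((L, L))
for i in range(L):
    P = np.eye(N)                              # A_bar^0, then A_bar^1, ... as j walks backwards
    for j in range(i, -1, -1):
        M[i, j] = C[i] @ P @ B_bar[j]
        P = P @ A_bar

# Any off-diagonal block "sees" the past only through the N-dimensional state:
print(np.linalg.matrix_rank(M[L // 2:, :L // 2]))   # <= N (prints 4 here)
```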
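A minimal sketch of the chunk-wise idea for a scalar-state recurrence (the real SSD algorithm does this with batched matrix multiplications over multi-dimensional states; names and sizes here are illustrative):

```python
import numpy as np

def chunked_scan(a, b, Q):
    """Chunk-wise evaluation of h_t = a_t * h_{t-1} + b_t (scalar state for clarity).
    Inside a chunk: one dense Q x Q lower-triangular decay matrix and a matmul
    (tensor-core friendly); between chunks: only a single carried state."""
    L = len(b)
    out, carry = np.zeros_like(b), 0.0
    for s in range(0, L, Q):
        ac, bc = a[s:s + Q], b[s:s + Q]
        cum = np.cumsum(np.log(ac))                       # log cumulative decay within the chunk
        # M[t, k] = a_{k+1} * ... * a_t for k <= t (and 1 on the diagonal)
        M = np.tril(np.exp(cum[:, None] - cum[None, :]))
        hc = M @ bc + np.exp(cum) * carry                 # intra-chunk matmul + carried state
        out[s:s + Q] = hc
        carry = hc[-1]                                    # pass only the state to the next chunk
    return out

def sequential(a, b):
    h, out = 0.0, np.zeros_like(b)
    for t in range(len(b)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 0.99, 64)      # positive decay factors, as produced by exp(Delta * A)
b = rng.normal(size=64)
print(np.allclose(chunked_scan(a, b, Q=16), sequential(a, b)))   # True
```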
Multi-head SSM: Mamba-1 has head dim 1 (each feature dimension runs an independent SSM). Mamba-2 introduces head dim 64 or 128 (multiple dimensions share the same $A$, $B$, $C$, analogous to multi-head attention). A larger head dim makes matrix multiplication more efficient (GPUs prefer large matrices) while enhancing model expressiveness.
Structured masked attention perspective: In SSD’s dual form, the lower-triangular mask entry $L_{ij} = a_i \cdots a_{j+1}$ is a product of input-dependent scalars. This can be seen as replacing Transformer’s heuristic positional encoding with a data-dependent positional mask. The main source of speedup is that the semiseparable structure allows sub-quadratic chunk-wise algorithms, not merely the removal of softmax.
Performance: Mamba-2’s core layer is 2-8x faster than Mamba-1’s fused scan; the crossover with FlashAttention-2 occurs at sequence length ~2K, and at 16K sequence length, it is ~6x faster than FlashAttention-2.
6. Benchmarks and Practical Comparison
All data below are from the Mamba paper (Gu & Dao, arXiv:2312.00752).
Language Model Quality
On the Pile dataset, Mamba outperforms same-scale Pythia (Transformer baseline) across all parameter scales:
| Model | Parameters | Perplexity |
|---|---|---|
| Pythia-1.4B | 1.4B | 7.51 |
| RWKV-1.5B | 1.5B | 7.70 |
| Mamba-1.4B | 1.4B | 6.80 |
| Pythia-2.8B | 2.8B | 6.73 |
| Mamba-2.8B | 2.8B | 6.22 |
Downstream task average accuracy (Mamba paper Table 1): Mamba-1.4B achieves 59.7%, surpassing the 2x-larger Pythia-2.8B (59.1%); Mamba-2.8B achieves 63.3%, surpassing the 2.5x-larger Pythia-6.9B (61.7%). In other words, Mamba roughly matches Transformer quality at half the parameters.
Inference Throughput
Mamba is 5x faster than same-parameter Transformers (A100 80GB, prompt 2048 + gen 128).
SSM Weaknesses
SSM’s fixed $N$-dimensional state is inherently lossy compression. When precise recall of a specific token from a long sequence is needed (e.g., “what was the 3000th token?”), an $N$-dimensional vector cannot losslessly store all the information from $L \gg N$ tokens. This is not a bug, it’s a feature — SSMs excel at summarization and pattern recognition, not exact retrieval.
Specific weaknesses include:
- Multi-Query Associative Recall (MQAR): Mamba-1 struggles on this task (Mamba-2 paper Figure 8)
- In-context learning: Pure SSMs remain weaker than Attention; the Mamba-2 paper recommends hybrid architectures
- Long-sequence exact copying: Fixed state dimensions limit exact copying capability
Summary
SSMs provide a fundamentally different sequence modeling paradigm from Attention:
| | Attention | SSM (LTI, e.g. S4) | SSM (Selective, e.g. Mamba) |
|---|---|---|---|
| Training complexity | $O(L^2)$ | $O(L \log L)$ (FFT) | $O(L)$ (parallel scan) |
| Inference cache | $O(L)$ (KV cache) | $O(N)$ (fixed state) | $O(N)$ (fixed state) |
| History access | Exact (any token pair) | Compressed + fixed pattern | Compressed + adaptive pattern |
| Content awareness | Full ($Q$, $K$, $V$ all input-dependent) | None (parameters fixed) | Partial ($B$, $C$, $\Delta$ input-dependent) |
| Long-sequence ICL | Strong | Weak | Medium (still weaker than Attention) |
| Core strength | Exact retrieval | Parallel training + efficient inference | Selectivity + linear complexity |
The SSM development roadmap clearly shows the key problem solved at each step: HiPPO (2020) solved “how to initialize A to remember long history” → S4 (2021) solved “how to train efficiently” (convolution mode) → H3 (2022) discovered “SSMs need selectivity” → Mamba (2023) solved “how to make SSMs selectively process input” → Mamba-2 (2024) unified the mathematical frameworks of SSMs and Attention.
This fundamental limitation leads to the topic of the next article: how Hybrid architectures combine Attention’s precise retrieval with SSM’s efficient summarization, leveraging the strengths of both.
Further Reading
- Attention Computation — Understanding the standard Attention computation process
- Flash Attention — Another approach to optimizing Attention efficiency
- Attention Variants — Sliding Window, MLA, and other Attention optimizations
- Hybrid Architectures — Fusing Mamba with Attention