
Hybrid Architectures: Fusing Mamba with Attention

Updated 2026-04-06

Why Pure SSM Is Not Enough

In the previous article, we saw that SSM/Mamba achieves linear complexity and a constant-size inference cache through a fixed-size state vector $x \in \mathbb{R}^N$. But this very advantage is also its fundamental limitation: an N-dimensional state vector cannot precisely store the complete information of M tokens (when $M \gg N$, information loss is inevitable).

This limitation is most clearly exposed in the copying task. Given input “A B C D | ? ? ? ?”, the model needs to precisely copy the first half to the second half. Transformer’s Attention matrix can directly draw connections from output positions to source tokens, precisely copying sequences of arbitrary length. SSM, however, must compress all source tokens into its fixed-size state — earlier tokens suffer more information decay, and accuracy degrades as sequences grow longer.
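
To make the task concrete, here is a minimal Python sketch of how such a copying benchmark can be generated; the separator symbol, vocabulary, and helper name are illustrative choices, not taken from any specific paper:

```python
import random

SEP, PAD = "|", "?"

def make_copy_example(length, vocab="ABCDEFGH", seed=None):
    """Build one copying example: the model sees the prefix plus placeholder
    slots and must reproduce the prefix in the second half."""
    rng = random.Random(seed)
    src = [rng.choice(vocab) for _ in range(length)]
    inputs = src + [SEP] + [PAD] * length   # e.g. "A B C D | ? ? ? ?"
    targets = src                           # expected output for the ? slots
    return inputs, targets

# A Transformer can fill slot i by attending directly to position i of the
# prefix; an SSM must instead recover src[i] from a state of fixed size N,
# which becomes lossy once length >> N.
inputs, targets = make_copy_example(4, seed=0)
print(" ".join(inputs), "->", " ".join(targets))
```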

Jelassi et al. (2024) rigorously proved this in the “Repeat After Me” paper: a two-layer Transformer can copy strings whose length is exponential in its size, while SSMs with a fixed-size state fundamentally cannot. This is not an engineering problem but a theoretical limitation of the architecture.

[Figure: Copying task, Transformer vs SSM. Task: precisely copy the tokens before the separator (“A B C D | ? ? ? ?” → “A B C D”). Every position in the Attention matrix can directly access any historical token, so a Transformer copies with 100% accuracy at any length; a two-layer Transformer can copy exponentially long strings (Jelassi et al. 2024).]

NVIDIA found in 8B-scale experiments that pure Mamba-2 models score an average of 2.65 points lower than pure Transformers across 12 standard benchmarks, while an 8B Mamba-2-Hybrid (mixed architecture) scores an average of 2.65 points higher than the pure Transformer. This finding underpins the current consensus: pure Attention is too expensive, pure SSM is too weak, and a hybrid of the two is the better trade-off.

Three Fusion Paradigms

How should SSM and Attention be combined? There are currently three mainstream fusion paradigms, each with its own trade-offs:

Interleaved: Mamba layers and Attention layers are stacked in alternating fashion at a fixed ratio. For example, in every 8 layers, 7 use Mamba and 1 uses Attention. Simple to implement with flexible control over the SSM:Attention ratio. Most layers use Mamba to save KV cache, while a few Attention layers ensure in-context learning capability. The representative model is AI21’s Jamba.

Parallel: Within the same layer, SSM heads and Attention heads run in parallel, with outputs combined via addition. Each layer simultaneously benefits from precise retrieval (from Attention) and efficient summarization (from SSM). Implementation is more complex, requiring coordination of dimensions and fusion methods between the two types of heads. The representative model is NVIDIA’s Hymba.

Shared: A small number of Attention layers are reused across multiple positions by Mamba layers (parameter sharing). Extreme parameter efficiency — just 2 Attention layers can compensate for SSM’s copying/ICL deficiencies, with LoRA adapters providing specialization at different invocation positions. The representative model is Zyphra’s Zamba2.

[Figure: The three hybrid fusion paradigms (Interleaved / Parallel / Shared). The interleaved example is Jamba-style: groups of 7 Mamba layers + 1 Attention layer (7:1), with MoE on some layers; KV cache shrinks to 1/8, making it suitable for long-context large models such as Jamba (52B total / 12B active).]

The choice among the three paradigms depends on the target scenario:

  • Long context + large model -> Interleaved (high SSM ratio compresses KV cache)
  • Small model + edge deployment -> Shared (highest parameter efficiency)
  • Strong ICL needs + precise retrieval -> Parallel (every layer has Attention)

General rule: higher SSM ratio -> better long-sequence efficiency but weaker ICL; higher Attention ratio -> the opposite.
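
To make the structural difference tangible, the sketch below builds the per-layer type pattern for each paradigm; the function names, the 7:1 ratio, the sharing period, and the number of shared blocks are assumptions chosen for readability rather than the exact configurations of Jamba, Hymba, or Zamba2:

```python
def interleaved(n_layers, ssm_per_attn=7):
    """Interleaved: one Attention layer per (ssm_per_attn + 1) layers, rest Mamba."""
    group = ["mamba"] * ssm_per_attn + ["attention"]
    return [group[i % len(group)] for i in range(n_layers)]

def parallel(n_layers):
    """Parallel: every layer fuses SSM heads and Attention heads."""
    return ["mamba+attention"] * n_layers

def shared(n_layers, n_shared_blocks=2, every=6):
    """Shared: a Mamba backbone that periodically calls one of a few reused
    Attention blocks (ABAB-style), each call site adding its own LoRA adapter."""
    layers = []
    for i in range(n_layers):
        if i % every == every - 1:
            layers.append(f"shared_attn_{(i // every) % n_shared_blocks}")
        else:
            layers.append("mamba")
    return layers

print(interleaved(16))  # Jamba-style 7:1 stacking
print(shared(18))       # Zamba2-style reuse of two Attention blocks
```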

Jamba: Large-Scale Interleaved Hybrid

Jamba (AI21 Labs, 2024) is the first production model to successfully deploy a Hybrid architecture at large scale (52B parameters).

Architecture design: 32 layers, grouped in blocks of 8: 7 Mamba layers + 1 Attention layer per block. Some layers integrate MoE (16 experts, top-2 routing), resulting in 52B total parameters but only 12B active parameters per token.

[Figure: Jamba architecture, interleaved Hybrid + MoE. 52B total / 12B active, 256K context, deployable on a single 80GB GPU. Repeating blocks of 8 layers (7 Mamba + 1 Attention) with MoE (16 experts, top-2 active) on some layers; KV cache is generated only at the Attention layers, making it 1/8 the size of a pure Transformer's, while Mamba layers keep a fixed-size state that does not grow with sequence length. Parameter distribution: Mamba layers 45%, Attention layers 15%, MoE experts 40%.]

Design motivation: At 256K context, a pure Transformer’s KV cache would consume enormous amounts of memory. Jamba’s KV cache is only generated at Attention layers (1/8 of total layers), while Mamba layers use fixed-size state, making the KV cache only 1/8 the size of an equivalent pure Transformer.
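
A back-of-the-envelope estimate illustrates the saving. The head count, head dimension, and fp16 precision below are illustrative assumptions, not Jamba's published configuration; the point is simply that cache size scales with the number of Attention layers:

```python
def kv_cache_gib(seq_len, n_attn_layers, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2, batch=1):
    """Rough KV-cache size: one K and one V tensor per Attention layer."""
    total_bytes = (2 * n_attn_layers * n_kv_heads * head_dim
                   * seq_len * bytes_per_elem * batch)
    return total_bytes / 2**30

# 256K context, 32-layer stack:
print(kv_cache_gib(256_000, n_attn_layers=32))  # every layer is Attention -> ~31 GiB
print(kv_cache_gib(256_000, n_attn_layers=4))   # 1/8 of layers are Attention -> ~3.9 GiB
```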

Key data:

  • 256K context window, supporting extremely long text
  • 52B total parameters / 12B active parameters -> deployable on a single 80GB GPU
  • At comparable parameter scale: significantly outperforms pure Transformer on long-context tasks, matches or slightly exceeds on short-context tasks

Jamba validated an important conclusion: most Transformer layers can be replaced with Mamba without quality loss, and a small number of Attention layers suffice to maintain ICL capability. This 7:1 ratio has become the reference baseline for subsequent interleaved Hybrid designs.

Zamba2: Parameter-Efficient Shared Hybrid

Zamba2 (Zyphra, 2024) took an entirely different route: extreme parameter efficiency.

Architecture design: Mamba2 backbone + only 2 shared Attention layers. These 2 Attention layers are reused at multiple positions in an ABAB pattern — their Q/K/V projection weights are shared, but each invocation position has independent LoRA adapters for specialization. This means different positions share most Attention parameters, yet through low-rank adjustments can still exhibit position-specific behavior.
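
A minimal PyTorch sketch of this sharing scheme, under simplifying assumptions (a single shared attention module, LoRA deltas applied to the block input rather than to the individual Q/V projections, and no positional encoding or masking details):

```python
import torch
import torch.nn as nn

class SharedAttentionBlock(nn.Module):
    """One Attention block reused at several depths of the network; each call
    site only owns a small pair of LoRA matrices for specialization."""
    def __init__(self, d_model, n_heads, n_call_sites, rank=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lora_down = nn.ModuleList(
            [nn.Linear(d_model, rank, bias=False) for _ in range(n_call_sites)])
        self.lora_up = nn.ModuleList(
            [nn.Linear(rank, d_model, bias=False) for _ in range(n_call_sites)])

    def forward(self, x, call_site):
        # Position-specific low-rank adjustment on top of the shared weights.
        h = x + self.lora_up[call_site](self.lora_down[call_site](x))
        out, _ = self.attn(h, h, h, need_weights=False)
        return out

# The same block (and thus the same Q/K/V/O weights) is invoked at several
# positions of the Mamba backbone; only the LoRA adapters differ per position.
block = SharedAttentionBlock(d_model=256, n_heads=4, n_call_sites=4)
x = torch.randn(1, 10, 256)
y0 = block(x, call_site=0)
y1 = block(x, call_site=1)   # same shared weights, different specialization
```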

Core innovations:

  • Shared Attention parameters + LoRA specialization: Achieves both parameter efficiency and inter-layer differentiation
  • Embedding concatenation: Concatenates the original embedding to each Attention block’s input, preventing deep-layer information degradation
  • Mamba2 backbone: Leverages SSD’s chunk-wise algorithm to accelerate training

Key data (2.7B parameters):

  • Inference efficiency equivalent to a 1-2B pure Transformer
  • Output quality equivalent to a 3-4B pure Transformer
  • Compared to Phi-3 3.8B: 2x faster TTFT (Time to First Token), 27% less memory, 1.29x lower generation latency

Zamba2’s insight is: Attention’s role is to “complement” SSM’s weaknesses, not replace it. Just 2 shared Attention layers are sufficient — they primarily handle the precise retrieval tasks that SSM cannot do well. This makes Zamba2 an ideal choice for edge deployment.

Hymba: Parallel Fusion + Meta Tokens

Hymba (NVIDIA, 2024) proposed the most fine-grained fusion approach: running Attention heads and SSM heads simultaneously within every layer.

Architecture design: Each layer contains two groups of heads: Attention heads and SSM heads. Input token embeddings are fed to both groups simultaneously; they compute independently and their outputs are summed. Additionally, Hymba introduces Meta Tokens — a set of learnable token prefixes prepended to the input sequence. Meta tokens store global key information (such as language features, task type), reducing the amount of information Attention needs to retrieve from actual tokens.

[Figure: Step 1, Input + Meta Tokens. Learnable meta tokens M₁, M₂, M₃ are concatenated in front of the input sequence t₁…t₅, giving [M₁, M₂, M₃, t₁, t₂, t₃, t₄, t₅]. The meta tokens are learnable parameters storing global key information (e.g., task type, language features); they reduce what Attention must retrieve from the actual tokens and help compress the KV cache.]
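
The sketch below shows the shape of such a layer in PyTorch. It is only illustrative: a GRU stands in for the Mamba SSM branch to keep the example short, the two branches are fused by simple addition as described above, and the meta-token count is an arbitrary choice:

```python
import torch
import torch.nn as nn

class ParallelHybridLayer(nn.Module):
    """Parallel fusion: the meta-token-prefixed sequence is processed by an
    Attention branch and a recurrent branch, and their outputs are added."""
    def __init__(self, d_model, n_heads, n_meta=4):
        super().__init__()
        self.meta = nn.Parameter(torch.randn(1, n_meta, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ssm = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for Mamba
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        b, t, _ = x.shape
        h = torch.cat([self.meta.expand(b, -1, -1), x], dim=1)  # prepend meta tokens
        causal = torch.triu(torch.ones(h.size(1), h.size(1), dtype=torch.bool), 1)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        ssm_out, _ = self.ssm(h)
        out = self.norm(attn_out + ssm_out)  # parallel branches fused by addition
        return out[:, -t:]                   # drop the meta-token positions

layer = ParallelHybridLayer(d_model=256, n_heads=4)
y = layer(torch.randn(2, 16, 256))           # -> shape (2, 16, 256)
```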

Further optimizations:

  • Cross-layer KV sharing: Adjacent layers share Attention’s KV cache, reducing storage overhead (see the sketch after this list)
  • Partial sliding window attention: Some Attention heads use local windows instead of global Attention, further compressing computation
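
A hedged sketch of the cross-layer KV sharing idea: the second layer skips its own K/V projection and attends over the K/V tensors produced by the previous layer. Names and shapes are illustrative, not Hymba's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVSharingAttention(nn.Module):
    """Attention layer that can either compute its own K/V or reuse the K/V
    produced by a neighbouring layer (cross-layer KV sharing)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_proj = nn.Linear(d_model, 2 * d_model)  # unused when K/V are reused
        self.o_proj = nn.Linear(d_model, d_model)

    def _split(self, x):
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x, shared_kv=None):
        b, t, d = x.shape
        q = self._split(self.q_proj(x))
        if shared_kv is None:
            k, v = (self._split(p) for p in self.kv_proj(x).chunk(2, dim=-1))
        else:
            k, v = shared_kv                      # reuse the neighbour's cache
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, d)
        return self.o_proj(out), (k, v)

layer1 = KVSharingAttention(256, 4)
layer2 = KVSharingAttention(256, 4)
x = torch.randn(1, 8, 256)
y1, kv = layer1(x)                    # this layer stores K/V for the pair
y2, _ = layer2(y1, shared_kv=kv)      # next layer reuses it: half the KV storage
```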

Key data (1.5B parameters):

  • Outperforms Llama-3.2-3B (+1.32% average score) with only half the parameters
  • KV cache 11.67x smaller than Llama-3.2-3B
  • Throughput 3.49x higher

Hymba’s parallel design ensures every layer simultaneously leverages Attention’s precise retrieval and SSM’s efficient summarization. The trade-off is the highest implementation complexity, but also the best results — especially suitable for small models that need strong ICL capability.

Model Comparison

The table below summarizes key metrics of major Hybrid models. Throughput and KV cache size are relative to a comparable pure Transformer baseline:

| Model | Total Params | Active Params | SSM:Attn | Throughput | Avg Score | KV Cache |
| --- | --- | --- | --- | --- | --- | --- |
| Jamba | 52B | 12B | 7:1 | 1.6× | 72.1 | 1/8 |
| Zamba2 | 2.7B | 2.7B | ~6:1 | 2.0× | 68.5 | 1/6 |
| Hymba | 1.5B | 1.5B | 1:1 | 3.49× | 67.3 | 1/12 |
| Transformer | – | – | 0:N | 1.0× (baseline) | baseline | 1× (baseline) |
| Mamba | – | – | N:0 | ~5× | baseline −2.65 | O(1) |

Several key observations:

  1. No single optimal ratio: Jamba uses 7:1, Zamba2 uses ~6:1, Hymba uses 1:1 — the optimal SSM:Attention ratio varies by model scale and target task
  2. Hybrid consistently outperforms both extremes: Neither pure Transformer (too expensive) nor pure Mamba (too weak) is the optimal solution
  3. KV cache compression is the core benefit: All Hybrid models have significantly smaller KV cache than pure Transformers

Summary and Outlook

Hybrid architectures represent the current consensus direction for sequence modeling:

| Dimension | Pure Attention | Pure SSM | Hybrid |
| --- | --- | --- | --- |
| Training efficiency | O(N²) | O(N) | O(N) ~ O(N²) |
| Inference KV cache | O(N) | O(1) | Significantly reduced |
| ICL / Copying | Exact | Limited | Exact (via Attention layers) |
| Long-sequence support | Difficult | Native | Good |
| Engineering complexity | Mature | Moderate | Higher |

Trends and open questions:

  • Optimal mixing strategy is still being explored: How to automatically search for the optimal SSM:Attention ratio and placement?
  • Can SSMs overcome the copying limitation? Larger state dimensions or new state update mechanisms may narrow the gap
  • Hardware co-design: Will future chips provide dedicated hardware support for SSM scan operations?
  • Unified framework: Mamba-2’s SSD framework hints that Attention and SSM may ultimately unify as different special cases of the same mathematical framework

Further Reading