Content on this site is AI-generated and may contain errors. If you find issues, please report them at GitHub Issues.

Attention Variants: From Sliding Window to MLA


Updated 2026-04-06

Standard Multi-Head Attention faces three major bottlenecks: $O(n^2)$ computational complexity, KV cache memory consumption, and long-context handling. We previously saw how GQA/MQA reduces the cache by sharing KV heads; here we introduce other important attention variants.

Four optimization directions:

  • Sparsification: Reduce the range each token needs to attend to (Sliding Window Attention)
  • KV compression: Reduce the amount of information stored at each position (MLA)
  • Linearization: Remove softmax, reduce $O(n^2)$ to $O(n)$ (Linear Attention → GDN)
  • Hybrid architectures: Use different strategies at different layers, combining their strengths (Hybrid Attention)

Sliding Window Attention

Core idea: each token only attends to the preceding $w$ tokens, rather than the entire sequence.

$$\text{Attention}(q_i, K, V) = \text{softmax}\!\left(\frac{q_i K_{[i-w+1:i]}^T}{\sqrt{d_k}}\right) V_{[i-w+1:i]}$$

This reduces complexity from $O(n^2)$ to $O(nw)$. Does this lose global information? Not really: after stacking multiple layers, the effective receptive field at layer $L$ is $L \times w$. For example, Mistral 7B has 32 layers with window $w=4096$, giving a theoretical receptive field of $32 \times 4096 = 131072$ tokens.
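As a concrete illustration, here is a minimal sliding-window causal mask in plain Python (a hypothetical sketch, not Mistral's implementation):

```python
def sliding_window_mask(n, w):
    # True where query i may attend to key j, i.e. j in [i-w+1, i].
    return [[(j <= i) and (j > i - w) for j in range(n)] for i in range(n)]

# For n=6, w=3: full causal attention scores 21 pairs,
# the windowed mask only 15 (each row allows at most w keys).
mask = sliding_window_mask(6, 3)
allowed = sum(sum(row) for row in mask)  # 15
```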

[Figure: full causal mask vs. sliding window mask (w=3), keys along the horizontal axis. Computation comparison for the illustrated grid: full attention computes O(n²) = 36 scores, SWA computes O(nw) = 21 (42% less).]

Adopters: Mistral 7B ($w=4096$), Mixtral 8x7B ($w=4096$), Gemma 2 (alternating layers).

Sliding Window also combines perfectly with Flash Attention — computation within the window uses Flash Attention’s tiling strategy efficiently, while everything outside the window is simply skipped.

Hybrid Attention

Not all layers need full attention — mixing different types of attention lets each type play to its strengths.

Common approaches:

  • Gemma 2: Even layers use full attention + odd layers use sliding window attention
  • Jamba (AI21): Alternating Attention layers + Mamba (SSM) layers, with attention layers in the minority
  • Command-R (Cohere): Some layers use full attention + others use local attention
[Figure: hybrid attention layer configurations. Gemma 2: Full Attn at L0/L2/L4/L6, SWA at L1/L3/L5/L7. Jamba: Attention at L0/L3/L6, Mamba (SSM) at the remaining layers.]

The key design question: where to place full attention layers? What ratio to use? Experience shows that layers closer to the output need global information more (global layers go on top), while layers closer to the input can get by with local patterns alone.
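A layer schedule like the ones above can be written down as a simple pattern (a schematic sketch; the real models configure this inside their architecture code, and Jamba's exact attention placement differs):

```python
def gemma2_style(n_layers):
    # Gemma 2-style alternation: even layers full attention, odd layers SWA.
    return ["full" if i % 2 == 0 else "swa" for i in range(n_layers)]

def jamba_style(n_layers, attn_every=3):
    # Hypothetical Jamba-like schedule: an occasional attention layer
    # among Mamba (SSM) layers.
    return ["attention" if i % attn_every == 0 else "mamba"
            for i in range(n_layers)]
```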

Cross Attention

All the attention variants above are self-attention (Q/K/V come from the same sequence). The key difference with cross attention is that Q comes from one sequence, while K/V come from another sequence.

[Figure: Self-Attention. Q = Wq·X, K = Wk·X, V = Wv·X; all of Q, K, V are projected from the same input sequence X.]

Typical Use Cases

Encoder-Decoder architectures (translation, summarization):

  • The decoder’s hidden state serves as Q
  • The encoder’s output serves as K and V
  • Each decoder token “queries” the encoder’s full input to decide which parts of the input to focus on
  • Adopters: T5, BART

Multimodal architectures (image-text understanding):

  • Text decoder tokens serve as Q
  • Image tokens from the vision encoder output serve as K and V
  • Text tokens “query” image tokens, fusing visual information
  • Adopters: Flamingo, LLaVA

Compared to self-attention, cross attention has different KV cache behavior: the KV comes from the encoder’s fixed output and does not grow as generation progresses.
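A minimal numpy sketch of this behavior (assumed toy shapes, not any specific model's code): K and V are projected once from the encoder output and reused at every decoding step, so their size is fixed by the encoder sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, enc_out, W_k, W_v):
    # q: (n_dec, d) decoder states; enc_out: (n_enc, d) encoder output.
    k, v = enc_out @ W_k, enc_out @ W_v       # fixed size, computed once
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_dec, n_enc)
    return softmax(scores) @ v                # (n_dec, d)

rng = np.random.default_rng(0)
d, n_enc = 8, 5
W_k, W_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))
enc_out = rng.normal(size=(n_enc, d))
# One decode step: however many tokens we generate, k/v stay (n_enc, d).
out = cross_attention(rng.normal(size=(1, d)), enc_out, W_k, W_v)
```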

Multi-Head Latent Attention (MLA)

GQA reduces cache by sharing KV heads, but can we be more aggressive? MLA's approach is to apply low-rank compression to the KV cache, storing a low-dimensional compressed latent $c_{KV}$ instead of the full K and V.

Compression process:

$$c_{KV} = W_{DKV} \cdot h \quad (h \text{ is the hidden state}, \; d_{model} \to d_c)$$

$$K = W_{UK} \cdot c_{KV}, \quad V = W_{UV} \cdot c_{KV} \quad (\text{decompression: } d_c \to d_k, d_v)$$

Only $c_{KV}$ needs to be cached (its dimension is much smaller than K+V combined). During inference, $W_{UK}$ and $W_{UV}$ are used to decompress. Even better, $W_{UK}$ can be absorbed into the matrix multiplication with $W_Q$, avoiding explicit decompression.
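A small numpy sketch of the compression and the absorption trick (toy dimensions, not DeepSeek's actual configuration): the attention score computed against the decompressed key equals the score computed directly against the cached latent, because $q^T (W_{UK} c_{KV}) = (W_{UK}^T q)^T c_{KV}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_c, d_k = 256, 32, 64            # toy sizes
W_dkv = rng.normal(size=(d_c, d_model))    # compression (down-projection)
W_uk = rng.normal(size=(d_k, d_c))         # K decompression (up-projection)

h = rng.normal(size=(d_model,))            # hidden state at one position
c_kv = W_dkv @ h                           # cache only this d_c vector

q = rng.normal(size=(d_k,))
score_decompressed = q @ (W_uk @ c_kv)     # explicit decompression to K
score_absorbed = (W_uk.T @ q) @ c_kv       # W_uk absorbed into the query side
assert np.allclose(score_decompressed, score_absorbed)
```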

[Figure: MLA data flow. Hidden state h ($d_{model}$) → compress with $W_{DKV}$ → cache $c_{KV}$ ($d_c$, small, e.g. 512-dim) → decompress with $W_{UK}$ to K ($d_k$) and $W_{UV}$ to V ($d_v$) → attention output. Caching only $c_{KV}$ yields significant memory savings; $W_{UK}$ can be absorbed into $W_Q$, avoiding explicit K decompression.]

The comparison below shows KV cache sizes for MHA, GQA, and MLA under one representative configuration:

KV cache size comparison (FP16, seq_len = 4096, head_dim = d_model / num_heads = 4096 / 32 = 128):

| Method | Calculation | Size |
|---|---|---|
| MHA | 2 × 32 heads × 128 dim × 4096 seq × 2 B | 64.0 MB (100%) |
| GQA (8 KV heads) | 2 × 8 kv_heads × 128 dim × 4096 seq × 2 B | 16.0 MB (25%) |
| MLA | 512 latent_dim × 4096 seq × 2 B | 4.0 MB (6%) |

Taking DeepSeek-V2 as an example ($d_{model}=5120$, 128 heads, $d_c=512$): MLA's KV cache is only about 5% of standard MHA.
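The arithmetic behind these comparisons is simple enough to check directly (per layer, 2 bytes per FP16 element):

```python
MB = 1024 ** 2

def kv_cache_mha(n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V each store n_kv_heads * head_dim values per position.
    return 2 * n_kv_heads * head_dim * seq_len * dtype_bytes

def kv_cache_mla(latent_dim, seq_len, dtype_bytes=2):
    # Only the compressed latent c_KV is stored per position.
    return latent_dim * seq_len * dtype_bytes

print(kv_cache_mha(32, 128, 4096) / MB)  # MHA: 64.0
print(kv_cache_mha(8, 128, 4096) / MB)   # GQA, 8 KV heads: 16.0
print(kv_cache_mla(512, 4096) / MB)      # MLA: 4.0
```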

Adopters: DeepSeek-V2, DeepSeek-V3, DeepSeek-R1

Linear Attention and Gated Delta Net

The variants above all retain softmax; they just reduce the computation range (SWA) or compress the cache (MLA). Linear Attention takes a more radical approach: remove softmax entirely, fundamentally eliminating the $O(n^2)$ complexity.

Core Idea: Remove Softmax, Reorder Computation

Standard Attention computes:

$$\text{Attn}(Q,K,V) = \underbrace{\text{softmax}(QK^T / \sqrt{d})}_{n \times n} \cdot V$$

You must compute $QK^T$ (an $n \times n$ matrix) first, then apply softmax, then multiply by $V$. Softmax is a row-wise normalization, and it blocks matrix-multiplication associativity: you cannot compute $K^T V$ first, because softmax sits in between.

Linear Attention (Katharopoulos et al., 2020) had a key insight: replace softmax with a feature map $\phi$, enabling:

$$\text{Attn}(Q,K,V) = \phi(Q) \cdot \underbrace{(\phi(K)^T \cdot V)}_{d \times d}$$

$\phi(K)^T V$ is a $d \times d$ matrix ($d$ is the head dimension, typically 64-128), independent of the sequence length $n$. When $n \gg d$, this reduces complexity from $O(n^2 d)$ to $O(nd^2)$: true linear complexity.
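The reordering is plain matrix associativity, which is easy to verify numerically (hypothetical ReLU feature map; normalization omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 64
phi = lambda x: np.maximum(x, 0.0)  # a simple non-negative feature map
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

slow = (phi(Q) @ phi(K).T) @ V   # builds an n×n intermediate: O(n²d)
fast = phi(Q) @ (phi(K).T @ V)   # builds a d×d intermediate: O(nd²)
assert np.allclose(slow, fast)   # same result, much smaller intermediate
```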

[Figure: standard attention, softmax(Q·Kᵀ/√d)·V, must materialize the n×n matrix QKᵀ first (for n = 512: 512×512 = 262,144 entries, O(n²d)); linear attention, φ(Q)·(φ(K)ᵀ·V), computes the d×d matrix φ(K)ᵀV first (64×64 = 4,096 entries, O(nd²)), a 64× smaller intermediate.]

Why φ Unlocks Associativity

Softmax blocks associativity because it does two things: (1) exp() ensures non-negativity, (2) row-wise normalization (dividing by the row sum). The normalization couples all elements within the same row: computing $\text{softmax}(q_i^T k_1)$ requires knowing $q_i^T k_2, q_i^T k_3, \ldots, q_i^T k_n$. You cannot compute a single element independently; you must first compute the full row of $QK^T$ ($n \times n$).

φ's strategy: replace the coupled normalization with independent element-wise transforms. Replace the softmax kernel $\text{sim}(q, k) = \exp(q^T k) / Z$ with a kernel function decomposition:

$$\text{sim}(q, k) = \phi(q)^T \phi(k)$$

φ acts independently on each $q$ and $k$ (no need to know other keys' values), and the dot product naturally satisfies associativity. Expanding the $i$-th output (with normalization):

$$o_i = \frac{\phi(q_i)^T \overbrace{\sum_j \phi(k_j) v_j^T}^{d \times d,\ \text{independent of } i}}{\phi(q_i)^T \underbrace{\sum_j \phi(k_j)}_{d \times 1,\ \text{independent of } i}}$$

The sums in both numerator and denominator are independent of the query position $i$: they can be accumulated once over the sequence, after which each query costs only an $O(d^2)$ matrix-vector product rather than attending over all $n$ positions.

Choosing φ requires two properties: (1) non-negative output, ensuring attention weights are non-negative; (2) element-wise independence, i.e. it must not couple elements within a row the way softmax does. Common choices include $\phi(x) = \text{elu}(x) + 1$ (Katharopoulos 2020's original choice) and $\phi(x) = \text{ReLU}(x)$. Theoretically, Random Fourier Features can approximate softmax's exp kernel, but they are computationally expensive.
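For example, $\text{elu}(x) + 1$ is strictly positive and purely element-wise (a one-line sketch; in practice it is applied elementwise to each feature vector):

```python
import math

def elu_plus_one(x):
    # elu(x) + 1 equals x + 1 for x > 0 and e^x for x <= 0:
    # always positive, and each element is transformed independently.
    return x + 1.0 if x > 0 else math.exp(x)
```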

Why φ is fundamentally weaker than softmax: softmax’s exp + normalization naturally produces a sharp, sparse attention distribution (large values are exponentially amplified, small values are suppressed to near-zero), letting the model focus on the most relevant few tokens. Simple φ functions lack this “winner-take-all” effect — all keys contribute roughly equally. This is the fundamental motivation behind subsequent work (RetNet/DeltaNet/GDN) adding decay, delta rule, and gating — compensating for φ’s inability to replicate softmax’s selectivity.

RNN Form: Fixed-Size State

Without softmax, Linear Attention can be written as an RNN recurrence:

$$S_t = S_{t-1} + \phi(k_t) v_t^T, \quad o_t = \phi(q_t) \cdot S_t$$

$S_t$ is a $d \times d$ state matrix that compresses all history. During inference, no KV cache (which grows with sequence length) is needed: just this fixed-size state.
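The recurrence can be checked against the causal parallel form directly (toy sizes, unnormalized for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 4
phi = lambda x: np.maximum(x, 0.0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# RNN form: a fixed d×d state, O(1) per token at inference.
S = np.zeros((d, d))
outs = []
for t in range(n):
    S = S + np.outer(phi(K[t]), V[t])   # S_t = S_{t-1} + φ(k_t) v_tᵀ
    outs.append(phi(Q[t]) @ S)          # o_t = φ(q_t) S_t

# Causal parallel form gives the same last output:
# o_t = φ(q_t) · Σ_{j≤t} φ(k_j) v_jᵀ
o_last = phi(Q[-1]) @ sum(np.outer(phi(K[j]), V[j]) for j in range(n))
assert np.allclose(outs[-1], o_last)
```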

This form is strikingly similar to State Space Models (SSM):

| | Linear Attention | SSM / Mamba |
|---|---|---|
| State update | $S_t = S_{t-1} + k_t v_t^T$ | $h_t = Ah_{t-1} + Bx_t$ |
| Output | $o_t = q_t S_t$ | $y_t = Ch_t$ |
| State size | $d \times d$ matrix (fixed) | $N$-dim vector (fixed) |
| Inference complexity | $O(1)$ per token | $O(1)$ per token |

They are fundamentally the same: compress history into a fixed-size state, update via linear recurrence. Mamba-2’s SSD framework rigorously proved this equivalence — structured SSMs are a form of linear attention with decay.

The Cost: Softmax Isn’t Free to Remove

Softmax provides a sparse, sharp attention distribution — it lets the model focus on a few key tokens (e.g., precisely locating a specific name in a long document). Removing softmax makes the distribution “flat,” reducing the difference in contribution across tokens.

This causes pure Linear Attention to be far weaker than standard Attention on tasks requiring precise retrieval (copying, in-context learning). This is fundamentally the same limitation as SSM/Mamba — a fixed-size state cannot precisely remember specific information from arbitrarily long sequences. See Hybrid Architectures: Why Pure SSM Isn’t Enough.

From Accumulation to Error Correction: Evolution of State Updates

Basic Linear Attention only accumulates ($S_t = S_{t-1} + k_t v_t^T$) and never forgets. Subsequent work focuses on making state updates smarter:

[Timeline: 2020 Basic Linear Attention → 2023 RetNet (exponential decay) → 2024 DeltaNet (delta rule) → 2024 GDN (gated + delta rule). Basic Linear Attention (Katharopoulos et al.): update $S_t = S_{t-1} + k_t v_t^T$, output $o_t = q_t S_t$; removing softmax yields $O(n)$ with a fixed $d \times d$ state, but the state only accumulates and never forgets, so attention becomes "flat".]

Evolution:

  1. Basic Linear Attention (2020): Remove softmax, establish the RNN form. But the state only accumulates, leaving a large performance gap
  2. RetNet (2023, MSR): Add an exponential decay factor $\gamma$ so old information automatically fades each step. But $\gamma$ is a fixed hyperparameter
  3. DeltaNet (2024, Yang et al.): Introduce the delta rule: instead of blind accumulation, first query what's already in the state ($k_t^T S_{t-1}$), then write only the delta ("what's missing"). This is an online learning rule for associative memory
  4. Gated Delta Net (GDN) (2024, ICLR 2025): Combines two complementary mechanisms: $\alpha_t$ gating (from Mamba2) for selective forgetting plus the delta rule for targeted writing. The paper shows GDN outperforming Mamba2 on language modeling, in-context retrieval, and long-context tasks
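A schematic of the two update rules (the exact parameterizations in the DeltaNet/GDN papers differ; the scalar `alpha` and `beta` here stand in for learned per-token values):

```python
import numpy as np

def delta_update(S, k, v, beta):
    # Delta rule: query what the state already returns for k, write the gap.
    pred = k @ S                       # k_tᵀ S_{t-1}: current memory for k
    return S + beta * np.outer(k, v - pred)

def gated_delta_update(S, k, v, alpha, beta):
    # GDN (schematic): gated forgetting (Mamba2-style decay), then delta-write.
    S = alpha * S
    return S + beta * np.outer(k, v - k @ S)

# With beta=1 and a unit-norm key, one delta write stores v exactly:
rng = np.random.default_rng(0)
d = 8
S = rng.normal(size=(d, d))
k = np.zeros(d); k[0] = 1.0            # unit-norm key
v = rng.normal(size=(d,))
S2 = delta_update(S, k, v, beta=1.0)
assert np.allclose(k @ S2, v)
```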

Relationship to Mamba

Linear Attention and SSM/Mamba are two formulations of the same idea — the former starts from Attention (remove softmax), the latter from control theory (state space recurrence), converging at the same point.

GDN directly embodies this convergence: its paper title reads “Improving Mamba2 with Delta Rule” — combining Mamba2’s gating mechanism with linear attention’s delta rule. GDN’s hybrid architecture (alternating GDN layers + sliding window attention layers) follows exactly the same pattern as Jamba (alternating Mamba layers + attention layers).

The Linear Attention family and SSM/Mamba family are rapidly converging. Understanding one means understanding the core ideas of the other. For detailed SSM/Mamba principles, see State Space Models and Mamba. For hybrid architecture design, see Hybrid Architectures.

Comparison Summary

| Method | Complexity | KV Cache | Core Idea |
|---|---|---|---|
| Full MHA | O(n²d) | 2 × n_heads × d_head × seq | Each head has independent Q/K/V, full attention |
| GQA | O(n²d) | 2 × n_kv_heads × d_head × seq | Multiple Q heads share KV heads, reduced KV cache |
| Sliding Window | O(nwd) | 2 × n_heads × d_head × w | Each token attends to previous w tokens; stacking expands the receptive field |
| Cross Attention | O(n·m·d) | 2 × n_heads × d_head × m (encoder) | Q from decoder, KV from encoder/vision |
| MLA | O(n²d) | latent_dim × seq (tiny) | Low-rank compression of KV cache, store compressed latent |
| Hybrid | mixed | varies by layer | Mix different attention types (full + SWA / SSM) |

Selection guide:

  • Long context + low latency → Sliding Window (Mistral approach)
  • Maximum KV cache compression → MLA (DeepSeek approach)
  • Cross-modal / Encoder-Decoder → Cross Attention
  • Linear complexity + fixed state → Linear Attention / GDN (replaces KV cache entirely)
  • Balanced approach → Hybrid (Gemma 2 approach, or alternating GDN + SWA)
  • General KV savings → GQA (currently the most popular compromise)