Content on this site is AI-generated and may contain errors. If you find issues, please report them at GitHub Issues.

Attention Variants: From Sliding Window to MLA


Updated 2026-04-06

Standard Multi-Head Attention faces three major bottlenecks: $O(n^2)$ computational complexity, KV cache memory consumption, and long-context handling. We previously saw how GQA/MQA reduces the cache by sharing KV heads; here we introduce other important attention variants.

Four optimization directions:

  • Sparsification: Reduce the range each token needs to attend to (Sliding Window Attention)
  • KV compression: Reduce the amount of information stored at each position (MLA)
  • Linearization: Remove softmax, reduce $O(n^2)$ to $O(n)$ (Linear Attention → GDN)
  • Hybrid architectures: Use different strategies at different layers, combining their strengths (Hybrid Attention)

Sliding Window Attention

Core idea: each token only attends to the preceding $w$ tokens, rather than the entire sequence.

$$\text{Attention}(q_i, K, V) = \text{softmax}\!\left(\frac{q_i K_{[i-w+1:i]}^T}{\sqrt{d_k}}\right) V_{[i-w+1:i]}$$

This reduces complexity from $O(n^2)$ to $O(nw)$. Does this lose global information? Not really: after stacking multiple layers, the effective receptive field at layer $L$ is $L \times w$. For example, Mistral 7B has 32 layers with window $w=4096$, giving a theoretical receptive field of $32 \times 4096 = 131072$ tokens.
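As a concrete illustration, here is a minimal sliding-window causal mask in plain Python (a hypothetical sketch, not Mistral's implementation):

```python
def sliding_window_mask(n, w):
    # True where query i may attend to key j, i.e. j in [i-w+1, i].
    return [[(j <= i) and (j > i - w) for j in range(n)] for i in range(n)]

# For n=6, w=3: full causal attention scores 21 pairs,
# the windowed mask only 15 (each row allows at most w keys).
mask = sliding_window_mask(6, 3)
allowed = sum(sum(row) for row in mask)  # 15
```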

[Figure: full causal mask vs. sliding window mask (w=3), keys along the horizontal axis. Computation comparison for the illustrated grid: full attention computes O(n²) = 36 scores, SWA computes O(nw) = 21 (42% less).]

Adopters: Mistral 7B ($w=4096$), Mixtral 8x7B ($w=4096$), Gemma 2 (alternating layers).

Sliding Window also combines perfectly with Flash Attention — computation within the window uses Flash Attention’s tiling strategy efficiently, while everything outside the window is simply skipped.

Hybrid Attention

Not all layers need full attention — mixing different types of attention lets each type play to its strengths.

Common approaches:

  • Gemma 2: Even layers use full attention + odd layers use sliding window attention
  • Jamba (AI21): Alternating Attention layers + Mamba (SSM) layers, with attention layers in the minority
  • Command-R (Cohere): Some layers use full attention + others use local attention
[Figure: hybrid attention layer configurations. Gemma 2: Full Attn at L0/L2/L4/L6, SWA at L1/L3/L5/L7. Jamba: Attention at L0/L3/L6, Mamba (SSM) at the remaining layers.]

The key design question: where to place full attention layers? What ratio to use? Experience shows that layers closer to the output need global information more (global layers go on top), while layers closer to the input can get by with local patterns alone.
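A layer schedule like the ones above can be written down as a simple pattern (a schematic sketch; the real models configure this inside their architecture code, and Jamba's exact attention placement differs):

```python
def gemma2_style(n_layers):
    # Gemma 2-style alternation: even layers full attention, odd layers SWA.
    return ["full" if i % 2 == 0 else "swa" for i in range(n_layers)]

def jamba_style(n_layers, attn_every=3):
    # Hypothetical Jamba-like schedule: an occasional attention layer
    # among Mamba (SSM) layers.
    return ["attention" if i % attn_every == 0 else "mamba"
            for i in range(n_layers)]
```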

Cross Attention

All the attention variants above are self-attention (Q/K/V come from the same sequence). The key difference with cross attention is that Q comes from one sequence, while K/V come from another sequence.

[Figure: Self-Attention. Q = Wq·X, K = Wk·X, V = Wv·X; all of Q, K, V are projected from the same input sequence X.]

Typical Use Cases

Encoder-Decoder architectures (translation, summarization):

  • The decoder’s hidden state serves as Q
  • The encoder’s output serves as K and V
  • Each decoder token “queries” the encoder’s full input to decide which parts of the input to focus on
  • Adopters: T5, BART

Multimodal architectures (image-text understanding):

  • Text decoder tokens serve as Q
  • Image tokens from the vision encoder output serve as K and V
  • Text tokens “query” image tokens, fusing visual information
  • Adopters: Flamingo, LLaVA

Compared to self-attention, cross attention has different KV cache behavior: the KV comes from the encoder’s fixed output and does not grow as generation progresses.
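A minimal numpy sketch of this behavior (assumed toy shapes, not any specific model's code): K and V are projected once from the encoder output and reused at every decoding step, so their size is fixed by the encoder sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, enc_out, W_k, W_v):
    # q: (n_dec, d) decoder states; enc_out: (n_enc, d) encoder output.
    k, v = enc_out @ W_k, enc_out @ W_v       # fixed size, computed once
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_dec, n_enc)
    return softmax(scores) @ v                # (n_dec, d)

rng = np.random.default_rng(0)
d, n_enc = 8, 5
W_k, W_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))
enc_out = rng.normal(size=(n_enc, d))
# One decode step: however many tokens we generate, k/v stay (n_enc, d).
out = cross_attention(rng.normal(size=(1, d)), enc_out, W_k, W_v)
```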

Multi-Head Latent Attention (MLA)

GQA reduces cache by sharing KV heads, but can we be more aggressive? MLA's approach is to apply low-rank compression to the KV cache, storing a low-dimensional compressed latent $c_{KV}$ instead of the full K and V.

Compression process:

$$c_{KV} = W_{DKV} \cdot h \quad (h \text{ is the hidden state}, \; d_{model} \to d_c)$$

$$K = W_{UK} \cdot c_{KV}, \quad V = W_{UV} \cdot c_{KV} \quad (\text{decompression: } d_c \to d_k, d_v)$$

Only $c_{KV}$ needs to be cached (its dimension is much smaller than K+V combined). During inference, $W_{UK}$ and $W_{UV}$ are used to decompress. Even better, $W_{UK}$ can be absorbed into the matrix multiplication with $W_Q$, avoiding explicit decompression.
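A small numpy sketch of the compression and the absorption trick (toy dimensions, not DeepSeek's actual configuration): the attention score computed against the decompressed key equals the score computed directly against the cached latent, because $q^T (W_{UK} c_{KV}) = (W_{UK}^T q)^T c_{KV}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_c, d_k = 256, 32, 64            # toy sizes
W_dkv = rng.normal(size=(d_c, d_model))    # compression (down-projection)
W_uk = rng.normal(size=(d_k, d_c))         # K decompression (up-projection)

h = rng.normal(size=(d_model,))            # hidden state at one position
c_kv = W_dkv @ h                           # cache only this d_c vector

q = rng.normal(size=(d_k,))
score_decompressed = q @ (W_uk @ c_kv)     # explicit decompression to K
score_absorbed = (W_uk.T @ q) @ c_kv       # W_uk absorbed into the query side
assert np.allclose(score_decompressed, score_absorbed)
```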

[Figure: MLA data flow. Hidden state h ($d_{model}$) → compress with $W_{DKV}$ → cache $c_{KV}$ ($d_c$, small, e.g. 512-dim) → decompress with $W_{UK}$ to K ($d_k$) and $W_{UV}$ to V ($d_v$) → attention output. Caching only $c_{KV}$ yields significant memory savings; $W_{UK}$ can be absorbed into $W_Q$, avoiding explicit K decompression.]

The comparison below shows KV cache sizes for MHA, GQA, and MLA under one representative configuration:

KV cache size comparison (FP16, seq_len = 4096, head_dim = d_model / num_heads = 4096 / 32 = 128):

| Method | Calculation | Size |
|---|---|---|
| MHA | 2 × 32 heads × 128 dim × 4096 seq × 2 B | 64.0 MB (100%) |
| GQA (8 KV heads) | 2 × 8 kv_heads × 128 dim × 4096 seq × 2 B | 16.0 MB (25%) |
| MLA | 512 latent_dim × 4096 seq × 2 B | 4.0 MB (6%) |

Taking DeepSeek-V2 as an example ($d_{model}=5120$, 128 heads, $d_c=512$): MLA's KV cache is only about 5% of standard MHA.
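The arithmetic behind these comparisons is simple enough to check directly (per layer, 2 bytes per FP16 element):

```python
MB = 1024 ** 2

def kv_cache_mha(n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V each store n_kv_heads * head_dim values per position.
    return 2 * n_kv_heads * head_dim * seq_len * dtype_bytes

def kv_cache_mla(latent_dim, seq_len, dtype_bytes=2):
    # Only the compressed latent c_KV is stored per position.
    return latent_dim * seq_len * dtype_bytes

print(kv_cache_mha(32, 128, 4096) / MB)  # MHA: 64.0
print(kv_cache_mha(8, 128, 4096) / MB)   # GQA, 8 KV heads: 16.0
print(kv_cache_mla(512, 4096) / MB)      # MLA: 4.0
```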

Adopters: DeepSeek-V2, DeepSeek-V3, DeepSeek-R1

Linear Attention and Gated Delta Net

The variants above all retain softmax; they just reduce the computation range (SWA) or compress the cache (MLA). Linear Attention takes a more radical approach: remove softmax entirely, fundamentally eliminating the $O(n^2)$ complexity.

Core Idea: Remove Softmax, Reorder Computation

Standard Attention computes:

$$\text{Attn}(Q,K,V) = \underbrace{\text{softmax}(QK^T / \sqrt{d})}_{n \times n} \cdot V$$

You must compute $QK^T$ (an $n \times n$ matrix) first, then apply softmax, then multiply by $V$. Softmax is a row-wise normalization, and it blocks matrix-multiplication associativity: you cannot compute $K^T V$ first, because softmax sits in between.

Linear Attention (Katharopoulos et al., 2020) had a key insight: replace softmax with a feature map $\phi$, enabling:

$$\text{Attn}(Q,K,V) = \phi(Q) \cdot \underbrace{(\phi(K)^T \cdot V)}_{d \times d}$$

$\phi(K)^T V$ is a $d \times d$ matrix ($d$ is the head dimension, typically 64-128), independent of the sequence length $n$. When $n \gg d$, this reduces complexity from $O(n^2 d)$ to $O(nd^2)$: true linear complexity.
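The reordering is plain matrix associativity, which is easy to verify numerically (hypothetical ReLU feature map; normalization omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 64
phi = lambda x: np.maximum(x, 0.0)  # a simple non-negative feature map
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

slow = (phi(Q) @ phi(K).T) @ V   # builds an n×n intermediate: O(n²d)
fast = phi(Q) @ (phi(K).T @ V)   # builds a d×d intermediate: O(nd²)
assert np.allclose(slow, fast)   # same result, much smaller intermediate
```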

[Figure: standard attention, softmax(Q·Kᵀ/√d)·V, must materialize the n×n matrix QKᵀ first (for n = 512: 512×512 = 262,144 entries, O(n²d)); linear attention, φ(Q)·(φ(K)ᵀ·V), computes the d×d matrix φ(K)ᵀV first (64×64 = 4,096 entries, O(nd²)), a 64× smaller intermediate.]

Why φ Unlocks Associativity

Softmax blocks associativity because it does two things: (1) exp() ensures non-negativity, (2) row-wise normalization (dividing by the row sum). The normalization couples all elements within the same row: computing $\text{softmax}(q_i^T k_1)$ requires knowing $q_i^T k_2, q_i^T k_3, \ldots, q_i^T k_n$. You cannot compute a single element independently; you must first compute the full row of $QK^T$ ($n \times n$).

φ's strategy: replace the coupled normalization with independent element-wise transforms. Replace the softmax kernel $\text{sim}(q, k) = \exp(q^T k) / Z$ with a kernel function decomposition:

$$\text{sim}(q, k) = \phi(q)^T \phi(k)$$

φ acts independently on each $q$ and $k$ (no need to know other keys' values), and the dot product naturally satisfies associativity. Expanding the $i$-th output (with normalization):

$$o_i = \frac{\phi(q_i)^T \overbrace{\sum_j \phi(k_j) v_j^T}^{d \times d,\ \text{independent of } i}}{\phi(q_i)^T \underbrace{\sum_j \phi(k_j)}_{d \times 1,\ \text{independent of } i}}$$

The sums in both numerator and denominator are independent of the query position $i$: they can be accumulated once over the sequence, after which each query costs only an $O(d^2)$ matrix-vector product rather than attending over all $n$ positions.

Choosing φ requires two properties: (1) non-negative output, ensuring attention weights are non-negative; (2) element-wise independence, i.e. it must not couple elements within a row the way softmax does. Common choices include $\phi(x) = \text{elu}(x) + 1$ (Katharopoulos 2020's original choice) and $\phi(x) = \text{ReLU}(x)$. Theoretically, Random Fourier Features can approximate softmax's exp kernel, but they are computationally expensive.
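For example, $\text{elu}(x) + 1$ is strictly positive and purely element-wise (a one-line sketch; in practice it is applied elementwise to each feature vector):

```python
import math

def elu_plus_one(x):
    # elu(x) + 1 equals x + 1 for x > 0 and e^x for x <= 0:
    # always positive, and each element is transformed independently.
    return x + 1.0 if x > 0 else math.exp(x)
```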

Why φ is fundamentally weaker than softmax: softmax’s exp + normalization naturally produces a sharp, sparse attention distribution (large values are exponentially amplified, small values are suppressed to near-zero), letting the model focus on the most relevant few tokens. Simple φ functions lack this “winner-take-all” effect — all keys contribute roughly equally. This is the fundamental motivation behind subsequent work (RetNet/DeltaNet/GDN) adding decay, delta rule, and gating — compensating for φ’s inability to replicate softmax’s selectivity.

RNN Form: Fixed-Size State

Without softmax, Linear Attention can be written as an RNN recurrence:

$$S_t = S_{t-1} + \phi(k_t) v_t^T, \quad o_t = \phi(q_t) \cdot S_t$$

$S_t$ is a $d \times d$ state matrix that compresses all history. During inference, no KV cache (which grows with sequence length) is needed: just this fixed-size state.
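The recurrence can be checked against the causal parallel form directly (toy sizes, unnormalized for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 4
phi = lambda x: np.maximum(x, 0.0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# RNN form: a fixed d×d state, O(1) per token at inference.
S = np.zeros((d, d))
outs = []
for t in range(n):
    S = S + np.outer(phi(K[t]), V[t])   # S_t = S_{t-1} + φ(k_t) v_tᵀ
    outs.append(phi(Q[t]) @ S)          # o_t = φ(q_t) S_t

# Causal parallel form gives the same last output:
# o_t = φ(q_t) · Σ_{j≤t} φ(k_j) v_jᵀ
o_last = phi(Q[-1]) @ sum(np.outer(phi(K[j]), V[j]) for j in range(n))
assert np.allclose(outs[-1], o_last)
```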

This form is strikingly similar to State Space Models (SSM):

| | Linear Attention | SSM / Mamba |
|---|---|---|
| State update | $S_t = S_{t-1} + k_t v_t^T$ | $h_t = Ah_{t-1} + Bx_t$ |
| Output | $o_t = q_t S_t$ | $y_t = Ch_t$ |
| State size | $d \times d$ matrix (fixed) | $N$-dim vector (fixed) |
| Inference complexity | $O(1)$ per token | $O(1)$ per token |

They are fundamentally the same: compress history into a fixed-size state, update via linear recurrence. Mamba-2’s SSD framework rigorously proved this equivalence — structured SSMs are a form of linear attention with decay.

The Cost: Softmax Isn’t Free to Remove

Softmax provides a sparse, sharp attention distribution — it lets the model focus on a few key tokens (e.g., precisely locating a specific name in a long document). Removing softmax makes the distribution “flat,” reducing the difference in contribution across tokens.

This causes pure Linear Attention to be far weaker than standard Attention on tasks requiring precise retrieval (copying, in-context learning). This is fundamentally the same limitation as SSM/Mamba — a fixed-size state cannot precisely remember specific information from arbitrarily long sequences. See Hybrid Architectures: Why Pure SSM Isn’t Enough.

From Accumulation to Error Correction: Evolution of State Updates

Basic Linear Attention only accumulates ($S_t = S_{t-1} + k_t v_t^T$) and never forgets. Subsequent work focuses on making state updates smarter:

[Timeline: 2020 Basic Linear Attention → 2023 RetNet (exponential decay) → 2024 DeltaNet (delta rule) → 2024 GDN (gated + delta rule). Basic Linear Attention (Katharopoulos et al.): update $S_t = S_{t-1} + k_t v_t^T$, output $o_t = q_t S_t$; removing softmax yields $O(n)$ with a fixed $d \times d$ state, but the state only accumulates and never forgets, so attention becomes "flat".]

Evolution:

  1. Basic Linear Attention (2020): Remove softmax, establish the RNN form. But the state only accumulates, leaving a large performance gap
  2. RetNet (2023, MSR): Add an exponential decay factor $\gamma$ so old information automatically fades each step. But $\gamma$ is a fixed hyperparameter
  3. DeltaNet (2024, Yang et al.): Introduce the delta rule: instead of blind accumulation, first query what's already in the state ($k_t^T S_{t-1}$), then write only the delta ("what's missing"). This is an online learning rule for associative memory
  4. Gated Delta Net (GDN) (2024, ICLR 2025): Combines two complementary mechanisms: $\alpha_t$ gating (from Mamba2) for selective forgetting plus the delta rule for targeted writing. The paper shows GDN outperforming Mamba2 on language modeling, in-context retrieval, and long-context tasks
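A schematic of the two update rules (the exact parameterizations in the DeltaNet/GDN papers differ; the scalar `alpha` and `beta` here stand in for learned per-token values):

```python
import numpy as np

def delta_update(S, k, v, beta):
    # Delta rule: query what the state already returns for k, write the gap.
    pred = k @ S                       # k_tᵀ S_{t-1}: current memory for k
    return S + beta * np.outer(k, v - pred)

def gated_delta_update(S, k, v, alpha, beta):
    # GDN (schematic): gated forgetting (Mamba2-style decay), then delta-write.
    S = alpha * S
    return S + beta * np.outer(k, v - k @ S)

# With beta=1 and a unit-norm key, one delta write stores v exactly:
rng = np.random.default_rng(0)
d = 8
S = rng.normal(size=(d, d))
k = np.zeros(d); k[0] = 1.0            # unit-norm key
v = rng.normal(size=(d,))
S2 = delta_update(S, k, v, beta=1.0)
assert np.allclose(k @ S2, v)
```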

Relationship to Mamba

Linear Attention and SSM/Mamba are two formulations of the same idea — the former starts from Attention (remove softmax), the latter from control theory (state space recurrence), converging at the same point.

GDN directly embodies this convergence: its paper title reads “Improving Mamba2 with Delta Rule” — combining Mamba2’s gating mechanism with linear attention’s delta rule. GDN’s hybrid architecture (alternating GDN layers + sliding window attention layers) follows exactly the same pattern as Jamba (alternating Mamba layers + attention layers).

The Linear Attention family and SSM/Mamba family are rapidly converging. Understanding one means understanding the core ideas of the other. For detailed SSM/Mamba principles, see State Space Models and Mamba. For hybrid architecture design, see Hybrid Architectures.

Comparison Summary

| Method | Complexity | KV Cache | Core Idea |
|---|---|---|---|
| Full MHA | O(n²d) | 2 × n_heads × d_head × seq | Each head has independent Q/K/V, full attention |
| GQA | O(n²d) | 2 × n_kv_heads × d_head × seq | Multiple Q heads share KV heads, reduced KV cache |
| Sliding Window | O(nwd) | 2 × n_heads × d_head × w | Each token attends to previous w tokens; stacking expands the receptive field |
| Cross Attention | O(n·m·d) | 2 × n_heads × d_head × m (encoder) | Q from decoder, KV from encoder/vision |
| MLA | O(n²d) | latent_dim × seq (tiny) | Low-rank compression of KV cache, store compressed latent |
| Hybrid | mixed | varies by layer | Mix different attention types (full + SWA / SSM) |

Selection guide:

  • Long context + low latency → Sliding Window (Mistral approach)
  • Maximum KV cache compression → MLA (DeepSeek approach)
  • Cross-modal / Encoder-Decoder → Cross Attention
  • Linear complexity + fixed state → Linear Attention / GDN (replaces KV cache entirely)
  • Balanced approach → Hybrid (Gemma 2 approach, or alternating GDN + SWA)
  • General KV savings → GQA (currently the most popular compromise)