Attention Variants: From Sliding Window to MLA
Updated 2026-04-06
Standard Multi-Head Attention faces three major bottlenecks: computational complexity, KV cache memory consumption, and long-context handling. We previously saw how GQA/MQA reduces cache by sharing KV heads — here we introduce other important attention variants.
Four optimization directions:
- Sparsification: Reduce the range each token needs to attend to (Sliding Window Attention)
- KV compression: Reduce the amount of information stored at each position (MLA)
- Linearization: Remove softmax, reducing complexity to $O(n)$ (Linear Attention → GDN)
- Hybrid architectures: Use different strategies at different layers, combining their strengths (Hybrid Attention)
Sliding Window Attention
Core idea: each token only attends to the preceding $w$ tokens, rather than the entire sequence.
This reduces complexity from $O(n^2)$ to $O(n \cdot w)$. Does this lose global information? Not really: after stacking multiple layers, the effective receptive field at layer $l$ is roughly $l \times w$. For example, Mistral 7B has 32 layers with window $w = 4096$, giving a theoretical receptive field of roughly $32 \times 4096 \approx 131\text{K}$ tokens.
Adopters: Mistral 7B ($w = 4096$), Mixtral 8x7B, Gemma 2 (sliding window on alternating layers).
Sliding Window also combines perfectly with Flash Attention — computation within the window uses Flash Attention’s tiling strategy efficiently, while everything outside the window is simply skipped.
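As a concrete sketch (illustrative only, not taken from any particular implementation), here is a minimal NumPy construction of a sliding-window causal mask; `n` and `w` are arbitrary example values:

```python
import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean mask: position i may attend to positions j with i - w < j <= i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

print(sliding_window_mask(n=8, w=3).astype(int))
# Each row has at most w ones: causal attention restricted to the last w positions,
# so per-token attention cost is O(w) instead of O(n).
```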
Hybrid Attention
Not all layers need full attention — mixing different types of attention lets each type play to its strengths.
Common approaches:
- Gemma 2: Even layers use full attention + odd layers use sliding window attention
- Jamba (AI21): Alternating Attention layers + Mamba (SSM) layers in a 1:3 ratio
- Command-R (Cohere): Some layers use full attention + others use local attention
The key design question: where to place full attention layers? What ratio to use? Experience shows that layers closer to the output need global information more (global layers go on top), while layers closer to the input can get by with local patterns alone.
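As a rough illustration of such layer schedules (a toy sketch with made-up helper names, not any model's actual configuration):

```python
def alternating_schedule(num_layers: int) -> list[str]:
    # Gemma-2-style alternation: full attention on even layers, sliding window on odd.
    return ["full" if i % 2 == 0 else "swa" for i in range(num_layers)]

def top_heavy_schedule(num_layers: int, num_full: int) -> list[str]:
    # Place the few full-attention layers near the output, local attention below.
    return ["swa"] * (num_layers - num_full) + ["full"] * num_full

print(alternating_schedule(6))   # ['full', 'swa', 'full', 'swa', 'full', 'swa']
print(top_heavy_schedule(6, 2))  # ['swa', 'swa', 'swa', 'swa', 'full', 'full']
```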
Cross Attention
All the attention variants above are self-attention (Q/K/V come from the same sequence). The key difference with cross attention is that Q comes from one sequence, while K/V come from another sequence.
Typical Use Cases
Encoder-Decoder architectures (translation, summarization):
- The decoder’s hidden state serves as Q
- The encoder’s output serves as K and V
- Each decoder token “queries” the encoder’s full input to decide which parts of the input to focus on
- Adopters: T5, BART
Multimodal architectures (image-text understanding):
- Text decoder tokens serve as Q
- Image tokens from the vision encoder output serve as K and V
- Text tokens “query” image tokens, fusing visual information
- Adopters: Flamingo, LLaVA
Compared to self-attention, cross attention has different KV cache behavior: the KV comes from the encoder’s fixed output and does not grow as generation progresses.
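A minimal single-head NumPy sketch of cross attention (illustrative only; the weight matrices are random placeholders) that makes the shapes explicit: Q has the decoder length, while K/V have the fixed encoder length:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_h, enc_out, Wq, Wk, Wv):
    """Q from the decoder (n x d), K/V from the encoder output (m x d)."""
    Q = dec_h @ Wq                       # (n, d)
    K, V = enc_out @ Wk, enc_out @ Wv    # (m, d) each -- fixed, cache once
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n, m)
    return softmax(scores) @ V           # (n, d)

rng = np.random.default_rng(0)
d, n, m = 16, 4, 10                      # decoder length n, encoder length m
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(rng.normal(size=(n, d)), rng.normal(size=(m, d)), Wq, Wk, Wv)
print(out.shape)  # (4, 16) -- K/V stay (m, d) no matter how many tokens are generated
```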
Multi-Head Latent Attention (MLA)
GQA reduces cache by sharing KV heads — but can we be more aggressive? MLA’s approach is to apply low-rank compression to the KV cache, storing a low-dimensional compressed latent instead of the full K and V.
Compression process:

$$c^{KV}_t = W^{DKV} h_t, \qquad k^C_t = W^{UK} c^{KV}_t, \qquad v^C_t = W^{UV} c^{KV}_t$$

Only $c^{KV}_t$ needs to be cached (its dimension $d_c$ is much smaller than the combined dimension of K and V). During inference, $W^{UK}$ and $W^{UV}$ are used to decompress it back into keys and values. Even better, $W^{UK}$ can be absorbed into the query projection $W^Q$ (and $W^{UV}$ into the output projection $W^O$), avoiding explicit decompression.
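A shape-level sketch of the idea (a simplification: single token, no per-head split, no decoupled RoPE path; the dimensions are illustrative, not DeepSeek's):

```python
import numpy as np

d_model, d_c, n_heads, d_head = 1024, 128, 16, 64
rng = np.random.default_rng(0)
W_dkv = rng.normal(size=(d_model, d_c)) * 0.02            # down-projection
W_uk = rng.normal(size=(d_c, n_heads * d_head)) * 0.02    # up-projection for K
W_uv = rng.normal(size=(d_c, n_heads * d_head)) * 0.02    # up-projection for V

h_t = rng.normal(size=(d_model,))        # hidden state of one token
c_kv = h_t @ W_dkv                       # (d_c,) -- the only thing cached
k_t = c_kv @ W_uk                        # reconstructed on the fly (or absorbed)
v_t = c_kv @ W_uv
print(c_kv.shape, k_t.shape, v_t.shape)  # (128,) (1024,) (1024,)
# Cache per token: d_c = 128 floats instead of 2 * n_heads * d_head = 2048.
```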
Taking DeepSeek-V2 as an example (128 attention heads, $d_h = 128$, KV latent dimension $d_c = 512$): MLA's KV cache is only about 5% of standard MHA.
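A back-of-the-envelope script (a sketch with illustrative configuration values, not an official model config) for comparing per-token KV cache sizes under MHA, GQA, and MLA:

```python
def kv_cache_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes cached per token for MHA/GQA-style attention (K and V)."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem

def mla_cache_per_token(n_layers: int, d_latent: int,
                        bytes_per_elem: int = 2) -> int:
    """Bytes cached per token when only the compressed latent is stored."""
    return n_layers * d_latent * bytes_per_elem

# Illustrative numbers only; real MLA also caches a small decoupled RoPE key per token.
layers, heads, kv_heads, d_head, d_latent = 60, 128, 8, 128, 512
mha = kv_cache_per_token(layers, heads, d_head)
gqa = kv_cache_per_token(layers, kv_heads, d_head)
mla = mla_cache_per_token(layers, d_latent)
print(f"MHA: {mha/2**20:.2f} MiB/token, GQA: {gqa/2**20:.2f} MiB/token, "
      f"MLA: {mla/2**20:.2f} MiB/token")
```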
Adopters: DeepSeek-V2, DeepSeek-V3, DeepSeek-R1
Linear Attention and Gated Delta Net
The variants above all retain softmax — they just reduce computation range (SWA) or compress the cache (MLA). Linear Attention takes a more radical approach: remove softmax entirely, fundamentally eliminating the $O(n^2)$ complexity.
Core Idea: Remove Softmax, Reorder Computation
Standard Attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$

You must compute $QK^\top$ (an $n \times n$ matrix) first, then apply softmax, then multiply by $V$. Softmax is a row-wise normalization, and it blocks matrix multiplication associativity: you cannot compute $K^\top V$ before $QK^\top$, because softmax sits in between.
Linear Attention (Katharopoulos et al., 2020) had a key insight: replace softmax with a feature map $\phi$, enabling:

$$(\phi(Q)\, \phi(K)^\top)\, V = \phi(Q)\, (\phi(K)^\top V)$$

$\phi(K)^\top V$ is a $d \times d$ matrix ($d$ is the head dimension, typically 64-128), independent of sequence length $n$. When $d \ll n$, this reduces complexity from $O(n^2 d)$ to $O(n d^2)$: true linear complexity in $n$.
Why φ Unlocks Associativity
Softmax blocks associativity because it does two things: (1) exp() ensures non-negativity, (2) row-wise normalization (dividing by the row sum). The normalization couples all elements within the same row: computing $\text{softmax}(q_i^\top k_j)$ requires knowing $\sum_l \exp(q_i^\top k_l)$ over all keys. You cannot compute a single element independently; you must first compute the full row of $QK^\top$ ($O(n)$ work per query).
φ’s strategy: replace the coupled normalization with independent element-wise transforms. Replace the softmax kernel with a kernel function decomposition:

$$\text{sim}(q_i, k_j) = \exp\!\left(\frac{q_i^\top k_j}{\sqrt{d}}\right) \;\approx\; \phi(q_i)^\top \phi(k_j)$$
φ acts independently on each $q_i$ and $k_j$ (no need to know the other keys), and the dot product naturally satisfies associativity. Expanding the $i$-th output (with normalization):

$$o_i = \frac{\phi(q_i)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_j \phi(k_j)}$$

The sums in both numerator and denominator are independent of the query position: they can be precomputed once ($S = \sum_j \phi(k_j)\, v_j^\top$, a $d \times d$ matrix, and $z = \sum_j \phi(k_j)$, a $d$-dim vector), after which each query only requires an $O(d^2)$ matrix-vector product.
Choosing φ requires two properties: (1) non-negative output, which ensures attention weights are non-negative; (2) element-wise independence, so it must not couple elements within the same row the way softmax does. Common choices include $\phi(x) = \text{elu}(x) + 1$ (Katharopoulos 2020’s original choice) and $\phi(x) = \text{relu}(x)$. Theoretically, Random Fourier Features can approximate softmax’s exp kernel, but they are computationally expensive.
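A quick numerical check (a sketch of the non-causal case, using the elu+1 feature map mentioned above) that reordering the matrix products changes the cost but not the result:

```python
import numpy as np

def phi(x: np.ndarray) -> np.ndarray:
    # elu(x) + 1: a simple non-negative, element-wise feature map
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
n, d = 512, 64
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
Qf, Kf = phi(Q), phi(K)

# Quadratic order: build the (n x n) similarity matrix, then normalize row-wise.
A = Qf @ Kf.T                                   # (n, n)
out_quadratic = (A @ V) / A.sum(axis=1, keepdims=True)

# Linear order: precompute the (d x d) and (d,) summaries once, reuse per query.
S = Kf.T @ V                                    # (d, d) = sum_j phi(k_j) v_j^T
z = Kf.sum(axis=0)                              # (d,)   = sum_j phi(k_j)
out_linear = (Qf @ S) / (Qf @ z)[:, None]

print(np.allclose(out_quadratic, out_linear))   # True
```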
Why φ is fundamentally weaker than softmax: softmax’s exp + normalization naturally produces a sharp, sparse attention distribution (large values are exponentially amplified, small values are suppressed to near-zero), letting the model focus on the most relevant few tokens. Simple φ functions lack this “winner-take-all” effect — all keys contribute roughly equally. This is the fundamental motivation behind subsequent work (RetNet/DeltaNet/GDN) adding decay, delta rule, and gating — compensating for φ’s inability to replicate softmax’s selectivity.
RNN Form: Fixed-Size State
Without softmax, Linear Attention can be written as an RNN recurrence:

$$S_t = S_{t-1} + \phi(k_t)\, v_t^\top, \qquad z_t = z_{t-1} + \phi(k_t), \qquad o_t = \frac{\phi(q_t)^\top S_t}{\phi(q_t)^\top z_t}$$

$S_t$ is a $d \times d$ state matrix that compresses all history. During inference, no KV cache (which grows with sequence length) is needed — just this fixed-size state.
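A minimal sketch of this recurrent form (illustrative only, again using elu+1 as φ and a small epsilon for numerical safety):

```python
import numpy as np

def causal_linear_attention(Q, K, V,
                            phi=lambda x: np.where(x > 0, x + 1.0, np.exp(x))):
    """Recurrent (RNN) form: carry a d x d state S and a d-dim normalizer z."""
    n, d = Q.shape
    S = np.zeros((d, d))           # sum_{j<=t} phi(k_j) v_j^T, compressed history
    z = np.zeros(d)                # sum_{j<=t} phi(k_j)
    out = np.zeros_like(V)
    for t in range(n):
        kf, qf = phi(K[t]), phi(Q[t])
        S += np.outer(kf, V[t])    # state update: accumulate, never forget
        z += kf
        out[t] = (qf @ S) / (qf @ z + 1e-9)
    return out                     # O(d^2) work and memory per token, no KV cache

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(128, 64)) for _ in range(3))
print(causal_linear_attention(Q, K, V).shape)   # (128, 64)
```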
This recurrent form is strikingly similar to State Space Models (SSM):

| | Linear Attention | SSM / Mamba |
|---|---|---|
| State update | $S_t = S_{t-1} + \phi(k_t)\, v_t^\top$ | $h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t$ |
| Output | $o_t = \phi(q_t)^\top S_t$ | $y_t = C\, h_t$ |
| State size | $d \times d$ matrix (fixed) | $N$-dim vector (fixed) |
| Inference complexity | $O(d^2)$ per token | $O(N)$ per token |
They are fundamentally the same: compress history into a fixed-size state, update via linear recurrence. Mamba-2’s SSD framework rigorously proved this equivalence — structured SSMs are a form of linear attention with decay.
The Cost: Softmax Isn’t Free to Remove
Softmax provides a sparse, sharp attention distribution — it lets the model focus on a few key tokens (e.g., precisely locating a specific name in a long document). Removing softmax makes the distribution “flat,” reducing the difference in contribution across tokens.
This causes pure Linear Attention to be far weaker than standard Attention on tasks requiring precise retrieval (copying, in-context learning). This is fundamentally the same limitation as SSM/Mamba — a fixed-size state cannot precisely remember specific information from arbitrarily long sequences. See Hybrid Architectures: Why Pure SSM Isn’t Enough.
From Accumulation to Error Correction: Evolution of State Updates
Basic Linear Attention only accumulates ($S_t = S_{t-1} + \phi(k_t)\, v_t^\top$), never forgets. Subsequent work focuses on making state updates smarter:
Evolution:
- Basic Linear Attention (2020): Remove softmax, establish RNN form. But state only accumulates, large performance gap
- RetNet (2023, MSR): Add an exponential decay factor $\gamma$ ($S_t = \gamma S_{t-1} + k_t v_t^\top$), so old information automatically fades each step. But $\gamma$ is a fixed hyperparameter
- DeltaNet (2024, Yang et al.): Introduce the delta rule — instead of blind accumulation, first read what the state already associates with the current key ($k_t^\top S_{t-1}$), then write only the difference from the target value $v_t$ ("what's missing"). This is an online learning rule for associative memory
- Gated Delta Net (GDN) (2024, ICLR 2025): Combines two complementary mechanisms — gating (from Mamba2) for selective forgetting + the delta rule for targeted writing. The paper shows GDN outperforms Mamba2 on language modeling, in-context retrieval, and long-context tasks (a simplified sketch of these update rules follows this list)
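A side-by-side sketch of these update rules for a single time step (a simplified illustration: no feature map, scalar gates, and made-up values for $\gamma$, $\alpha_t$, $\beta_t$):

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
S = rng.normal(size=(d, d)) * 0.01        # current state (associative memory)
k = rng.normal(size=d)
k /= np.linalg.norm(k)
v = rng.normal(size=d)
gamma, alpha, beta = 0.95, 0.9, 0.5       # decay, gate, and write strength (illustrative)

# Basic linear attention: pure accumulation, never forgets.
S_basic = S + np.outer(k, v)

# RetNet-style: fixed exponential decay gamma applied before writing.
S_retnet = gamma * S + np.outer(k, v)

# DeltaNet: read what the state already predicts for k, write only the error.
v_pred = k @ S                            # current association for key k
S_delta = S + beta * np.outer(k, v - v_pred)

# Gated Delta Net: data-dependent gate alpha_t (Mamba2-style forgetting) + delta rule.
v_pred_gated = k @ (alpha * S)
S_gdn = alpha * S + beta * np.outer(k, v - v_pred_gated)
```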
Relationship to Mamba
Linear Attention and SSM/Mamba are two formulations of the same idea — the former starts from Attention (remove softmax), the latter from control theory (state space recurrence), converging at the same point.
GDN directly embodies this convergence: its paper title reads “Improving Mamba2 with Delta Rule” — combining Mamba2’s gating mechanism with linear attention’s delta rule. GDN’s hybrid architecture (alternating GDN layers + sliding window attention layers) follows exactly the same pattern as Jamba (alternating Mamba layers + attention layers).
The Linear Attention family and SSM/Mamba family are rapidly converging. Understanding one means understanding the core ideas of the other. For detailed SSM/Mamba principles, see State Space Models and Mamba. For hybrid architecture design, see Hybrid Architectures.
Comparison Summary
| Method | Complexity | KV Cache | Core Idea |
|---|---|---|---|
| Full MHA | O(n²d) | 2 × n_heads × d_head × seq | Each head has independent Q/K/V, full attention |
| GQA | O(n²d) | 2 × n_kv_heads × d_head × seq | Multiple Q heads share KV heads, reduce KV cache |
| Sliding Window | O(nwd) | 2 × n_heads × d_head × w | Each token attends to previous w tokens, stacking expands receptive field |
| Cross Attention | O(n·m·d) | 2 × n_heads × d_head × m (encoder) | Q from decoder, KV from encoder/vision |
| MLA | O(n²d) | latent_dim × seq (tiny) | Low-rank compression of KV cache, store compressed latent |
| Hybrid | Mixed | Varies by layer | Mix different attention types (full + SWA / SSM) |
Selection guide:
- Long context + low latency → Sliding Window (Mistral approach)
- Maximum KV cache compression → MLA (DeepSeek approach)
- Cross-modal / Encoder-Decoder → Cross Attention
- Linear complexity + fixed state → Linear Attention / GDN (replaces KV cache entirely)
- Balanced approach → Hybrid (Gemma 2 approach, or alternating GDN + SWA)
- General KV savings → GQA (currently the most popular compromise)
Learning Path: Transformer Core Mechanisms