KV Cache Fundamentals
Updated 2026-04-13
Introduction: KV Cache Is Key to Inference Acceleration
In the previous article, we learned that LLM inference is divided into Prefill and Decode stages. The Decode stage is autoregressive token-by-token generation — each step produces only one new token.
Here lies a critical issue: in standard Self-Attention, each new token’s Query needs to compute dot products with the Keys of all preceding tokens in the sequence, then compute a weighted sum using all preceding tokens’ Values. If we recompute all tokens’ K and V from scratch at every step, the computation grows quadratically with generation length — this is unacceptable.
The core idea of KV Cache is simple: cache the already-computed Key and Value vectors, and at each step only compute K and V for the new token, then append them to the cache. This reduces the Decode stage’s total computational complexity from $O(n^2 \cdot d^2)$ to $O(n \cdot d^2)$ (for the linear layers), making it standard practice in modern LLM inference systems.
The Problem: Wasted Computation Without Cache
Naive Autoregressive Generation
Consider a Decode process generating a sequence of length $n$. Without KV Cache, at step $t$, when generating the next token, the model needs to:
- Feed all $t$ tokens so far into the Transformer
- For each layer, compute $Q$, $K$, $V$ for all $t$ tokens
- Execute the full Attention computation
- Take only the last position’s output to predict the next token
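To make the waste concrete, here is a minimal numpy sketch of the naive, cache-free approach for a single attention layer (toy dimension, random weights, purely illustrative):

```python
import numpy as np

d = 8  # toy model dimension, for illustration only
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

def naive_decode_step(X):
    """X: (t, d) -- embeddings of ALL t tokens so far, re-fed at every step."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # recomputes K and V for every token
    scores = (Q[-1:] @ K.T) / np.sqrt(d)  # yet only the last Query is ever used
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # (1, d) output for next-token prediction
```

The first $t-1$ rows of `K` and `V` come out identical on every call, recomputed for nothing.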
In the total computation, the linear projections alone require:

$$\sum_{t=1}^{n} 3 \cdot t \cdot d^2 = 3 d^2 \cdot \frac{n(n+1)}{2} = O(n^2 \cdot d^2)$$

where the factor $3$ corresponds to the computation for the Q, K, V projection matrices (each token passes through $W_Q$, $W_K$, $W_V$).
Where Is the Waste?
Key observation: the K and V of the first $t-1$ tokens at step $t$ are identical to those at step $t-1$. Since these tokens’ inputs haven’t changed (the Decoder uses causal masking, so earlier tokens’ representations are unaffected by later tokens), their K and V projections produce exactly the same results.
In other words, every step redundantly recomputes Keys and Values that have already been calculated — this is pure computational waste.
(The only exception is the first step, which has nothing to reuse and hence no redundant computation.)
KV Cache Mechanism: Cache and Append
Core Idea
The KV Cache mechanism is very straightforward:
- Prefill stage: Process the full prompt, compute K and V for all tokens, and save the results to cache
- Each Decode step:
- Feed only the 1 new token into the model
- Compute the new token’s $q_t$, $k_t$, $v_t$ (single vectors, not matrices)
- Append $k_t$ and $v_t$ to the end of the cache
- Compute Attention between $q_t$ and the full K Cache
- Compute weighted sum over the full V Cache using the Attention weights
Mathematical Formulation
At step $t$, with input hidden state $x_t$:
- New token’s Query: $q_t = x_t W_Q$
- New token’s Key: $k_t = x_t W_K$
- New token’s Value: $v_t = x_t W_V$

Cache update:

$$K_{\text{cache}} \leftarrow \operatorname{concat}(K_{\text{cache}}, k_t), \qquad V_{\text{cache}} \leftarrow \operatorname{concat}(V_{\text{cache}}, v_t)$$

Attention computation:

$$\operatorname{out}_t = \operatorname{softmax}\!\left(\frac{q_t K_{\text{cache}}^\top}{\sqrt{d_k}}\right) V_{\text{cache}}$$
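These update rules translate almost line-for-line into code. Below is a minimal single-layer, single-head numpy sketch (toy dimension, random weights, no batching; an illustration of the mechanism, not a production implementation):

```python
import numpy as np

d = 8  # toy model dimension, for illustration only
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

K_cache = np.zeros((0, d))  # in a real system, Prefill fills these from the prompt
V_cache = np.zeros((0, d))

def cached_decode_step(x_t):
    """x_t: (1, d) -- only the NEW token's embedding enters the model."""
    global K_cache, V_cache
    q_t, k_t, v_t = x_t @ W_Q, x_t @ W_K, x_t @ W_V  # O(d^2), independent of t
    K_cache = np.vstack([K_cache, k_t])              # append k_t to the cache
    V_cache = np.vstack([V_cache, v_t])              # append v_t to the cache
    scores = (q_t @ K_cache.T) / np.sqrt(d)          # q_t against the FULL K cache
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over all cached positions
    return weights @ V_cache                         # weighted sum over the V cache
```

Each call touches only one new token’s projections, so the per-step cost of the linear layers no longer depends on how much has already been generated.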
Computation Comparison
| | Without KV Cache | With KV Cache |
|---|---|---|
| Linear projections at step $t$ | Recompute all $t$ tokens | Only 1 new token |
| Linear projection FLOPs/step | $O(t \cdot d^2)$ | $O(d^2)$ |
| Total linear projection FLOPs | $O(n^2 \cdot d^2)$ | $O(n \cdot d^2)$ |
| Attention FLOPs/step | $O(t^2 \cdot d)$ | $O(t \cdot d)$ |
| Cost | No extra memory | Must cache K, V |
KV Cache trades memory for computation, reducing the total linear projection computation from $O(n^2 \cdot d^2)$ to $O(n \cdot d^2)$.
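Plugging in concrete (hypothetical) numbers, say a generation length of $n = 4096$ at model dimension $d = 4096$, makes the gap tangible:

```python
n, d = 4096, 4096  # hypothetical generation length and model dimension

# counting 2 FLOPs per multiply-accumulate, 3 projections (Q, K, V) per token
flops_no_cache = sum(2 * 3 * t * d * d for t in range(1, n + 1))  # ~3 n^2 d^2
flops_cached   = 2 * 3 * n * d * d                                # ~3 n d^2

print(f"{flops_no_cache:.2e} vs {flops_cached:.2e} FLOPs, "
      f"ratio = {flops_no_cache / flops_cached:.0f}x")  # ratio = (n + 1) / 2, about 2048x
```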
Deep Dive: Why Only Cache K and V?
We’ve explained how KV Cache works, but a natural question arises: Attention involves Q, K, and V — why only cache K and V, not Q? Furthermore, why not cache the attention scores directly?
Why Not Cache Q?
The key insight is that Q is “fire-and-forget,” while K and V are “persistently needed.”
At decode step $t$, the new token’s query $q_t$ has exactly one purpose: compute dot products with all cached Keys to produce attention scores. Once the current step’s attention is complete, $q_t$ is never used by any subsequent step.
In contrast, look at how K and V are used:
- $k_t$ is needed by all future steps: the new queries at steps $t+1, t+2, \dots$ all need to compute dot products with $k_t$
- $v_t$ is needed by all future steps: every future step’s attention weighted sum must include $v_t$
| Object | Cacheable? | Reason |
|---|---|---|
| K | Yes | Every future step’s new Q needs dot products with all historical Keys |
| V | Yes | Every future step’s attention weighted sum needs all historical Values |
| Q | No | Each step only uses the current token’s Q; fire-and-forget, no cross-step reuse |
An analogy: K and V are notes on a blackboard — every new student (new token) who enters must read the entire board. Q is each student’s own question — once asked, it’s done; future students don’t need to see your question.
Why Not Cache Attention Scores?
Another intuition might be: since step $t$ already computed the dot product scores between $q_t$ and $k_1, \dots, k_t$, can we store these scores for later use?

The answer is no, because the query is different at every step:

- Step $t$: scores are $q_t \cdot k_1, \; q_t \cdot k_2, \; \dots, \; q_t \cdot k_t$
- Step $t+1$: scores are $q_{t+1} \cdot k_1, \; q_{t+1} \cdot k_2, \; \dots, \; q_{t+1} \cdot k_{t+1}$

Although $k_1, \dots, k_t$ haven’t changed, $q_{t+1} \neq q_t$, so every score must be recomputed with the new query. The attention score is a bivariate function of Q and K — the K side is constant (hence we cache K), but the Q side is entirely new each step, making the function values (scores) completely non-reusable.
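A few lines of numpy make this concrete (toy sizes, random vectors, names mine):

```python
import numpy as np

rng = np.random.default_rng(1)
K = rng.standard_normal((3, 4))     # cached keys k_1..k_3: identical at both steps
q_t = rng.standard_normal((1, 4))   # query at step t
q_t1 = rng.standard_normal((1, 4))  # query at step t+1: a brand-new vector

print(q_t @ K.T)   # scores at step t
print(q_t1 @ K.T)  # step t+1: same K, new q, so every score differs
```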
The Recursive Structure of Decode: A Dynamic Programming Perspective
Having understood “why only cache K and V,” we can appreciate KV Cache at a higher level — it is essentially memoization for autoregressive inference.
In a multi-layer Transformer, the complete data flow at decode step $t$ is:
New token x_t (embedding)
│
▼
┌─ Layer 1 ──────────────────────────────────────────────────┐
│ q_t, k_t, v_t = x_t · W_Q, W_K, W_V │
│ KV Cache[1] ← append(k_t, v_t) │
│ attn_out = softmax(q_t · K_cache^T / √d_k) · V_cache │
│ h_t^(1) = FFN(out_proj(attn_out) + x_t) │
└────────────────────────────────────────────────────────────┘
│ h_t^(1)
▼
┌─ Layer 2 ──────────────────────────────────────────────────┐
│ q_t, k_t, v_t = h_t^(1) · W_Q, W_K, W_V │
│ KV Cache[2] ← append(k_t, v_t) │
│ attn_out = softmax(q_t · K_cache^T / √d_k) · V_cache │
│ h_t^(2) = FFN(out_proj(attn_out) + h_t^(1)) │
└────────────────────────────────────────────────────────────┘
│ ... layer by layer ...
▼
┌─ Layer L → h_t^(L) ───────────────────────────────────────┐
└────────────────────────────────────────────────────────────┘
│
▼
lm_head(h_t^(L)) → logits → predict next token
Note the recursive dependency: layer $\ell$’s K/V come from the previous layer’s output:

$$k_t^{(\ell)} = h_t^{(\ell-1)} W_K^{(\ell)}, \qquad v_t^{(\ell)} = h_t^{(\ell-1)} W_V^{(\ell)}$$

And $h_t^{(\ell-1)}$ itself requires the attention computation at layer $\ell-1$. This means:
- Layer 1’s K/V can be computed directly from token embeddings
- But layer 2 and beyond depend on the previous layer’s attention output — this is a layer-by-layer recurrence
This is also why the Prefill stage cannot skip attention: even if our goal is to “fill the KV Cache,” deeper layers’ K/V can only be obtained through complete attention computation at shallower layers. Prefill is fundamentally a full forward pass, and the KV Cache is a byproduct of this process.
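The sketch below makes the layer-by-layer recurrence explicit: one `forward` function serves both Prefill (many tokens) and Decode (one token), and in both cases each layer must run attention before the next layer’s K/V can exist. It is deliberately simplified: residual connections, FFN, output projection, and normalization are omitted:

```python
import numpy as np

NUM_LAYERS, d = 2, 8  # toy sizes
rng = np.random.default_rng(0)
layers = [{w: rng.standard_normal((d, d)) for w in ("W_Q", "W_K", "W_V")}
          for _ in range(NUM_LAYERS)]
kv_cache = [{"K": np.zeros((0, d)), "V": np.zeros((0, d))} for _ in range(NUM_LAYERS)]

def attend(Q, K, V):
    t_q, t_k = Q.shape[0], K.shape[0]
    scores = (Q @ K.T) / np.sqrt(d)
    causal = np.tril(np.ones((t_q, t_k)), k=t_k - t_q)  # queries sit at the cache's tail
    scores = np.where(causal > 0, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def forward(H):
    """H: (t, d). Prefill passes the whole prompt (t > 1); Decode passes 1 token."""
    for layer, cache in zip(layers, kv_cache):
        cache["K"] = np.vstack([cache["K"], H @ layer["W_K"]])  # fill this layer's cache
        cache["V"] = np.vstack([cache["V"], H @ layer["W_V"]])
        # the next layer's K/V can only come from THIS layer's attention output:
        H = attend(H @ layer["W_Q"], cache["K"], cache["V"])
    return H  # final hidden states; a full model would apply lm_head here

forward(rng.standard_normal((2, d)))  # Prefill: a 2-token prompt fills every layer's cache
forward(rng.standard_normal((1, d)))  # Decode: 1 new token appends one row per layer
```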
This recursive structure naturally maps to a Dynamic Programming (DP) framework:
| DP Concept | KV Cache Equivalent |
|---|---|
| Subproblem | K/V vectors at each (layer, position) |
| Recurrence | $k_t^{(\ell)}, v_t^{(\ell)} = f\big(h_t^{(\ell-1)}\big)$; deeper layers depend on shallower attention outputs |
| No aftereffect | Causal mask guarantees: position $i$’s K/V depend only on tokens $1 \dots i$, unaffected by future tokens |
| Memoization | Store computed K/V in cache; subsequent steps look up directly |
| Complexity optimization | Total linear projections reduced from $O(n^2 \cdot d^2)$ to $O(n \cdot d^2)$ |
The no-aftereffect property is the critical prerequisite that makes KV Cache valid: the causal mask ensures that any historical position’s K/V, once computed, never changes — regardless of what new tokens are generated in the future. This is exactly analogous to DP’s property that “solved subproblems are unaffected by subsequent decisions.”
Interactive Demo
The demo below simulates a 5-step Decode process. The initial cache contains K/V from the Prefill stage for 2 tokens. At each step, observe:
- How the new token’s Q vector computes attention scores against the entire K Cache
- How the K Cache and V Cache grow by one row each step (highlighted in green)
Current token t₃’s Query vector takes dot products with all Keys in the KV Cache, then the new K and V are appended to the cache, growing it from 2 rows to 3 rows.
Memory Footprint Analysis
KV Cache is one of the main sources of memory consumption in LLM inference. Understanding its memory formula is crucial.
Single Layer, Single Head KV Cache
For one Transformer layer with one attention head, caching a sequence of length $S$:

$$\text{Memory} = 2 \times S \times d_{\text{head}} \times \text{bytes per element}$$

where the factor of 2 accounts for one copy each of K and V.
Full Model KV Cache
For the complete model, let $L$ be the number of Transformer layers, $H$ the number of attention heads, and $d_{\text{head}}$ the per-head dimension ($d = H \times d_{\text{head}}$):

$$\text{KV Cache} = 2 \times L \times H \times d_{\text{head}} \times S \times \text{bytes per element}$$

Since $H \times d_{\text{head}} = d$, this simplifies to:

$$\text{KV Cache} = 2 \times L \times d \times S \times \text{bytes per element}$$
Concrete Numbers
For LLaMA-2 7B ($L = 32$, $d = 4096$, FP16, i.e. 2 bytes per element):

| Sequence length $S$ | KV Cache size |
|---|---|
| 512 | 256 MiB |
| 2048 | 1 GiB |
| 4096 | 2 GiB |
For LLaMA-2 70B ($L = 80$, $d = 8192$, FP16):

| Sequence length $S$ | KV Cache size |
|---|---|
| 2048 | 5 GiB |
| 4096 | 10 GiB |
Note: The above uses the standard MHA formula (all $H$ heads cache their K/V independently). In practice, LLaMA-2 70B uses GQA ($H = 64$, $G = 8$), so the actual KV Cache is only $1/8$ of the table above (approximately 1.25 GiB at $S = 4096$). See the GQA/MQA section below for details.
As you can see, for large models and long sequences, KV Cache memory consumption can match or even exceed the model parameters themselves. This is why KV Cache management and compression is a core direction in inference optimization.
Batch Scenario
When serving multiple concurrent requests, KV Cache grows linearly with batch size $B$:

$$\text{KV Cache} = 2 \times B \times L \times d \times S \times \text{bytes per element}$$

For example, LLaMA-2 7B with $B = 32$ and $S = 2048$: total KV Cache $= 32 \times 1\,\text{GiB} = 32\,\text{GiB}$. The KV Cache alone can exhaust an entire GPU’s memory.
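These numbers are easy to reproduce; a small helper (the function name is mine, not from any library) evaluates the formula above:

```python
def kv_cache_gib(L, d, S, B=1, bytes_per_el=2):
    """2 (K and V) x batch x layers x seq len x model dim x precision bytes, in GiB."""
    return 2 * B * L * S * d * bytes_per_el / 2**30

print(kv_cache_gib(L=32, d=4096, S=2048))        # LLaMA-2 7B: 1.0 GiB
print(kv_cache_gib(L=80, d=8192, S=4096))        # 70B, MHA formula: 10.0 GiB
print(kv_cache_gib(L=32, d=4096, S=2048, B=32))  # 7B at batch 32: 32.0 GiB
```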
Cache Management: Introduction to PagedAttention
Traditional KV Cache implementations face a serious problem: memory fragmentation.
Problems with the Traditional Approach
In the simplest implementation, the system pre-allocates cache space for the maximum sequence length per request. For example, with a max length of 2048, even if only 100 tokens are actually used, space for 2048 positions must be reserved. This leads to:
- Internal fragmentation: Pre-allocated but unused space is wasted
- External fragmentation: Space freed after requests complete is hard for new requests to fully utilize
- Low memory utilization: Effective utilization may be only 20-40%
PagedAttention
PagedAttention (from vLLM, paper “Efficient Memory Management for Large Language Model Serving with PagedAttention”, arXiv 2309.06180) borrows the virtual memory paging concept from operating systems:
- Divide KV Cache space into fixed-size Blocks (similar to memory pages, typically 16 or 32 tokens)
- Each request maintains a Block Table (similar to a page table), mapping logical blocks to physical blocks
- Allocate blocks on demand: a new block is only allocated when the current block is full
- When a request finishes, its blocks can be reclaimed and assigned to new requests
Key advantages:
- Eliminates fragmentation: Fixed-size block allocation and reclamation produces no fragmentation
- Dynamic growth: No need to pre-allocate maximum length; grows on demand
- High memory utilization: Effective utilization can approach 100%
- Supports complex scheduling: For example, multiple beam search candidates can share prefix KV Cache blocks
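To illustrate the bookkeeping, here is a toy model of the idea (my own simplification, not vLLM’s actual implementation); a block table is just a list of physical block IDs per request:

```python
BLOCK_SIZE = 16  # tokens per block; real systems typically use 16 or 32

class PagedKVCache:
    """Toy allocator mapping each request's logical positions to physical blocks."""
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))  # shared free pool
        self.block_tables = {}  # request id -> list of physical block ids ("page table")
        self.num_tokens = {}    # request id -> tokens written so far

    def append_token(self, req_id):
        """Reserve a slot for one new token's K/V; allocate a block only when needed."""
        table = self.block_tables.setdefault(req_id, [])
        n = self.num_tokens.get(req_id, 0)
        if n % BLOCK_SIZE == 0:                   # current block is full (or first token)
            table.append(self.free_blocks.pop())  # on-demand allocation, nothing pre-reserved
        self.num_tokens[req_id] = n + 1
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE  # (physical block, offset in block)

    def release(self, req_id):
        """Request finished: its blocks return to the pool for new requests."""
        self.free_blocks.extend(self.block_tables.pop(req_id))
        del self.num_tokens[req_id]
```

Because every allocation and reclamation happens in whole fixed-size blocks, the pool never fragments, and a request holds at most one partially filled block.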
Continuous Batching
Another important technique that works alongside PagedAttention is Continuous Batching:
- Traditional Static Batching: All requests in a batch must wait for the longest one to finish before new requests can be processed
- Continuous Batching: When a request finishes, a new request is immediately inserted to fill the slot, without waiting for other requests
This keeps GPU utilization consistently high, significantly improving inference throughput. Mainstream inference engines like vLLM and TensorRT-LLM all adopt the PagedAttention + Continuous Batching combination.
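A toy step-count simulation (entirely made up for illustration; real schedulers are far more involved) shows why slot reuse matters. With 2 slots and requests needing 8, 1, 1, 1, 1, 1 decode steps, static batching takes 10 steps while continuous batching takes 8:

```python
from collections import deque

def total_steps(jobs, slots, continuous):
    """jobs: decode steps each request needs; returns total decode steps on the GPU."""
    queue, active, t = deque(jobs), [], 0
    while queue or active:
        if continuous or not active:               # static batching refills only when empty
            while queue and len(active) < slots:
                active.append(queue.popleft())     # a freed slot is reused immediately
        active = [s - 1 for s in active if s > 1]  # run one decode step; drop finished jobs
        t += 1
    return t

print(total_steps([8, 1, 1, 1, 1, 1], slots=2, continuous=False))  # static: 10
print(total_steps([8, 1, 1, 1, 1, 1], slots=2, continuous=True))   # continuous: 8
```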
Relationship with GQA/MQA: Another Way to Reduce KV Cache
The KV Cache memory bottleneck has driven researchers to seek solutions at the model architecture level.
Revisiting the KV Cache Formula
$$\text{KV Cache} = 2 \times L \times H \times d_{\text{head}} \times S \times \text{bytes per element}$$

where $H \times d_{\text{head}}$ corresponds to the KV dimensions across all attention heads. If we can reduce the number of KV heads that need caching, we directly shrink the KV Cache.
MQA (Multi-Query Attention)
MQA has all attention heads share a single set of K and V, keeping only Q multi-headed. The KV Cache shrinks to:

$$\text{KV Cache}_{\text{MQA}} = 2 \times L \times 1 \times d_{\text{head}} \times S \times \text{bytes per element}$$

This is $H$ times smaller compared to MHA (e.g., 32x smaller when $H = 32$).
GQA (Grouped-Query Attention)
GQA is a compromise between MHA and MQA: the $H$ Q heads are divided into $G$ groups, each group sharing one set of K and V. The KV Cache becomes:

$$\text{KV Cache}_{\text{GQA}} = 2 \times L \times G \times d_{\text{head}} \times S \times \text{bytes per element}$$

This is $H/G$ times smaller compared to MHA. LLaMA-2 70B uses GQA ($H = 64$, $G = 8$), reducing KV Cache to $1/8$ of the original.
Comparison
| Method | KV Heads | Relative KV Cache Size | Quality Impact |
|---|---|---|---|
| MHA | $H$ | $1\times$ (baseline) | Best |
| GQA | $G$ ($1 < G < H$) | $G/H$ | Minimal loss |
| MQA | $1$ | $1/H$ | Slight loss |
GQA significantly reduces KV Cache with virtually no impact on model quality, and has been adopted by mainstream models including LLaMA-2/3 and Mistral. For a more detailed introduction to GQA/MQA, see the MQA and GQA article.
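Generalizing the earlier memory helper with an explicit KV-head count (again an illustrative function of my own) reproduces the 70B numbers from the memory section:

```python
def kv_cache_gib(L, n_kv_heads, d_head, S, bytes_per_el=2):
    """2 (K and V) x layers x KV heads x head dim x seq len x precision bytes, in GiB."""
    return 2 * L * n_kv_heads * d_head * S * bytes_per_el / 2**30

# LLaMA-2 70B at S = 4096: H = 64 query heads, d_head = 128
print(kv_cache_gib(80, 64, 128, 4096))  # MHA: 10.0 GiB
print(kv_cache_gib(80, 8, 128, 4096))   # GQA (G = 8): 1.25 GiB
print(kv_cache_gib(80, 1, 128, 4096))   # MQA: ~0.16 GiB
```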
Summary
| Concept | Description |
|---|---|
| KV Cache | Caches computed K and V vectors to avoid redundant computation during Decode |
| Cost without Cache | Recomputes all KV each step; total computation $O(n^2 \cdot d^2)$ |
| Speedup with Cache | Computes only 1 new token’s KV per step and appends; total $O(n \cdot d^2)$ |
| Memory formula | $2 \times B \times L \times d \times S \times \text{bytes per element}$ |
| PagedAttention | Borrows virtual memory paging to eliminate memory fragmentation and improve utilization |
| GQA/MQA | Reduces KV head count at the architecture level, directly shrinking cache |
| Core tradeoff | KV Cache trades memory for computation — a classic space-time tradeoff |
Core intuition: KV Cache is like an ever-growing “notebook.” During autoregressive generation, each new token only needs to add one line of its own K and V at the end of the notebook, then flip through the entire notebook to decide what to attend to. Without this notebook, every step would require recomputing all previous tokens’ K and V from scratch — like having to review everything from the very first lesson before every class, which is obviously a massive waste.