
KV Cache Fundamentals


Updated 2026-04-13

Introduction: KV Cache Is Key to Inference Acceleration

In the previous article, we learned that LLM inference is divided into Prefill and Decode stages. The Decode stage is autoregressive token-by-token generation — each step produces only one new token.

Here lies a critical issue: in standard Self-Attention, each new token’s Query needs to compute dot products with the Keys of all preceding tokens in the sequence, then compute a weighted sum using all preceding tokens’ Values. If we recompute all tokens’ K and V from scratch at every step, the computation grows quadratically with generation length — this is unacceptable.

The core idea of KV Cache is simple: cache the already-computed Key and Value vectors, and at each step only compute K and V for the new token, then append them to the cache. This reduces the Decode stage's computational complexity from $O(N^2)$ to $O(N)$ (for the linear projections), making it standard practice in modern LLM inference systems.

The Problem: $O(N^2)$ Wasted Computation Without Cache

Naive Autoregressive Generation

Consider a Decode process generating a sequence of length $N$. Without KV Cache, at step $t$ when generating the $t$-th token, the model needs to:

  1. Feed all tt preceding tokens into the Transformer
  2. For each layer, compute $Q$, $K$, $V$ for all $t$ tokens
  3. Execute the full Attention computation
  4. Take only the last position’s output to predict the next token

In the total computation, the linear projections alone require:

$$\sum_{t=1}^{N} t \times (2 \times 3 \times d^2) = 3d^2 \times N(N+1) \approx O(N^2 d^2)$$

where the factor of 3 in $3d^2$ corresponds to the Q, K, V projection matrices (each token passes through $W^Q$, $W^K$, $W^V$, each costing $d^2$ multiply-adds), and the factor of 2 counts one multiplication plus one addition per multiply-add.
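
As a quick sanity check of this formula, the back-of-the-envelope script below (plain Python; the values d = 4096 and N = 2048 are just illustrative assumptions, not tied to any specific model) compares the total projection FLOPs with and without recomputation:

```python
# Rough estimate of Q/K/V projection FLOPs during Decode (illustrative numbers only).
d = 4096      # hidden dimension (assumed)
N = 2048      # number of generated tokens (assumed)

# Without cache: step t re-projects all t tokens -> sum over t of 2*3*t*d^2
flops_no_cache = sum(2 * 3 * t * d * d for t in range(1, N + 1))  # = 3*d^2*N*(N+1)

# With cache: each step projects only the 1 new token -> 2*3*d^2 per step
flops_with_cache = N * 2 * 3 * d * d

print(f"no cache:   {flops_no_cache:.3e} FLOPs")    # ~2.1e14
print(f"with cache: {flops_with_cache:.3e} FLOPs")  # ~2.1e11
print(f"ratio:      {flops_no_cache / flops_with_cache:.1f}x")  # (N+1)/2 = 1024.5x
```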

Where Is the Waste?

Key observation: the K and V of the first $t-1$ tokens at step $t$ are identical to those at step $t-1$. Since these tokens' inputs haven't changed (the Decoder uses causal masking, so earlier tokens' representations are unaffected by later tokens), their K and V projections produce exactly the same results.

In other words, every step redundantly recomputes Keys and Values that have already been calculated — this is pure computational waste.
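
This invariance is easy to check numerically. The sketch below (PyTorch, with a toy projection matrix and random hidden states, purely for illustration) shows that re-projecting the old tokens at step $t$ reproduces the rows already computed at step $t-1$:

```python
import torch

torch.manual_seed(0)
d_model, d_k = 8, 4                    # toy dimensions (assumed)
W_K = torch.randn(d_model, d_k)        # one fixed K projection matrix
x = torch.randn(5, d_model)            # hidden states of 5 tokens (unchanged across steps)

K_step4 = x[:4] @ W_K                  # K computed at step t-1 (4 tokens)
K_step5 = x[:5] @ W_K                  # K recomputed at step t (5 tokens)

# The first 4 rows match: recomputing them is pure waste,
# which is exactly what the KV Cache avoids.
assert torch.allclose(K_step5[:4], K_step4)
```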

[Interactive demo: generating "world" after the prompt "Hello" — at step 1 there is no redundant computation yet (Total: 2, Waste: 0). In later steps, without a cache, the already-computed K/V of earlier tokens (marked "Redundant") are recomputed alongside the single genuinely new K/V (marked "New").]

KV Cache Mechanism: Cache and Append

Core Idea

The KV Cache mechanism is very straightforward:

  1. Prefill stage: Process the full prompt, compute K and V for all tokens, and save the results to cache
  2. Each Decode step:
    • Feed only the 1 new token into the model
    • Compute the new token's $q$, $k$, $v$ (single vectors, not matrices)
    • Append $k$ and $v$ to the end of the cache
    • Compute Attention between $q$ and the full K Cache
    • Compute the weighted sum over the full V Cache using the Attention weights

Mathematical Formulation

At step $t$:

  • New token's Query: $q_t = x_t W^Q \in \mathbb{R}^{1 \times d_k}$
  • New token's Key: $k_t = x_t W^K \in \mathbb{R}^{1 \times d_k}$
  • New token's Value: $v_t = x_t W^V \in \mathbb{R}^{1 \times d_v}$

Cache update:

$$K_{\text{cache}} \leftarrow \text{concat}(K_{\text{cache}},\ k_t) \in \mathbb{R}^{t \times d_k}$$
$$V_{\text{cache}} \leftarrow \text{concat}(V_{\text{cache}},\ v_t) \in \mathbb{R}^{t \times d_v}$$

Attention computation:

$$\text{scores} = q_t \cdot K_{\text{cache}}^T / \sqrt{d_k} \in \mathbb{R}^{1 \times t}$$
$$\text{output}_t = \text{softmax}(\text{scores}) \cdot V_{\text{cache}} \in \mathbb{R}^{1 \times d_v}$$
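
Translated into code, a single-head decode step might look like the sketch below (PyTorch; `decode_step` and the toy usage at the end are my own illustrative names and numbers, not a specific framework's API):

```python
import torch
import torch.nn.functional as F

def decode_step(x_t, W_Q, W_K, W_V, K_cache, V_cache):
    """One single-head decode step. x_t: (1, d_model); caches: (t-1, d_k) / (t-1, d_v)."""
    q_t = x_t @ W_Q                              # (1, d_k) -- used only in this step
    k_t = x_t @ W_K                              # (1, d_k) -- appended and reused by all later steps
    v_t = x_t @ W_V                              # (1, d_v)

    K_cache = torch.cat([K_cache, k_t], dim=0)   # (t, d_k)
    V_cache = torch.cat([V_cache, v_t], dim=0)   # (t, d_v)

    scores = q_t @ K_cache.T / K_cache.shape[-1] ** 0.5   # (1, t)
    out_t = F.softmax(scores, dim=-1) @ V_cache           # (1, d_v)
    return out_t, K_cache, V_cache

# Toy usage: 2 tokens already cached (e.g. from Prefill), decode the 3rd.
d = 3
W_Q = W_K = W_V = torch.eye(d)
out, K, V = decode_step(torch.randn(1, d), W_Q, W_K, W_V, torch.randn(2, d), torch.randn(2, d))
print(K.shape, V.shape, out.shape)   # torch.Size([3, 3]) torch.Size([3, 3]) torch.Size([1, 3])
```

Note that each call performs only one row of new projections; the cache simply grows by one row per step.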

Computation Comparison

| | Without KV Cache | With KV Cache |
| --- | --- | --- |
| Linear projections at step $t$ | Recompute all $t$ tokens | Only 1 new token |
| Linear projection FLOPs/step | $2 \times 3 \times t \times d^2$ | $2 \times 3 \times d^2$ |
| Total linear projection FLOPs | $O(N^2 d^2)$ | $O(N d^2)$ |
| Attention FLOPs/step | $O(t^2 d_k)$ | $O(t \cdot d_k)$ |
| Cost | No extra memory | Must cache K, V |

KV Cache trades memory for computation, reducing the total linear projection cost from $O(N^2)$ to $O(N)$.

Deep Dive: Why Only Cache K and V?

We've explained how KV Cache works, but a natural question arises: Attention involves Q, K, and V — why only cache K and V, not Q? Furthermore, why not cache the $QK^T$ attention scores directly?

Why Not Cache Q?

The key insight is that Q is “fire-and-forget,” while K and V are “persistently needed.”

At decode step $t$, the new token's query $q_t$ has exactly one purpose: compute dot products with all cached Keys to produce attention scores. Once the current step's attention is complete, $q_t$ is never used by any subsequent step.

In contrast, look at how K and V are used:

  • $k_t$ is needed by all future steps: the new queries at steps $t+1, t+2, ..., N$ all need to compute dot products with $k_t$
  • $v_t$ is needed by all future steps: every future step's attention weighted sum must include $v_t$

| Object | Cacheable? | Reason |
| --- | --- | --- |
| K | Yes | Every future step's new Q needs dot products with all historical Keys |
| V | Yes | Every future step's attention weighted sum needs all historical Values |
| Q | No | Each step only uses the current token's Q; fire-and-forget, no cross-step reuse |

An analogy: K and V are notes on a blackboard — every new student (new token) who enters must read the entire board. Q is each student’s own question — once asked, it’s done; future students don’t need to see your question.

Why Not Cache the $QK^T$ Attention Scores?

Another intuition might be: since step $t$ already computed the dot-product scores between $q_t$ and $k_1, ..., k_t$, can we store these scores for later use?

The answer is no, because the query is different at every step:

  • Step $t$: $\text{scores} = [q_t \cdot k_1,\ q_t \cdot k_2,\ ...,\ q_t \cdot k_t]$
  • Step $t+1$: $\text{scores} = [q_{t+1} \cdot k_1,\ q_{t+1} \cdot k_2,\ ...,\ q_{t+1} \cdot k_{t+1}]$

Although $k_1, ..., k_t$ haven't changed, $q_{t+1} \neq q_t$, so every score must be recomputed with the new query. The attention score is a bivariate function of Q and K — the K side is constant (hence we cache K), but the Q side is entirely new each step, making the function values (scores) completely non-reusable.

The Recursive Structure of Decode: A Dynamic Programming Perspective

Having understood “why only cache K and V,” we can appreciate KV Cache at a higher level — it is essentially memoization for autoregressive inference.

In a multi-layer Transformer, the complete data flow at decode step $t$ is:

New token x_t (embedding)


┌─ Layer 1 ──────────────────────────────────────────────────┐
│  q_t, k_t, v_t = x_t · W_Q, W_K, W_V                     │
│  KV Cache[1] ← append(k_t, v_t)                           │
│  attn_out = softmax(q_t · K_cache^T / √d_k) · V_cache     │
│  h_t^(1) = FFN(out_proj(attn_out) + x_t)                  │
└────────────────────────────────────────────────────────────┘
  │ h_t^(1)

┌─ Layer 2 ──────────────────────────────────────────────────┐
│  q_t, k_t, v_t = h_t^(1) · W_Q, W_K, W_V                 │
│  KV Cache[2] ← append(k_t, v_t)                           │
│  attn_out = softmax(q_t · K_cache^T / √d_k) · V_cache     │
│  h_t^(2) = FFN(out_proj(attn_out) + h_t^(1))              │
└────────────────────────────────────────────────────────────┘
  │ ... layer by layer ...

┌─ Layer L → h_t^(L) ───────────────────────────────────────┐
└────────────────────────────────────────────────────────────┘


lm_head(h_t^(L)) → logits → predict next token

Note the recursive dependency: layer $\ell$'s K/V come from the previous layer's output:

$$k_t^{(\ell)} = h_t^{(\ell-1)} \cdot W_K^{(\ell)}, \quad v_t^{(\ell)} = h_t^{(\ell-1)} \cdot W_V^{(\ell)}$$

And $h_t^{(\ell-1)}$ itself requires the attention computation at layer $\ell-1$. This means:

  • Layer 1’s K/V can be computed directly from token embeddings
  • But layer 2 and beyond depend on the previous layer’s attention output — this is a layer-by-layer recurrence

This is also why the Prefill stage cannot skip attention: even if our goal is to “fill the KV Cache,” deeper layers’ K/V can only be obtained through complete attention computation at shallower layers. Prefill is fundamentally a full forward pass, and the KV Cache is a byproduct of this process.
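
A minimal runnable sketch of this layer-by-layer recurrence (PyTorch, with norms, FFNs, and multi-head details omitted; all weights and dimensions are random toy values) makes the dependency explicit: layer $\ell$'s K/V can only be appended after layer $\ell-1$'s attention has produced $h_t^{(\ell-1)}$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
L, d = 4, 16                                            # toy depth and width (assumed)
Ws = [{k: torch.randn(d, d) / d ** 0.5 for k in ("Q", "K", "V")} for _ in range(L)]
caches = [(torch.zeros(0, d), torch.zeros(0, d)) for _ in range(L)]   # empty per-layer K/V

def decode_step_all_layers(x_t, Ws, caches):
    h = x_t                                              # (1, d): the new token's embedding
    for l, W in enumerate(Ws):
        K, V = caches[l]
        # Layer l's k/v are projected from h, i.e. from layer l-1's output,
        # so deep-layer caches cannot be filled without running the shallow layers first.
        q, k, v = h @ W["Q"], h @ W["K"], h @ W["V"]
        K, V = torch.cat([K, k]), torch.cat([V, v])      # append this step's row
        caches[l] = (K, V)
        attn = F.softmax(q @ K.T / d ** 0.5, dim=-1) @ V
        h = h + attn                                     # residual (FFN / norms omitted)
    return h, caches                                     # h feeds lm_head -> next-token logits

h, caches = decode_step_all_layers(torch.randn(1, d), Ws, caches)
print(h.shape, caches[0][0].shape)                       # torch.Size([1, 16]) torch.Size([1, 16])
```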

This recursive structure naturally maps to a Dynamic Programming (DP) framework:

| DP Concept | KV Cache Equivalent |
| --- | --- |
| Subproblem | K/V vectors at each (layer, position) |
| Recurrence | $k_t^{(\ell)} = f(h_t^{(\ell-1)})$; deeper layers depend on shallower attention outputs |
| No aftereffect | The causal mask guarantees that position $i$'s K/V depends only on positions $1, ..., i$, unaffected by future tokens |
| Memoization | Store computed K/V in cache; subsequent steps look up directly |
| Complexity optimization | Total linear projections reduced from $O(N^2 d^2)$ to $O(N d^2)$ |

The no-aftereffect property is the critical prerequisite that makes KV Cache valid: the causal mask ensures that any historical position’s K/V, once computed, never changes — regardless of what new tokens are generated in the future. This is exactly analogous to DP’s property that “solved subproblems are unaffected by subsequent decisions.”

Interactive Demo

The demo below simulates a 5-step Decode process. The initial cache contains K/V from the Prefill stage for 2 tokens. At each step, observe:

  • How the new token’s Q vector computes attention scores against the entire K Cache
  • How the K Cache and V Cache grow by one row each step (highlighted in green)
[Interactive demo: Decode Step 1 — Generate t₃. The current token t₃'s Query vector takes dot products with all Keys in the KV Cache, then the new k, v are appended, growing the cache from 2 rows to 3 rows. Green-highlighted rows mark the K/V appended in this step; the previous tokens' K and V are not recomputed — they are read directly from the cache.]

Memory Footprint Analysis

KV Cache is one of the main sources of memory consumption in LLM inference. Understanding its memory formula is crucial.

Single Layer, Single Head KV Cache

For one Transformer layer with one attention head, caching a sequence of length $S$:

$$\text{Single-head KV Cache} = 2 \times S \times d_k \times \text{dtype\_size}$$

where the factor of 2 accounts for one copy each of K and V.

Full Model KV Cache

For the complete model, let $L$ be the number of Transformer layers, $n_h$ the number of attention heads, and $d_k$ the per-head dimension ($d_k = d_{\text{model}} / n_h$):

$$\text{KV Cache} = 2 \times L \times n_h \times S \times d_k \times \text{dtype\_size}$$

Since $n_h \times d_k = d_{\text{model}}$, this simplifies to:

$$\boxed{\text{KV Cache} = 2 \times L \times d_{\text{model}} \times S \times \text{dtype\_size}}$$

Concrete Numbers

For LLaMA-2 7B ($L=32$, $d_{\text{model}}=4096$, FP16):

| Sequence length $S$ | KV Cache size |
| --- | --- |
| 512 | $2 \times 32 \times 4096 \times 512 \times 2$ B $=$ 256 MB |
| 2048 | $2 \times 32 \times 4096 \times 2048 \times 2$ B $=$ 1 GB |
| 4096 | $2 \times 32 \times 4096 \times 4096 \times 2$ B $=$ 2 GB |

For LLaMA-2 70B ($L=80$, $d_{\text{model}}=8192$, FP16):

| Sequence length $S$ | KV Cache size |
| --- | --- |
| 2048 | $2 \times 80 \times 8192 \times 2048 \times 2$ B $=$ 5.0 GiB |
| 4096 | $2 \times 80 \times 8192 \times 4096 \times 2$ B $=$ 10.0 GiB |

Note: The above uses the standard MHA formula ($n_h = 64$), assuming all heads cache KV independently. In practice, LLaMA-2 70B uses GQA ($n_{kv} = 8$), so the actual KV Cache is only $1/8$ of the table above (approximately 1.25 GiB at $S=4096$). See the GQA/MQA section below for details.

As you can see, for large models and long sequences, KV Cache memory consumption can match or even exceed the model parameters themselves. This is why KV Cache management and compression is a core direction in inference optimization.
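
These numbers are easy to reproduce with a small helper (plain Python; the function name and parameters are my own — the GQA case from the note above is handled by passing the number of KV heads):

```python
def kv_cache_bytes(num_layers, d_model, seq_len, dtype_bytes=2,
                   n_heads=None, n_kv_heads=None, batch_size=1):
    """KV Cache size in bytes; defaults to MHA (all heads cache K/V)."""
    if n_heads is None or n_kv_heads is None:
        kv_dim = d_model                              # MHA: n_h * d_k = d_model
    else:
        kv_dim = d_model // n_heads * n_kv_heads      # GQA/MQA: only n_kv heads are cached
    return batch_size * 2 * num_layers * kv_dim * seq_len * dtype_bytes

GiB = 1024 ** 3
print(kv_cache_bytes(32, 4096, 2048) / GiB)                            # LLaMA-2 7B,  S=2048 -> 1.0
print(kv_cache_bytes(80, 8192, 4096) / GiB)                            # 70B as MHA,  S=4096 -> 10.0
print(kv_cache_bytes(80, 8192, 4096, n_heads=64, n_kv_heads=8) / GiB)  # 70B with GQA -> 1.25
```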

[Interactive calculator: a single LLaMA-2 7B request at $S = 2048$ occupies $2 \times 32 \times 32 \times 2048 \times 128 \times 2$ B $\approx$ 1 GB of KV Cache — about 1.3% of an A100's 80 GB.]

Batch Scenario

When serving multiple concurrent requests, KV Cache grows linearly with the batch size $B$:

$$\text{Total KV Cache} = B \times 2 \times L \times d_{\text{model}} \times S \times \text{dtype\_size}$$

For example, LLaMA-2 7B with $S=2048$ and $B=32$: total KV Cache $= 32 \times 1\text{ GB} = 32$ GB. The KV Cache alone can exhaust an entire GPU's memory.
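
Plugging the same LLaMA-2 7B numbers into the batch formula (a one-off arithmetic check):

```python
GiB = 1024 ** 3
B, L, d_model, S, dtype_bytes = 32, 32, 4096, 2048, 2    # LLaMA-2 7B, FP16, 32 requests
print(B * 2 * L * d_model * S * dtype_bytes / GiB)        # 32.0 GiB of KV Cache alone
```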

Cache Management: Introduction to PagedAttention

Traditional KV Cache implementations face a serious problem: memory fragmentation.

Problems with the Traditional Approach

In the simplest implementation, the system pre-allocates cache space for the maximum sequence length per request. For example, with a max length of 2048, even if only 100 tokens are actually used, space for 2048 positions must be reserved. This leads to:

  • Internal fragmentation: Pre-allocated but unused space is wasted
  • External fragmentation: Space freed after requests complete is hard for new requests to fully utilize
  • Low memory utilization: Effective utilization may be only 20-40%

PagedAttention

PagedAttention (from vLLM, paper “Efficient Memory Management for Large Language Model Serving with PagedAttention”, arXiv 2309.06180) borrows the virtual memory paging concept from operating systems:

  1. Divide KV Cache space into fixed-size Blocks (similar to memory pages, typically 16 or 32 tokens)
  2. Each request maintains a Block Table (similar to a page table), mapping logical blocks to physical blocks
  3. Allocate blocks on demand: a new block is only allocated when the current block is full
  4. When a request finishes, its blocks can be reclaimed and assigned to new requests

Key advantages:

  • Eliminates fragmentation: Fixed-size block allocation and reclamation produces no fragmentation
  • Dynamic growth: No need to pre-allocate maximum length; grows on demand
  • High memory utilization: Effective utilization can approach 100%
  • Supports complex scheduling: For example, multiple beam search candidates can share prefix KV Cache blocks
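
A minimal sketch of the block-table bookkeeping (illustrative only — the class and method names here are invented, and the real vLLM implementation manages GPU memory and copy-on-write sharing on top of this idea):

```python
BLOCK_SIZE = 16                      # tokens per block (typical values: 16 or 32)

class BlockAllocator:
    """Tracks free physical blocks and per-request block tables."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.block_tables = {}       # request_id -> [physical block ids], i.e. the "page table"

    def append_token(self, request_id, num_tokens_so_far):
        """Allocate a new physical block only when the current one is full."""
        table = self.block_tables.setdefault(request_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:       # current block full (or first token)
            table.append(self.free.pop())             # next logical block -> a free physical block
        return table

    def release(self, request_id):
        """On completion, all of the request's blocks return to the free pool."""
        self.free.extend(self.block_tables.pop(request_id, []))
```

With this scheme, a request that has generated 100 tokens holds $\lceil 100/16 \rceil = 7$ blocks instead of a 2048-token reservation, and all 7 become reusable the moment the request completes.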
[Interactive demo: side-by-side GPU memory view of traditional pre-allocation (Request A / Request B with internal and external fragmentation) versus PagedAttention block allocation.]

Continuous Batching

Another important technique that works alongside PagedAttention is Continuous Batching:

  • Traditional Static Batching: All requests in a batch must wait for the longest one to finish before new requests can be processed
  • Continuous Batching: When a request finishes, a new request is immediately inserted to fill the slot, without waiting for other requests

This keeps GPU utilization consistently high, significantly improving inference throughput. Mainstream inference engines like vLLM and TensorRT-LLM all adopt the PagedAttention + Continuous Batching combination.
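
The scheduling idea can be captured by a toy simulation (plain Python; the request IDs, lengths, and the 3-slot limit are invented for illustration, and real engines interleave Prefill and Decode far more carefully):

```python
from collections import deque

def continuous_batching(requests, num_slots=3):
    """Toy scheduler: each step decodes every active request by one token
    and immediately refills freed slots from the waiting queue."""
    waiting = deque(requests)             # (request_id, tokens_to_generate)
    active = {}                           # request_id -> remaining tokens
    steps = 0
    while waiting or active:
        while waiting and len(active) < num_slots:    # fill free slots right away
            rid, length = waiting.popleft()
            active[rid] = length
        for rid in list(active):                       # one decode step per active request
            active[rid] -= 1
            if active[rid] == 0:                       # finished -> slot freed this step
                del active[rid]
        steps += 1
    return steps

# 10 steps; static batching ({A,B,C} then {D,E}) would need max(4,8,3) + max(5,6) = 14
print(continuous_batching([("A", 4), ("B", 8), ("C", 3), ("D", 5), ("E", 6)]))
```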

[Timeline demo: Continuous Batching with 3 GPU slots — requests A–E enter and exit dynamically; when A and C complete, their slots are immediately filled by the newly arrived D and E (dark = Prefill, light = Decode), with no idle wait and maximal GPU utilization.]

Relationship with GQA/MQA: Another Way to Reduce KV Cache

The KV Cache memory bottleneck has driven researchers to seek solutions at the model architecture level.

Revisiting the KV Cache Formula

$$\text{KV Cache} = 2 \times L \times \underbrace{n_h \times d_k}_{d_{\text{model}}} \times S \times \text{dtype\_size}$$

where $n_h \times d_k$ corresponds to the KV dimensions across all attention heads. If we can reduce the number of KV heads that need caching, we directly shrink the KV Cache.

MQA (Multi-Query Attention)

MQA has all attention heads share a single set of K and V, keeping only Q multi-headed. The KV Cache shrinks to:

$$\text{KV Cache}_{\text{MQA}} = 2 \times L \times 1 \times d_k \times S \times \text{dtype\_size}$$

This is $n_h$ times smaller than MHA (e.g., 32× smaller when $n_h=32$).

GQA (Grouped-Query Attention)

GQA is a compromise between MHA and MQA: the $n_h$ Q heads are divided into $g$ groups, each sharing one set of K and V. The KV Cache becomes:

$$\text{KV Cache}_{\text{GQA}} = 2 \times L \times g \times d_k \times S \times \text{dtype\_size}$$

This is $n_h/g$ times smaller than MHA. LLaMA-2 70B uses GQA ($n_h=64$, $g=8$), reducing the KV Cache to $1/8$ of the original.
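
In cache terms, the only difference between MHA, GQA, and MQA is how many KV heads each layer stores. The sketch below uses LLaMA-2 70B-like numbers and a common but arbitrary (2, n_kv_heads, seq_len, d_k) layout (my own convention, not a specific library's) to show the per-layer sizes:

```python
import math

# Per-layer KV cache shape: (2, n_kv_heads, seq_len, d_k) -- the leading 2 holds K and V
n_h, d_k, S = 64, 128, 4096                    # LLaMA-2 70B-like numbers (assumed)

shapes = {
    "MHA": (2, n_h, S, d_k),                   # 64 KV heads  -> baseline
    "GQA": (2, 8, S, d_k),                     # 8 KV groups  -> 1/8 of MHA (LLaMA-2 70B)
    "MQA": (2, 1, S, d_k),                     # 1 shared KV head -> 1/64 of MHA
}
for name, shape in shapes.items():
    mib = math.prod(shape) * 2 / 2 ** 20       # FP16: 2 bytes per element
    print(f"{name}: {shape} -> {mib:.0f} MiB per layer")   # 128 / 16 / 2 MiB
```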

Comparison

| Method | KV Heads | Relative KV Cache Size | Quality Impact |
| --- | --- | --- | --- |
| MHA | $n_h$ | $1\times$ (baseline) | Best |
| GQA | $g$ ($1 < g < n_h$) | $g/n_h$ | Minimal loss |
| MQA | 1 | $1/n_h$ | Slight loss |

GQA significantly reduces KV Cache with virtually no impact on model quality, and has been adopted by mainstream models including LLaMA-2/3 and Mistral. For a more detailed introduction to GQA/MQA, see the MQA and GQA article.

Summary

| Concept | Description |
| --- | --- |
| KV Cache | Caches computed K and V vectors to avoid redundant computation during Decode |
| Cost without Cache | Recomputes all KV each step; total computation $O(N^2 d^2)$ |
| Speedup with Cache | Computes only the 1 new token's KV per step and appends it; total $O(N d^2)$ |
| Memory formula | $2 \times L \times d_{\text{model}} \times S \times \text{dtype\_size}$ |
| PagedAttention | Borrows virtual memory paging to eliminate memory fragmentation and improve utilization |
| GQA/MQA | Reduces the KV head count at the architecture level, directly shrinking the cache |
| Core tradeoff | KV Cache trades memory for computation — a classic space-time tradeoff |

Core intuition: KV Cache is like an ever-growing “notebook.” During autoregressive generation, each new token only needs to add one line of its own K and V at the end of the notebook, then flip through the entire notebook to decide what to attend to. Without this notebook, every step would require recomputing all previous tokens’ K and V from scratch — like having to review everything from the very first lesson before every class, which is obviously a massive waste.