KV Cache Fundamentals
Updated 2026-04-13
Introduction: KV Cache Is Key to Inference Acceleration
In the previous article, we learned that LLM inference is divided into Prefill and Decode stages. The Decode stage is autoregressive token-by-token generation — each step produces only one new token.
Here lies a critical issue: in standard Self-Attention, each new token’s Query needs to compute dot products with the Keys of all preceding tokens in the sequence, then compute a weighted sum using all preceding tokens’ Values. If we recompute all tokens’ K and V from scratch at every step, the computation grows quadratically with generation length — this is unacceptable.
The core idea of KV Cache is simple: cache the already-computed Key and Value vectors, and at each step only compute K and V for the new token, then append them to the cache. This reduces the Decode stage’s total computational complexity from $O(n^2 \cdot d^2)$ to $O(n \cdot d^2)$ (for the linear layers), making it standard practice in modern LLM inference systems.
The Problem: Wasted Computation Without Cache
Naive Autoregressive Generation
Consider a Decode process generating a sequence of length $n$. Without KV Cache, at step $t$, when generating the next token, the model needs to:
- Feed all $t$ tokens so far into the Transformer
- For each layer, compute $Q$, $K$, $V$ for all $t$ tokens
- Execute the full Attention computation
- Take only the last position’s output to predict the next token
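To make the waste concrete, here is a minimal numpy sketch of the naive, cache-free approach for a single attention layer (toy dimension, random weights, purely illustrative):

```python
import numpy as np

d = 8  # toy model dimension, for illustration only
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

def naive_decode_step(X):
    """X: (t, d) -- embeddings of ALL t tokens so far, re-fed at every step."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # recomputes K and V for every token
    scores = (Q[-1:] @ K.T) / np.sqrt(d)  # yet only the last Query is ever used
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # (1, d) output for next-token prediction
```

The first $t-1$ rows of `K` and `V` come out identical on every call, recomputed for nothing.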
In the total computation, the linear projections alone require:

$$\sum_{t=1}^{n} 3 \cdot t \cdot d^2 = 3 d^2 \cdot \frac{n(n+1)}{2} = O(n^2 \cdot d^2)$$

where the factor $3$ corresponds to the computation for the Q, K, V projection matrices (each token passes through $W_Q$, $W_K$, $W_V$).
Where Is the Waste?
Key observation: the K and V of the first $t-1$ tokens at step $t$ are identical to those at step $t-1$. Since these tokens’ inputs haven’t changed (the Decoder uses causal masking, so earlier tokens’ representations are unaffected by later tokens), their K and V projections produce exactly the same results.
In other words, every step redundantly recomputes Keys and Values that have already been calculated — this is pure computational waste.
(The only exception is the first step, which has nothing to reuse and hence no redundant computation.)
KV Cache Mechanism: Cache and Append
Core Idea
The KV Cache mechanism is very straightforward:
- Prefill stage: Process the full prompt, compute K and V for all tokens, and save the results to cache
- Each Decode step:
- Feed only the 1 new token into the model
- Compute the new token’s $q_t$, $k_t$, $v_t$ (single vectors, not matrices)
- Append $k_t$ and $v_t$ to the end of the cache
- Compute Attention between $q_t$ and the full K Cache
- Compute weighted sum over the full V Cache using the Attention weights
Mathematical Formulation
At step $t$, with input hidden state $x_t$:
- New token’s Query: $q_t = x_t W_Q$
- New token’s Key: $k_t = x_t W_K$
- New token’s Value: $v_t = x_t W_V$

Cache update:

$$K_{\text{cache}} \leftarrow \operatorname{concat}(K_{\text{cache}}, k_t), \qquad V_{\text{cache}} \leftarrow \operatorname{concat}(V_{\text{cache}}, v_t)$$

Attention computation:

$$\operatorname{out}_t = \operatorname{softmax}\!\left(\frac{q_t K_{\text{cache}}^\top}{\sqrt{d_k}}\right) V_{\text{cache}}$$
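These update rules translate almost line-for-line into code. Below is a minimal single-layer, single-head numpy sketch (toy dimension, random weights, no batching; an illustration of the mechanism, not a production implementation):

```python
import numpy as np

d = 8  # toy model dimension, for illustration only
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

K_cache = np.zeros((0, d))  # in a real system, Prefill fills these from the prompt
V_cache = np.zeros((0, d))

def cached_decode_step(x_t):
    """x_t: (1, d) -- only the NEW token's embedding enters the model."""
    global K_cache, V_cache
    q_t, k_t, v_t = x_t @ W_Q, x_t @ W_K, x_t @ W_V  # O(d^2), independent of t
    K_cache = np.vstack([K_cache, k_t])              # append k_t to the cache
    V_cache = np.vstack([V_cache, v_t])              # append v_t to the cache
    scores = (q_t @ K_cache.T) / np.sqrt(d)          # q_t against the FULL K cache
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over all cached positions
    return weights @ V_cache                         # weighted sum over the V cache
```

Each call touches only one new token’s projections, so the per-step cost of the linear layers no longer depends on how much has already been generated.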
Computation Comparison
| | Without KV Cache | With KV Cache |
|---|---|---|
| Linear projections at step $t$ | Recompute all $t$ tokens | Only 1 new token |
| Linear projection FLOPs/step | $O(t \cdot d^2)$ | $O(d^2)$ |
| Total linear projection FLOPs | $O(n^2 \cdot d^2)$ | $O(n \cdot d^2)$ |
| Attention FLOPs/step | $O(t^2 \cdot d)$ | $O(t \cdot d)$ |
| Cost | No extra memory | Must cache K, V |
KV Cache trades memory for computation, reducing the total linear projection computation from $O(n^2 \cdot d^2)$ to $O(n \cdot d^2)$.
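Plugging in concrete (hypothetical) numbers, say a generation length of $n = 4096$ at model dimension $d = 4096$, makes the gap tangible:

```python
n, d = 4096, 4096  # hypothetical generation length and model dimension

# counting 2 FLOPs per multiply-accumulate, 3 projections (Q, K, V) per token
flops_no_cache = sum(2 * 3 * t * d * d for t in range(1, n + 1))  # ~3 n^2 d^2
flops_cached   = 2 * 3 * n * d * d                                # ~3 n d^2

print(f"{flops_no_cache:.2e} vs {flops_cached:.2e} FLOPs, "
      f"ratio = {flops_no_cache / flops_cached:.0f}x")  # ratio = (n + 1) / 2, about 2048x
```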
Deep Dive: Why Only Cache K and V?
We’ve explained how KV Cache works, but a natural question arises: Attention involves Q, K, and V — why only cache K and V, not Q? Furthermore, why not cache the attention scores directly?
Why Not Cache Q?
The key insight is that Q is “fire-and-forget,” while K and V are “persistently needed.”
At decode step $t$, the new token’s query $q_t$ has exactly one purpose: compute dot products with all cached Keys to produce attention scores. Once the current step’s attention is complete, $q_t$ is never used by any subsequent step.
In contrast, look at how K and V are used:
- $k_t$ is needed by all future steps: the new queries at steps $t+1, t+2, \dots$ all need to compute dot products with $k_t$
- $v_t$ is needed by all future steps: every future step’s attention weighted sum must include $v_t$
| Object | Cacheable? | Reason |
|---|---|---|
| K | Yes | Every future step’s new Q needs dot products with all historical Keys |
| V | Yes | Every future step’s attention weighted sum needs all historical Values |
| Q | No | Each step only uses the current token’s Q; fire-and-forget, no cross-step reuse |
An analogy: K and V are notes on a blackboard — every new student (new token) who enters must read the entire board. Q is each student’s own question — once asked, it’s done; future students don’t need to see your question.
Why Not Cache Attention Scores?
Another intuition might be: since step $t$ already computed the dot product scores between $q_t$ and $k_1, \dots, k_t$, can we store these scores for later use?

The answer is no, because the query is different at every step:

- Step $t$: scores are $q_t \cdot k_1, \; q_t \cdot k_2, \; \dots, \; q_t \cdot k_t$
- Step $t+1$: scores are $q_{t+1} \cdot k_1, \; q_{t+1} \cdot k_2, \; \dots, \; q_{t+1} \cdot k_{t+1}$

Although $k_1, \dots, k_t$ haven’t changed, $q_{t+1} \neq q_t$, so every score must be recomputed with the new query. The attention score is a bivariate function of Q and K — the K side is constant (hence we cache K), but the Q side is entirely new each step, making the function values (scores) completely non-reusable.
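A few lines of numpy make this concrete (toy sizes, random vectors, names mine):

```python
import numpy as np

rng = np.random.default_rng(1)
K = rng.standard_normal((3, 4))     # cached keys k_1..k_3: identical at both steps
q_t = rng.standard_normal((1, 4))   # query at step t
q_t1 = rng.standard_normal((1, 4))  # query at step t+1: a brand-new vector

print(q_t @ K.T)   # scores at step t
print(q_t1 @ K.T)  # step t+1: same K, new q, so every score differs
```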
The Recursive Structure of Decode: A Dynamic Programming Perspective
Having understood “why only cache K and V,” we can appreciate KV Cache at a higher level — it is essentially memoization for autoregressive inference.
In a multi-layer Transformer, the complete data flow at decode step $t$ is:
New token x_t (embedding)
│
▼
┌─ Layer 1 ──────────────────────────────────────────────────┐
│ q_t, k_t, v_t = x_t · W_Q, W_K, W_V │
│ KV Cache[1] ← append(k_t, v_t) │
│ attn_out = softmax(q_t · K_cache^T / √d_k) · V_cache │
│ h_t^(1) = FFN(out_proj(attn_out) + x_t) │
└────────────────────────────────────────────────────────────┘
│ h_t^(1)
▼
┌─ Layer 2 ──────────────────────────────────────────────────┐
│ q_t, k_t, v_t = h_t^(1) · W_Q, W_K, W_V │
│ KV Cache[2] ← append(k_t, v_t) │
│ attn_out = softmax(q_t · K_cache^T / √d_k) · V_cache │
│ h_t^(2) = FFN(out_proj(attn_out) + h_t^(1)) │
└────────────────────────────────────────────────────────────┘
│ ... layer by layer ...
▼
┌─ Layer L → h_t^(L) ───────────────────────────────────────┐
└────────────────────────────────────────────────────────────┘
│
▼
lm_head(h_t^(L)) → logits → predict next token
Note the recursive dependency: layer $\ell$’s K/V come from the previous layer’s output:

$$k_t^{(\ell)} = h_t^{(\ell-1)} W_K^{(\ell)}, \qquad v_t^{(\ell)} = h_t^{(\ell-1)} W_V^{(\ell)}$$

And $h_t^{(\ell-1)}$ itself requires the attention computation at layer $\ell-1$. This means:
- Layer 1’s K/V can be computed directly from token embeddings
- But layer 2 and beyond depend on the previous layer’s attention output — this is a layer-by-layer recurrence
This is also why the Prefill stage cannot skip attention: even if our goal is to “fill the KV Cache,” deeper layers’ K/V can only be obtained through complete attention computation at shallower layers. Prefill is fundamentally a full forward pass, and the KV Cache is a byproduct of this process.
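The sketch below makes the layer-by-layer recurrence explicit: one `forward` function serves both Prefill (many tokens) and Decode (one token), and in both cases each layer must run attention before the next layer’s K/V can exist. It is deliberately simplified: residual connections, FFN, output projection, and normalization are omitted:

```python
import numpy as np

NUM_LAYERS, d = 2, 8  # toy sizes
rng = np.random.default_rng(0)
layers = [{w: rng.standard_normal((d, d)) for w in ("W_Q", "W_K", "W_V")}
          for _ in range(NUM_LAYERS)]
kv_cache = [{"K": np.zeros((0, d)), "V": np.zeros((0, d))} for _ in range(NUM_LAYERS)]

def attend(Q, K, V):
    t_q, t_k = Q.shape[0], K.shape[0]
    scores = (Q @ K.T) / np.sqrt(d)
    causal = np.tril(np.ones((t_q, t_k)), k=t_k - t_q)  # queries sit at the cache's tail
    scores = np.where(causal > 0, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def forward(H):
    """H: (t, d). Prefill passes the whole prompt (t > 1); Decode passes 1 token."""
    for layer, cache in zip(layers, kv_cache):
        cache["K"] = np.vstack([cache["K"], H @ layer["W_K"]])  # fill this layer's cache
        cache["V"] = np.vstack([cache["V"], H @ layer["W_V"]])
        # the next layer's K/V can only come from THIS layer's attention output:
        H = attend(H @ layer["W_Q"], cache["K"], cache["V"])
    return H  # final hidden states; a full model would apply lm_head here

forward(rng.standard_normal((2, d)))  # Prefill: a 2-token prompt fills every layer's cache
forward(rng.standard_normal((1, d)))  # Decode: 1 new token appends one row per layer
```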
This recursive structure naturally maps to a Dynamic Programming (DP) framework:
| DP Concept | KV Cache Equivalent |
|---|---|
| Subproblem | K/V vectors at each (layer, position) |
| Recurrence | $k_t^{(\ell)}, v_t^{(\ell)} = f\big(h_t^{(\ell-1)}\big)$; deeper layers depend on shallower attention outputs |
| No aftereffect | Causal mask guarantees: position $i$’s K/V depend only on tokens $1 \dots i$, unaffected by future tokens |
| Memoization | Store computed K/V in cache; subsequent steps look up directly |
| Complexity optimization | Total linear projections reduced from $O(n^2 \cdot d^2)$ to $O(n \cdot d^2)$ |
The no-aftereffect property is the critical prerequisite that makes KV Cache valid: the causal mask ensures that any historical position’s K/V, once computed, never changes — regardless of what new tokens are generated in the future. This is exactly analogous to DP’s property that “solved subproblems are unaffected by subsequent decisions.”
Interactive Demo
The demo below simulates a 5-step Decode process. The initial cache contains K/V from the Prefill stage for 2 tokens. At each step, observe:
- How the new token’s Q vector computes attention scores against the entire K Cache
- How the K Cache and V Cache grow by one row each step (highlighted in green)
Current token t₃’s Query vector takes dot products with all Keys in the KV Cache, then the new K and V are appended to the cache, growing it from 2 rows to 3 rows.
Memory Footprint Analysis
KV Cache is one of the main sources of memory consumption in LLM inference. Understanding its memory formula is crucial.
Single Layer, Single Head KV Cache
For one Transformer layer with one attention head, caching a sequence of length $S$:

$$\text{Memory} = 2 \times S \times d_{\text{head}} \times \text{bytes per element}$$

where the factor of 2 accounts for one copy each of K and V.
Full Model KV Cache
For the complete model, let $L$ be the number of Transformer layers, $H$ the number of attention heads, and $d_{\text{head}}$ the per-head dimension ($d = H \times d_{\text{head}}$):

$$\text{KV Cache} = 2 \times L \times H \times d_{\text{head}} \times S \times \text{bytes per element}$$

Since $H \times d_{\text{head}} = d$, this simplifies to:

$$\text{KV Cache} = 2 \times L \times d \times S \times \text{bytes per element}$$
Concrete Numbers
For LLaMA-2 7B ($L = 32$, $d = 4096$, FP16, i.e. 2 bytes per element):

| Sequence length $S$ | KV Cache size |
|---|---|
| 512 | 256 MiB |
| 2048 | 1 GiB |
| 4096 | 2 GiB |
For LLaMA-2 70B ($L = 80$, $d = 8192$, FP16):

| Sequence length $S$ | KV Cache size |
|---|---|
| 2048 | 5 GiB |
| 4096 | 10 GiB |
Note: The above uses the standard MHA formula (all $H$ heads cache their K/V independently). In practice, LLaMA-2 70B uses GQA ($H = 64$, $G = 8$), so the actual KV Cache is only $1/8$ of the table above (approximately 1.25 GiB at $S = 4096$). See the GQA/MQA section below for details.
As you can see, for large models and long sequences, KV Cache memory consumption can match or even exceed the model parameters themselves. This is why KV Cache management and compression is a core direction in inference optimization.
Batch Scenario
When serving multiple concurrent requests, KV Cache grows linearly with batch size $B$:

$$\text{KV Cache} = 2 \times B \times L \times d \times S \times \text{bytes per element}$$

For example, LLaMA-2 7B with $B = 32$ and $S = 2048$: total KV Cache $= 32 \times 1\,\text{GiB} = 32\,\text{GiB}$. The KV Cache alone can exhaust an entire GPU’s memory.
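These numbers are easy to reproduce; a small helper (the function name is mine, not from any library) evaluates the formula above:

```python
def kv_cache_gib(L, d, S, B=1, bytes_per_el=2):
    """2 (K and V) x batch x layers x seq len x model dim x precision bytes, in GiB."""
    return 2 * B * L * S * d * bytes_per_el / 2**30

print(kv_cache_gib(L=32, d=4096, S=2048))        # LLaMA-2 7B: 1.0 GiB
print(kv_cache_gib(L=80, d=8192, S=4096))        # 70B, MHA formula: 10.0 GiB
print(kv_cache_gib(L=32, d=4096, S=2048, B=32))  # 7B at batch 32: 32.0 GiB
```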
Cache Management: Introduction to PagedAttention
Traditional KV Cache implementations face a serious problem: memory fragmentation.
Problems with the Traditional Approach
In the simplest implementation, the system pre-allocates cache space for the maximum sequence length per request. For example, with a max length of 2048, even if only 100 tokens are actually used, space for 2048 positions must be reserved. This leads to:
- Internal fragmentation: Pre-allocated but unused space is wasted
- External fragmentation: Space freed after requests complete is hard for new requests to fully utilize
- Low memory utilization: Effective utilization may be only 20-40%
PagedAttention
PagedAttention (from vLLM, paper “Efficient Memory Management for Large Language Model Serving with PagedAttention”, arXiv 2309.06180) borrows the virtual memory paging concept from operating systems:
- Divide KV Cache space into fixed-size Blocks (similar to memory pages, typically 16 or 32 tokens)
- Each request maintains a Block Table (similar to a page table), mapping logical blocks to physical blocks
- Allocate blocks on demand: a new block is only allocated when the current block is full
- When a request finishes, its blocks can be reclaimed and assigned to new requests
Key advantages:
- Eliminates fragmentation: Fixed-size block allocation and reclamation produces no fragmentation
- Dynamic growth: No need to pre-allocate maximum length; grows on demand
- High memory utilization: Effective utilization can approach 100%
- Supports complex scheduling: For example, multiple beam search candidates can share prefix KV Cache blocks
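To illustrate the bookkeeping, here is a toy model of the idea (my own simplification, not vLLM’s actual implementation); a block table is just a list of physical block IDs per request:

```python
BLOCK_SIZE = 16  # tokens per block; real systems typically use 16 or 32

class PagedKVCache:
    """Toy allocator mapping each request's logical positions to physical blocks."""
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))  # shared free pool
        self.block_tables = {}  # request id -> list of physical block ids ("page table")
        self.num_tokens = {}    # request id -> tokens written so far

    def append_token(self, req_id):
        """Reserve a slot for one new token's K/V; allocate a block only when needed."""
        table = self.block_tables.setdefault(req_id, [])
        n = self.num_tokens.get(req_id, 0)
        if n % BLOCK_SIZE == 0:                   # current block is full (or first token)
            table.append(self.free_blocks.pop())  # on-demand allocation, nothing pre-reserved
        self.num_tokens[req_id] = n + 1
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE  # (physical block, offset in block)

    def release(self, req_id):
        """Request finished: its blocks return to the pool for new requests."""
        self.free_blocks.extend(self.block_tables.pop(req_id))
        del self.num_tokens[req_id]
```

Because every allocation and reclamation happens in whole fixed-size blocks, the pool never fragments, and a request holds at most one partially filled block.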
Continuous Batching
Another important technique that works alongside PagedAttention is Continuous Batching:
- Traditional Static Batching: All requests in a batch must wait for the longest one to finish before new requests can be processed
- Continuous Batching: When a request finishes, a new request is immediately inserted to fill the slot, without waiting for other requests
This keeps GPU utilization consistently high, significantly improving inference throughput. Mainstream inference engines like vLLM and TensorRT-LLM all adopt the PagedAttention + Continuous Batching combination.
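A toy step-count simulation (entirely made up for illustration; real schedulers are far more involved) shows why slot reuse matters. With 2 slots and requests needing 8, 1, 1, 1, 1, 1 decode steps, static batching takes 10 steps while continuous batching takes 8:

```python
from collections import deque

def total_steps(jobs, slots, continuous):
    """jobs: decode steps each request needs; returns total decode steps on the GPU."""
    queue, active, t = deque(jobs), [], 0
    while queue or active:
        if continuous or not active:               # static batching refills only when empty
            while queue and len(active) < slots:
                active.append(queue.popleft())     # a freed slot is reused immediately
        active = [s - 1 for s in active if s > 1]  # run one decode step; drop finished jobs
        t += 1
    return t

print(total_steps([8, 1, 1, 1, 1, 1], slots=2, continuous=False))  # static: 10
print(total_steps([8, 1, 1, 1, 1, 1], slots=2, continuous=True))   # continuous: 8
```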
Relationship with GQA/MQA: Another Way to Reduce KV Cache
The KV Cache memory bottleneck has driven researchers to seek solutions at the model architecture level.
Revisiting the KV Cache Formula
$$\text{KV Cache} = 2 \times L \times H \times d_{\text{head}} \times S \times \text{bytes per element}$$

where $H \times d_{\text{head}}$ corresponds to the KV dimensions across all attention heads. If we can reduce the number of KV heads that need caching, we directly shrink the KV Cache.
MQA (Multi-Query Attention)
MQA has all attention heads share a single set of K and V, keeping only Q multi-headed. The KV Cache shrinks to:

$$\text{KV Cache}_{\text{MQA}} = 2 \times L \times 1 \times d_{\text{head}} \times S \times \text{bytes per element}$$

This is $H$ times smaller compared to MHA (e.g., 32x smaller when $H = 32$).
GQA (Grouped-Query Attention)
GQA is a compromise between MHA and MQA: the $H$ Q heads are divided into $G$ groups, each group sharing one set of K and V. The KV Cache becomes:

$$\text{KV Cache}_{\text{GQA}} = 2 \times L \times G \times d_{\text{head}} \times S \times \text{bytes per element}$$

This is $H/G$ times smaller compared to MHA. LLaMA-2 70B uses GQA ($H = 64$, $G = 8$), reducing KV Cache to $1/8$ of the original.
Comparison
| Method | KV Heads | Relative KV Cache Size | Quality Impact |
|---|---|---|---|
| MHA | $H$ | $1\times$ (baseline) | Best |
| GQA | $G$ ($1 < G < H$) | $G/H$ | Minimal loss |
| MQA | $1$ | $1/H$ | Slight loss |
GQA significantly reduces KV Cache with virtually no impact on model quality, and has been adopted by mainstream models including LLaMA-2/3 and Mistral. For a more detailed introduction to GQA/MQA, see the MQA and GQA article.
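Generalizing the earlier memory helper with an explicit KV-head count (again an illustrative function of my own) reproduces the 70B numbers from the memory section:

```python
def kv_cache_gib(L, n_kv_heads, d_head, S, bytes_per_el=2):
    """2 (K and V) x layers x KV heads x head dim x seq len x precision bytes, in GiB."""
    return 2 * L * n_kv_heads * d_head * S * bytes_per_el / 2**30

# LLaMA-2 70B at S = 4096: H = 64 query heads, d_head = 128
print(kv_cache_gib(80, 64, 128, 4096))  # MHA: 10.0 GiB
print(kv_cache_gib(80, 8, 128, 4096))   # GQA (G = 8): 1.25 GiB
print(kv_cache_gib(80, 1, 128, 4096))   # MQA: ~0.16 GiB
```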
Summary
| Concept | Description |
|---|---|
| KV Cache | Caches computed K and V vectors to avoid redundant computation during Decode |
| Cost without Cache | Recomputes all KV each step; total computation $O(n^2 \cdot d^2)$ |
| Speedup with Cache | Computes only 1 new token’s KV per step and appends; total $O(n \cdot d^2)$ |
| Memory formula | $2 \times B \times L \times d \times S \times \text{bytes per element}$ |
| PagedAttention | Borrows virtual memory paging to eliminate memory fragmentation and improve utilization |
| GQA/MQA | Reduces KV head count at the architecture level, directly shrinking cache |
| Core tradeoff | KV Cache trades memory for computation — a classic space-time tradeoff |
Core intuition: KV Cache is like an ever-growing “notebook.” During autoregressive generation, each new token only needs to add one line of its own K and V at the end of the notebook, then flip through the entire notebook to decide what to attend to. Without this notebook, every step would require recomputing all previous tokens’ K and V from scratch — like having to review everything from the very first lesson before every class, which is obviously a massive waste.