
QKV Data Structures and Intuition

Updated 2026-04-06

Introduction: Why We Need QKV

In the previous article, we saw the core formula of Self-Attention:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Three matrices appear in this formula: Q (Query), K (Key), and V (Value). Where do they come from? Why do we need three different matrices instead of directly using the input itself?

The answer is: through three separate linear projections, the same input plays different roles in different subspaces. This allows the model to flexibly learn three fundamentally different functions: “what to ask,” “what to match against,” and “what information to extract.”

Intuitive Understanding: The Library Search Analogy

Imagine you walk into a library looking for information about “Transformer architecture”:

| Role | Analogy | Explanation |
| --- | --- | --- |
| Query | The question in your mind | “I want to learn about Transformer architecture” — you search with this question |
| Key | Book tags / index cards | Each book has keyword tags: “deep learning,” “attention mechanism,” “NLP,” etc. |
| Value | The actual book content | After the tags match, what you actually read is the text inside the book |

The search process works as follows:

  1. Query compared with Keys: Your question (Query) is matched against each book’s tags (Key) to compute a “relevance score”
  2. Softmax normalization: Scores are converted into a probability distribution — the most relevant books get the highest weights
  3. Weighted extraction of Values: Based on relevance weights, information is extracted from each book’s content (Value)

The key insight is: the tags (Key) and the content (Value) are different representations of the same book. Tags are used for quick matching, while the content is the information you actually need. Similarly, in Attention, Keys are used to compute attention weights, and Values are used to aggregate information — they come from the same input but serve different functions through different projections.
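To make the three steps concrete, here is a tiny numeric sketch. The scores and value vectors below are invented purely for illustration and do not come from any real model:

```python
import torch

# Hypothetical relevance scores ("query · key") between one question and
# three books. The numbers are made up purely to illustrate the steps.
scores = torch.tensor([2.0, 0.5, -1.0])

# Step 2: softmax turns raw scores into weights that sum to 1.
weights = torch.softmax(scores, dim=0)        # ≈ [0.79, 0.18, 0.04]

# Step 3: each book's "content" (Value) is a small vector; the result is a
# weighted average of the contents, dominated by the best-matching book.
values = torch.tensor([[1.0, 0.0],
                       [0.0, 1.0],
                       [0.5, 0.5]])
output = weights @ values                     # ≈ [0.81, 0.19]
```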

If we skip the projections and directly use the input X as Q, K, and V simultaneously, the model can only compute the input’s raw “self-similarity,” resulting in extremely limited expressiveness. Three independent projections give the model much greater flexibility.

Mathematical Definition: Linear Projection Formulas and Dimensions

Input

Let the input sequence matrix after Embedding be X:

X \in \mathbb{R}^{S \times H}

Where:

  • S is the sequence length (number of tokens)
  • H is the hidden dimension (hidden size)

X: (S = seq_len, H = hidden)

Weight Matrices

Three sets of learnable weight matrices:

W_Q \in \mathbb{R}^{H \times d_k}, \quad W_K \in \mathbb{R}^{H \times d_k}, \quad W_V \in \mathbb{R}^{H \times d_v}

Where d_k (also commonly written as d_{\text{head}}) is the dimension of each attention head. In the standard Transformer, typically d_k = d_v.

Linear Projection

Computing Q, K, V is very straightforward — it is just matrix multiplication:

Q = X W_Q \in \mathbb{R}^{S \times d_k}

K = X W_K \in \mathbb{R}^{S \times d_k}

V = X W_V \in \mathbb{R}^{S \times d_v}

Each token’s H-dimensional hidden vector is projected (linearly transformed) into a new d_k-dimensional space. This is the so-called “linear projection.”

Why is it called “projection”? From a geometric perspective, multiplying by a weight matrix projects a vector from a high-dimensional space onto a lower-dimensional subspace. Different weight matrices define different subspaces — the Query space, Key space, and Value space.
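As a minimal sketch, the three projections are nothing more than three matrix multiplications. The sizes below (S = 4, H = 6, d_k = d_v = 3) are illustrative and match the small example in the next section; the weights are random stand-ins for learned parameters:

```python
import torch

S, H, d_k, d_v = 4, 6, 3, 3

X = torch.randn(S, H)              # one hidden vector per token

# Three independent learnable projections (random stand-ins here).
W_Q = torch.randn(H, d_k)
W_K = torch.randn(H, d_k)
W_V = torch.randn(H, d_v)

Q = X @ W_Q                        # (S, d_k): "what am I looking for"
K = X @ W_K                        # (S, d_k): "what can I provide"
V = X @ W_V                        # (S, d_v): "my actual content"

print(Q.shape, K.shape, V.shape)   # torch.Size([4, 3]) for each
```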

(Figure: the tokens “the”, “cat”, “sits”, “on”, “mat” projected into three subspaces: the Q space (“what am I looking for”), the K space (“what can I provide”), and the V space (“my actual content”).)

Step-by-Step Visualization

Below is a small example (S = 4, H = 6, d_k = 3) demonstrating the complete QKV linear projection process:

Input Matrix X

Input matrix X has shape (S=4, H=6). Each row represents a hidden representation vector for one token.

X ∈ ℝ^(4×6):

|      | h₁   | h₂   | h₃   | h₄   | h₅   | h₆   |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| t₁   | 0.05 | 0.11 | 0.42 | 0.03 | 0.89 | 0.59 |
| t₂   | 0.63 | 0.06 | 0.25 | 0.44 | 0.56 | 0.76 |
| t₃   | 0.32 | 0.72 | 0.77 | 0.19 | 0.43 | 0.64 |
| t₄   | 0.93 | 0.80 | 0.29 | 0.82 | 0.80 | 0.08 |

Tensor Shape Tracking

The dimension changes at each step from input to Q, K, V:

| Step | Operation | Shape | Description |
| --- | --- | --- | --- |
| 1 | Input X | (S, H) | Each row is a token’s hidden vector |
| 2 | Weight matrix W_Q | (H, d_k) | Learnable parameters |
| 3 | Q = X W_Q | (S, d_k) | Matrix multiplication: (S, H) × (H, d_k) → (S, d_k) |
| 4 | Weight matrix W_K | (H, d_k) | Another set of learnable parameters |
| 5 | K = X W_K | (S, d_k) | (S, H) × (H, d_k) → (S, d_k) |
| 6 | Weight matrix W_V | (H, d_v) | A third set of learnable parameters |
| 7 | V = X W_V | (S, d_v) | (S, H) × (H, d_v) → (S, d_v) |

With the batch dimension included, the full tensor shapes are:

Input X: (B = batch, S = seq_len, H = hidden)

After projection:

Q, K, V: (B = batch, S = seq_len, d_k = head_dim)

Using GPT-2 Small as an example (H = 768, h = 12 heads):

| Parameter | Value | Description |
| --- | --- | --- |
| H | 768 | Hidden dimension |
| h | 12 | Number of attention heads |
| d_k | 64 | Per-head dimension = 768 / 12 |

Before projection, each token is a 768-dimensional vector; after projection, it becomes a 64-dimensional Q/K/V vector (within a single head).
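The same shape bookkeeping with GPT-2 Small’s sizes and a batch dimension, as a sketch: the batch size and sequence length are arbitrary, the weights are random, and only a single head’s projection is shown.

```python
import torch

B, S, H, n_heads = 2, 10, 768, 12   # B, S arbitrary; H, n_heads from GPT-2 Small
d_k = H // n_heads                  # 64

X = torch.randn(B, S, H)            # (batch, seq_len, hidden)

# Single-head projection: each 768-dim token vector becomes a 64-dim Q vector.
W_Q = torch.randn(H, d_k)
Q = X @ W_Q

print(Q.shape)                      # torch.Size([2, 10, 64])
```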

QKV in the Multi-Head Setting

In Multi-Head Attention, the hidden dimension is split across h heads:

d_k = \frac{H}{h}

Why set d_k \times h = H? This is not a mathematical necessity but an engineering design choice:

  • Parameter conservation: The combined weight matrix is exactly H \times H, keeping the parameter count the same as a single linear layer — introducing multiple heads doesn’t add parameters
  • Implementation efficiency: A single matmul (B, S, H) \times (H, H) \to (B, S, H) followed by a reshape splits out all heads. If each head had a different d_k, reshape wouldn’t work — you’d need uneven splits, which are less efficient
  • Clean residual connections: The Attention output must be added back to the input (dimension H). Each head outputs d_v dimensions, and h heads concatenate to h \times d_v. When d_v = H/h, the concatenated dimension is exactly H, and the output projection W_O \in \mathbb{R}^{H \times H} stays a square matrix. A different concatenated dimension could still be projected back with a differently-sized W_O, but this is less elegant

Notably, later architectures like GQA (Grouped Query Attention) and MQA (Multi-Query Attention) relax the assumption that Q, K, and V must have the same number of heads — MQA shares a single K/V head across all Q heads, while GQA shares K/V within groups. However, the Q projection still satisfies d_k \times h = H; what changes is the K/V head count, reducing KV Cache size.

In practice, the implementation does not actually perform separate matrix multiplications for each head. Instead, a single large weight matrix projects all heads’ results at once, followed by a reshape:

W_Q^{\text{all}} \in \mathbb{R}^{H \times H}

Q^{\text{all}} = X W_Q^{\text{all}} \in \mathbb{R}^{S \times H}

Then Q^{\text{all}} is reshaped from (S, H) to (S, h, d_k) and transposed to (h, S, d_k), so that each head has its own Query matrix.

(B, S, H) \xrightarrow{W_Q} (B, S, H) \xrightarrow{\text{reshape}} (B, S, h, d_k) \xrightarrow{\text{transpose}} (B, h, S, d_k)
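A sketch of this combined projection and head split, again with GPT-2 Small-like sizes and random weights (variable names such as q_all are illustrative):

```python
import torch

B, S, H, n_heads = 2, 10, 768, 12
d_k = H // n_heads

X = torch.randn(B, S, H)

# One large projection produces the Queries for all heads at once.
W_Q_all = torch.randn(H, H)
q_all = X @ W_Q_all                        # (B, S, H)

# Split the last dimension into (heads, head_dim), then move heads forward.
q = q_all.reshape(B, S, n_heads, d_k)      # (B, S, h, d_k)
q = q.transpose(1, 2)                      # (B, h, S, d_k)

print(q.shape)                             # torch.Size([2, 12, 10, 64])
```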
(Figure: the original projection result Q of shape (S, H) = (4, 8), shown before head splitting; at this point all elements are still undifferentiated by head.)

PyTorch Deep Dive: How reshape and transpose work under the hood

A PyTorch tensor stores three pieces of metadata: data_ptr (starting memory address), shape (size of each dimension), and stride (number of elements to skip when moving one step along each dimension). Accessing tensor[i, j, k, l] computes the offset as i×stride[0] + j×stride[1] + k×stride[2] + l×stride[3].

For example, with shape (1, 4, 2, 3) stored contiguously, the stride is (24, 6, 3, 1):

  • reshape: On a contiguous tensor, this only modifies the shape metadata (the stride adjusts accordingly). The underlying memory doesn’t move — no data copy
  • transpose(1, 2): Swaps the size and stride of dim1 and dim2. Shape becomes (1, 2, 4, 3), stride becomes (24, 3, 6, 1) — same memory, different traversal order

How is non-contiguity detected? A contiguous tensor’s strides must satisfy stride[i] = stride[i+1] × size[i+1] (increasing from right to left). After transpose, stride (24, 3, 6, 1) has 6 ≠ 1×3, violating this condition, so PyTorch marks it as non-contiguous. When a subsequent operation requires contiguous memory (.view() is a typical example), .contiguous() must be called first — this rearranges data into a new memory block following the new dimension order, and is where the actual data copy happens.
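These metadata can be inspected directly. A small sketch using the same (1, 4, 2, 3) shape as in the example above:

```python
import torch

x = torch.arange(24.).reshape(1, 4, 2, 3)   # contiguous layout

print(x.stride())            # (24, 6, 3, 1)
print(x.is_contiguous())     # True

y = x.transpose(1, 2)        # metadata-only swap of dim1 and dim2, no copy
print(y.shape)               # torch.Size([1, 2, 4, 3])
print(y.stride())            # (24, 3, 6, 1)
print(y.is_contiguous())     # False

z = y.contiguous()           # the actual data copy happens here
print(z.stride())            # (24, 12, 3, 1)
print(z.is_contiguous())     # True
```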

The full mechanism of Multi-Head Attention will be covered in detail in a subsequent article. For now, the key takeaway is: QKV projection is the foundational step of Multi-Head Attention.

Summary

This article introduced the core concepts of Q, K, V in Transformers:

  1. Intuition: Query is the “question,” Key is the “tag,” Value is the “content” — similar to a library search
  2. Mathematical essence: Three independent linear projections Q = X W_Q, K = X W_K, V = X W_V
  3. Dimension changes: Input (S, H) is projected to (S, d_k)
  4. Why three sets: Different projections let the same input play different roles in different subspaces, giving the model greater expressiveness
  5. Multi-head extension: In practice, a single projection produces QKV for all heads at once, then split via reshape

Next, we will dive into Attention Computation in Detail, examining how Q, K, V complete information aggregation through QK^T, softmax, and weighted summation.