QKV Data Structures and Intuition
Updated 2026-04-06
Introduction: Why We Need QKV
In the previous article, we saw the core formula of Self-Attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Three matrices appear in this formula: Q (Query), K (Key), and V (Value). Where do they come from? Why do we need three different matrices instead of directly using the input itself?
The answer is: through three separate linear projections, the same input plays different roles in different subspaces. This allows the model to flexibly learn three fundamentally different functions: “what to ask,” “what to match against,” and “what information to extract.”
Intuitive Understanding: The Library Search Analogy
Imagine you walk into a library looking for information about “Transformer architecture”:
| Role | Analogy | Explanation |
|---|---|---|
| Query | The question in your mind | “I want to learn about Transformer architecture” — you search with this question |
| Key | Book tags / index cards | Each book has keyword tags: “deep learning,” “attention mechanism,” “NLP,” etc. |
| Value | The actual book content | After the tags match, what you actually read is the text inside the book |
The search process works as follows:
- Query compared with Keys: Your question (Query) is matched against each book’s tags (Key) to compute a “relevance score”
- Softmax normalization: Scores are converted into a probability distribution — the most relevant books get the highest weights
- Weighted extraction of Values: Based on relevance weights, information is extracted from each book’s content (Value)
The key insight is: the tags (Key) and the content (Value) are different representations of the same book. Tags are used for quick matching, while the content is the information you actually need. Similarly, in Attention, Keys are used to compute attention weights, and Values are used to aggregate information — they come from the same input but serve different functions through different projections.
If we skip the projections and directly use the input as Q, K, and V simultaneously, the model can only compute “self-similarity with itself,” resulting in extremely limited expressiveness. Three independent projections give the model much greater flexibility.
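To make this concrete, here is a minimal sketch (random weights and arbitrary small shapes, chosen only for illustration): without projections, the score matrix is forced to be symmetric self-similarity, while separate Q/K projections allow “i attending to j” to differ from “j attending to i.”

```python
import torch

torch.manual_seed(0)
S, H, d_k = 4, 6, 3          # arbitrary small sizes for illustration
x = torch.randn(S, H)        # one hidden vector per token

# No projections: scores are X @ X.T, a symmetric self-similarity matrix
# (token i scores token j exactly as token j scores token i).
raw_scores = x @ x.T
print(torch.allclose(raw_scores, raw_scores.T))    # True

# Separate learned projections: Q and K live in different subspaces,
# so the relevance scores are generally asymmetric.
w_q, w_k = torch.randn(H, d_k), torch.randn(H, d_k)
proj_scores = (x @ w_q) @ (x @ w_k).T
print(torch.allclose(proj_scores, proj_scores.T))  # False in general
```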
Mathematical Definition: Linear Projection Formulas and Dimensions
Input
Let the input sequence matrix after Embedding be:

$$X \in \mathbb{R}^{S \times H}$$

Where:
- $S$ is the sequence length (number of tokens)
- $H$ is the hidden dimension (hidden size)
Weight Matrices
Three sets of learnable weight matrices:

$$W_Q,\; W_K,\; W_V \in \mathbb{R}^{H \times d_k}$$

Where $d_k$ (also commonly written as $d_{head}$) is the dimension of each attention head. In the standard Transformer, typically $d_k = H / n_{heads}$.
Linear Projection
Computing Q, K, V is very straightforward — it is just matrix multiplication:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Each token’s $H$-dimensional hidden vector is projected (linearly transformed) into a new $d_k$-dimensional space. This is the so-called “linear projection.”
Why is it called “projection”? From a geometric perspective, multiplying by a weight matrix projects a vector from a high-dimensional space onto a lower-dimensional subspace. Different weight matrices define different subspaces — the Query space, Key space, and Value space.
Step-by-Step Visualization
Below is a small example ($S = 4$, $H = 6$) demonstrating the complete QKV linear projection process:
Input matrix X has shape (S=4, H=6). Each row represents a hidden representation vector for one token.
Tensor Shape Tracking
The dimension changes at each step from input to Q, K, V:
| Step | Operation | Shape | Description |
|---|---|---|---|
| 1 | Input $X$ | $(S, H)$ | Each row is a token’s hidden vector |
| 2 | Weight matrix $W_Q$ | $(H, d_k)$ | Learnable parameters |
| 3 | Matrix multiplication: $Q = XW_Q$ | $(S, d_k)$ | Each token’s Query vector |
| 4 | Weight matrix $W_K$ | $(H, d_k)$ | Another set of learnable parameters |
| 5 | Matrix multiplication: $K = XW_K$ | $(S, d_k)$ | Each token’s Key vector |
| 6 | Weight matrix $W_V$ | $(H, d_k)$ | A third set of learnable parameters |
| 7 | Matrix multiplication: $V = XW_V$ | $(S, d_k)$ | Each token’s Value vector |
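The shape tracking above can be reproduced with a short PyTorch sketch. The example fixes $S = 4$ and $H = 6$; the value $d_k = 3$ below is an assumption made purely for illustration.

```python
import torch

S, H, d_k = 4, 6, 3                  # d_k = 3 is an assumed value for this sketch
x = torch.randn(S, H)                # step 1: input X, shape (S, H)

w_q = torch.randn(H, d_k)            # steps 2, 4, 6: weight matrices, shape (H, d_k)
w_k = torch.randn(H, d_k)            # (learnable parameters in a real model)
w_v = torch.randn(H, d_k)

q = x @ w_q                          # steps 3, 5, 7: projections, shape (S, d_k)
k = x @ w_k
v = x @ w_v
print(x.shape, q.shape, k.shape, v.shape)
# torch.Size([4, 6]) torch.Size([4, 3]) torch.Size([4, 3]) torch.Size([4, 3])
```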
With the batch dimension included, the full tensor shapes are:

$$X: (B, S, H)$$

After projection:

$$Q, K, V: (B, S, d_k)$$
Using GPT-2 Small as an example ($H = 768$, $12$ heads):

| Parameter | Value | Description |
|---|---|---|
| $H$ | 768 | Hidden dimension |
| $n_{heads}$ | 12 | Number of attention heads |
| $d_k$ | 64 | Per-head dimension $= 768 / 12$ |
Before projection, each token is a 768-dimensional vector; after projection, it becomes a 64-dimensional Q/K/V vector (within a single head).
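As a sketch with these GPT-2 Small numbers (batch size and sequence length are arbitrary, and only a single head is shown; the multi-head case follows below):

```python
import torch
import torch.nn as nn

B, S = 2, 10                 # arbitrary batch size and sequence length
H, d_k = 768, 64             # GPT-2 Small: hidden size 768, per-head dimension 64

x = torch.randn(B, S, H)     # (B, S, H)

# Single-head projections, written here as bias-free linear layers.
w_q = nn.Linear(H, d_k, bias=False)
w_k = nn.Linear(H, d_k, bias=False)
w_v = nn.Linear(H, d_k, bias=False)

q, k, v = w_q(x), w_k(x), w_v(x)
print(q.shape, k.shape, v.shape)   # each: torch.Size([2, 10, 64])
```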
QKV in the Multi-Head Setting
In Multi-Head Attention, the hidden dimension is split across heads:

$$d_k = d_{head} = \frac{H}{n_{heads}}$$

Why set $d_k = H / n_{heads}$? This is not a mathematical necessity but an engineering design choice:
- Parameter conservation: The combined weight matrix is exactly $H \times H$, keeping the parameter count the same as a single linear layer — introducing multiple heads doesn’t add parameters (see the quick check after this list)
- Implementation efficiency: A single matmul followed by a reshape splits out all heads. If each head had a different $d_k$, reshape wouldn’t work — you’d need uneven splits, which are less efficient
- Clean residual connections: The Attention output must be added back to the input (dimension $H$). Each head outputs $d_k$ dimensions, and $n_{heads}$ heads concatenate to $n_{heads} \times d_k$. When $d_k = H / n_{heads}$, the concatenated dimension is exactly $H$, and the output projection $W_O$ stays a square matrix. A different concatenated dimension could still be projected back with a differently-sized $W_O$, but this is less elegant
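A quick arithmetic check of the parameter-conservation point, using the GPT-2 Small numbers from earlier:

```python
H, n_heads = 768, 12
d_head = H // n_heads            # 64

# n_heads separate Query weight matrices of shape (H, d_head)...
per_head_params = n_heads * (H * d_head)
# ...hold exactly as many parameters as one combined (H, H) matrix.
combined_params = H * H

print(per_head_params, combined_params, per_head_params == combined_params)
# 589824 589824 True
```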
Notably, later architectures like GQA (Grouped Query Attention) and MQA (Multi-Query Attention) relax the assumption that Q, K, and V must have the same number of heads — MQA shares a single K/V head across all Q heads, while GQA shares K/V within groups. However, the Q projection still satisfies $d_k = H / n_{heads}$; what changes is the K/V head count, reducing KV Cache size.
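A rough sketch of how the K/V projection width shrinks under these schemes (the head counts below are illustrative, not tied to any specific model):

```python
# Illustrative numbers: 12 Query heads, each of dimension 64 (H = 768).
n_q_heads, d_head = 12, 64

def kv_proj_width(n_kv_heads):
    # Output width of the K (or V) projection; fewer K/V heads means a smaller KV Cache.
    return n_kv_heads * d_head

print(kv_proj_width(12))  # MHA: 768 (one K/V head per Q head)
print(kv_proj_width(4))   # GQA: 256 (groups of 3 Q heads share one K/V head)
print(kv_proj_width(1))   # MQA: 64  (all 12 Q heads share a single K/V head)
```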
In practice, the implementation does not actually perform separate matrix multiplications for each head. Instead, a single large weight matrix projects all heads’ results at once, followed by a reshape:

$$Q = XW_Q, \quad W_Q \in \mathbb{R}^{H \times H}$$
Then $Q$ is reshaped from $(S, H)$ to $(S, n_{heads}, d_{head})$ and transposed to $(n_{heads}, S, d_{head})$, so each head has its own Query matrix.
Projected matrix with $(S, H) = (4, 8)$; all elements are shown in a uniform color because they are not yet differentiated by head.
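A minimal sketch of this combined-projection-then-split pattern, assuming the small dimensions from the figure above ($S = 4$, $H = 8$) and a hypothetical split into 2 heads of size 4 (with a batch dimension added):

```python
import torch
import torch.nn as nn

B, S, H = 1, 4, 8
n_heads, d_head = 2, 4                 # hypothetical split: d_head = H / n_heads

x = torch.randn(B, S, H)

# One combined Query projection for all heads (the same pattern applies to K and V).
w_q = nn.Linear(H, n_heads * d_head, bias=False)

q = w_q(x)                             # (B, S, H) -> (B, S, n_heads * d_head)
q = q.view(B, S, n_heads, d_head)      # split the last dimension into heads
q = q.transpose(1, 2)                  # (B, n_heads, S, d_head): one Query matrix per head

print(q.shape)                         # torch.Size([1, 2, 4, 4])
```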
PyTorch Deep Dive: How reshape and transpose work under the hood
A PyTorch tensor stores three pieces of metadata: `data_ptr` (starting memory address), `shape` (size of each dimension), and `stride` (number of elements to skip when moving one step along each dimension). Accessing `tensor[i, j, k, l]` computes the offset as `i×stride[0] + j×stride[1] + k×stride[2] + l×stride[3]`. For example, with shape `(1, 4, 2, 3)` stored contiguously, the stride is `(24, 6, 3, 1)`:
- reshape: Only modifies the shape metadata (stride adjusts accordingly). For a contiguous tensor like this one, the underlying memory doesn’t move — no data copy
- transpose(1, 2): Swaps the size and stride of dims 1 and 2. Shape becomes `(1, 2, 4, 3)`, stride becomes `(24, 3, 6, 1)` — same memory, different traversal order

How is non-contiguity detected? A contiguous tensor’s strides must satisfy `stride[i] = stride[i+1] × size[i+1]` (increasing from right to left). After transpose, stride `(24, 3, 6, 1)` has `6 ≠ 1×3`, violating this condition, so PyTorch marks it as non-contiguous. When subsequent operations like matrix multiplication require contiguous memory, `.contiguous()` must be called first — this rearranges data into a new memory block following the new dimension order, and is where the actual data copy happens.
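The following sketch reproduces these numbers directly in PyTorch, using the same `(1, 4, 2, 3)` shape as above:

```python
import torch

x = torch.arange(24).reshape(1, 4, 2, 3)      # contiguous tensor
print(x.stride(), x.is_contiguous())          # (24, 6, 3, 1) True

y = x.transpose(1, 2)                         # metadata-only swap of dims 1 and 2
print(y.shape)                                # torch.Size([1, 2, 4, 3])
print(y.stride(), y.is_contiguous())          # (24, 3, 6, 1) False
print(y.data_ptr() == x.data_ptr())           # True: same memory, no copy yet

z = y.contiguous()                            # the actual data copy happens here
print(z.stride(), z.data_ptr() == x.data_ptr())   # (24, 12, 3, 1) False
```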
The full mechanism of Multi-Head Attention will be covered in detail in a subsequent article. For now, the key takeaway is: QKV projection is the foundational step of Multi-Head Attention.
Summary
This article introduced the core concepts of Q, K, V in Transformers:
- Intuition: Query is the “question,” Key is the “tag,” Value is the “content” — similar to a library search
- Mathematical essence: Three independent linear projections $Q = XW_Q$, $K = XW_K$, $V = XW_V$
- Dimension changes: Input $(S, H)$ is projected to $(S, d_k)$
- Why three sets: Different projections let the same input play different roles in different subspaces, giving the model greater expressiveness
- Multi-head extension: In practice, a single projection produces QKV for all heads at once, then split via reshape
Next, we will dive into Attention Computation in Detail, examining how Q, K, V complete information aggregation through $QK^\top$, softmax, and weighted summation.