QKV Data Structures and Intuition
Updated 2026-04-06
Introduction: Why We Need QKV
In the previous article, we saw the core formula of Self-Attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Three matrices appear in this formula: Q (Query), K (Key), and V (Value). Where do they come from? Why do we need three different matrices instead of directly using the input itself?
The answer is: through three separate linear projections, the same input plays different roles in different subspaces. This allows the model to flexibly learn three fundamentally different functions: “what to ask,” “what to match against,” and “what information to extract.”
Intuitive Understanding: The Library Search Analogy
Imagine you walk into a library looking for information about “Transformer architecture”:
| Role | Analogy | Explanation |
|---|---|---|
| Query | The question in your mind | “I want to learn about Transformer architecture” — you search with this question |
| Key | Book tags / index cards | Each book has keyword tags: “deep learning,” “attention mechanism,” “NLP,” etc. |
| Value | The actual book content | After the tags match, what you actually read is the text inside the book |
The search process works as follows:
- Query compared with Keys: Your question (Query) is matched against each book’s tags (Key) to compute a “relevance score”
- Softmax normalization: Scores are converted into a probability distribution — the most relevant books get the highest weights
- Weighted extraction of Values: Based on relevance weights, information is extracted from each book’s content (Value)
The key insight is: the tags (Key) and the content (Value) are different representations of the same book. Tags are used for quick matching, while the content is the information you actually need. Similarly, in Attention, Keys are used to compute attention weights, and Values are used to aggregate information — they come from the same input but serve different functions through different projections.
If we skip the projections and directly use the input as Q, K, and V simultaneously, the model can only compute “self-similarity with itself,” resulting in extremely limited expressiveness. Three independent projections give the model much greater flexibility.
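To make this concrete, here is a minimal sketch (random weights and arbitrary small shapes, chosen only for illustration): without projections, the score matrix is forced to be symmetric self-similarity, while separate Q/K projections allow “i attending to j” to differ from “j attending to i.”

```python
import torch

torch.manual_seed(0)
S, H, d_k = 4, 6, 3          # arbitrary small sizes for illustration
x = torch.randn(S, H)        # one hidden vector per token

# No projections: scores are X @ X.T, a symmetric self-similarity matrix
# (token i scores token j exactly as token j scores token i).
raw_scores = x @ x.T
print(torch.allclose(raw_scores, raw_scores.T))    # True

# Separate learned projections: Q and K live in different subspaces,
# so the relevance scores are generally asymmetric.
w_q, w_k = torch.randn(H, d_k), torch.randn(H, d_k)
proj_scores = (x @ w_q) @ (x @ w_k).T
print(torch.allclose(proj_scores, proj_scores.T))  # False in general
```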
Mathematical Definition: Linear Projection Formulas and Dimensions
Input
Let the input sequence matrix after Embedding be:

$$X \in \mathbb{R}^{S \times H}$$

Where:
- $S$ is the sequence length (number of tokens)
- $H$ is the hidden dimension (hidden size)
Weight Matrices
Three sets of learnable weight matrices:

$$W_Q,\; W_K,\; W_V \in \mathbb{R}^{H \times d_k}$$

Where $d_k$ (also commonly written as $d_{head}$) is the dimension of each attention head. In the standard Transformer, typically $d_k = H / n_{heads}$.
Linear Projection
Computing Q, K, V is very straightforward — it is just matrix multiplication:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Each token’s $H$-dimensional hidden vector is projected (linearly transformed) into a new $d_k$-dimensional space. This is the so-called “linear projection.”
Why is it called “projection”? From a geometric perspective, multiplying by a weight matrix projects a vector from a high-dimensional space onto a lower-dimensional subspace. Different weight matrices define different subspaces — the Query space, Key space, and Value space.
Step-by-Step Visualization
Below is a small example ($S = 4$, $H = 6$) demonstrating the complete QKV linear projection process:
Input matrix X has shape (S=4, H=6). Each row represents a hidden representation vector for one token.
Tensor Shape Tracking
The dimension changes at each step from input to Q, K, V:
| Step | Operation | Shape | Description |
|---|---|---|---|
| 1 | Input $X$ | $(S, H)$ | Each row is a token’s hidden vector |
| 2 | Weight matrix $W_Q$ | $(H, d_k)$ | Learnable parameters |
| 3 | Matrix multiplication: $Q = XW_Q$ | $(S, d_k)$ | Each token’s Query vector |
| 4 | Weight matrix $W_K$ | $(H, d_k)$ | Another set of learnable parameters |
| 5 | Matrix multiplication: $K = XW_K$ | $(S, d_k)$ | Each token’s Key vector |
| 6 | Weight matrix $W_V$ | $(H, d_k)$ | A third set of learnable parameters |
| 7 | Matrix multiplication: $V = XW_V$ | $(S, d_k)$ | Each token’s Value vector |
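The shape tracking above can be reproduced with a short PyTorch sketch. The example fixes $S = 4$ and $H = 6$; the value $d_k = 3$ below is an assumption made purely for illustration.

```python
import torch

S, H, d_k = 4, 6, 3                  # d_k = 3 is an assumed value for this sketch
x = torch.randn(S, H)                # step 1: input X, shape (S, H)

w_q = torch.randn(H, d_k)            # steps 2, 4, 6: weight matrices, shape (H, d_k)
w_k = torch.randn(H, d_k)            # (learnable parameters in a real model)
w_v = torch.randn(H, d_k)

q = x @ w_q                          # steps 3, 5, 7: projections, shape (S, d_k)
k = x @ w_k
v = x @ w_v
print(x.shape, q.shape, k.shape, v.shape)
# torch.Size([4, 6]) torch.Size([4, 3]) torch.Size([4, 3]) torch.Size([4, 3])
```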
With the batch dimension included, the full tensor shapes are:

$$X: (B, S, H)$$

After projection:

$$Q, K, V: (B, S, d_k)$$
Using GPT-2 Small as an example ($H = 768$, $12$ heads):

| Parameter | Value | Description |
|---|---|---|
| $H$ | 768 | Hidden dimension |
| $n_{heads}$ | 12 | Number of attention heads |
| $d_k$ | 64 | Per-head dimension $= 768 / 12$ |
Before projection, each token is a 768-dimensional vector; after projection, it becomes a 64-dimensional Q/K/V vector (within a single head).
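As a sketch with these GPT-2 Small numbers (batch size and sequence length are arbitrary, and only a single head is shown; the multi-head case follows below):

```python
import torch
import torch.nn as nn

B, S = 2, 10                 # arbitrary batch size and sequence length
H, d_k = 768, 64             # GPT-2 Small: hidden size 768, per-head dimension 64

x = torch.randn(B, S, H)     # (B, S, H)

# Single-head projections, written here as bias-free linear layers.
w_q = nn.Linear(H, d_k, bias=False)
w_k = nn.Linear(H, d_k, bias=False)
w_v = nn.Linear(H, d_k, bias=False)

q, k, v = w_q(x), w_k(x), w_v(x)
print(q.shape, k.shape, v.shape)   # each: torch.Size([2, 10, 64])
```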
QKV in the Multi-Head Setting
In Multi-Head Attention, the hidden dimension is split across heads:

$$d_k = d_{head} = \frac{H}{n_{heads}}$$

Why set $d_k = H / n_{heads}$? This is not a mathematical necessity but an engineering design choice:
- Parameter conservation: The combined weight matrix is exactly $H \times H$, keeping the parameter count the same as a single linear layer — introducing multiple heads doesn’t add parameters (see the quick check after this list)
- Implementation efficiency: A single matmul followed by a reshape splits out all heads. If each head had a different $d_k$, reshape wouldn’t work — you’d need uneven splits, which are less efficient
- Clean residual connections: The Attention output must be added back to the input (dimension $H$). Each head outputs $d_k$ dimensions, and $n_{heads}$ heads concatenate to $n_{heads} \times d_k$. When $d_k = H / n_{heads}$, the concatenated dimension is exactly $H$, and the output projection $W_O$ stays a square matrix. A different concatenated dimension could still be projected back with a differently-sized $W_O$, but this is less elegant
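A quick arithmetic check of the parameter-conservation point, using the GPT-2 Small numbers from earlier:

```python
H, n_heads = 768, 12
d_head = H // n_heads            # 64

# n_heads separate Query weight matrices of shape (H, d_head)...
per_head_params = n_heads * (H * d_head)
# ...hold exactly as many parameters as one combined (H, H) matrix.
combined_params = H * H

print(per_head_params, combined_params, per_head_params == combined_params)
# 589824 589824 True
```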
Notably, later architectures like GQA (Grouped Query Attention) and MQA (Multi-Query Attention) relax the assumption that Q, K, and V must have the same number of heads — MQA shares a single K/V head across all Q heads, while GQA shares K/V within groups. However, the Q projection still satisfies $d_k = H / n_{heads}$; what changes is the K/V head count, reducing KV Cache size.
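A rough sketch of how the K/V projection width shrinks under these schemes (the head counts below are illustrative, not tied to any specific model):

```python
# Illustrative numbers: 12 Query heads, each of dimension 64 (H = 768).
n_q_heads, d_head = 12, 64

def kv_proj_width(n_kv_heads):
    # Output width of the K (or V) projection; fewer K/V heads means a smaller KV Cache.
    return n_kv_heads * d_head

print(kv_proj_width(12))  # MHA: 768 (one K/V head per Q head)
print(kv_proj_width(4))   # GQA: 256 (groups of 3 Q heads share one K/V head)
print(kv_proj_width(1))   # MQA: 64  (all 12 Q heads share a single K/V head)
```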
In practice, the implementation does not actually perform separate matrix multiplications for each head. Instead, a single large weight matrix projects all heads’ results at once, followed by a reshape:

$$Q = XW_Q, \quad W_Q \in \mathbb{R}^{H \times H}$$
Then $Q$ is reshaped from $(S, H)$ to $(S, n_{heads}, d_{head})$ and transposed to $(n_{heads}, S, d_{head})$, so each head has its own Query matrix.
Projected matrix with $(S, H) = (4, 8)$; all elements are shown in a uniform color because they are not yet differentiated by head.
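A minimal sketch of this combined-projection-then-split pattern, assuming the small dimensions from the figure above ($S = 4$, $H = 8$) and a hypothetical split into 2 heads of size 4 (with a batch dimension added):

```python
import torch
import torch.nn as nn

B, S, H = 1, 4, 8
n_heads, d_head = 2, 4                 # hypothetical split: d_head = H / n_heads

x = torch.randn(B, S, H)

# One combined Query projection for all heads (the same pattern applies to K and V).
w_q = nn.Linear(H, n_heads * d_head, bias=False)

q = w_q(x)                             # (B, S, H) -> (B, S, n_heads * d_head)
q = q.view(B, S, n_heads, d_head)      # split the last dimension into heads
q = q.transpose(1, 2)                  # (B, n_heads, S, d_head): one Query matrix per head

print(q.shape)                         # torch.Size([1, 2, 4, 4])
```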
PyTorch Deep Dive: How reshape and transpose work under the hood
A PyTorch tensor stores three pieces of metadata: `data_ptr` (starting memory address), `shape` (size of each dimension), and `stride` (number of elements to skip when moving one step along each dimension). Accessing `tensor[i, j, k, l]` computes the offset as `i×stride[0] + j×stride[1] + k×stride[2] + l×stride[3]`. For example, with shape `(1, 4, 2, 3)` stored contiguously, the stride is `(24, 6, 3, 1)`:
- reshape: Only modifies the shape metadata (stride adjusts accordingly). For a contiguous tensor like this one, the underlying memory doesn’t move — no data copy
- transpose(1, 2): Swaps the size and stride of dims 1 and 2. Shape becomes `(1, 2, 4, 3)`, stride becomes `(24, 3, 6, 1)` — same memory, different traversal order

How is non-contiguity detected? A contiguous tensor’s strides must satisfy `stride[i] = stride[i+1] × size[i+1]` (increasing from right to left). After transpose, stride `(24, 3, 6, 1)` has `6 ≠ 1×3`, violating this condition, so PyTorch marks it as non-contiguous. When subsequent operations like matrix multiplication require contiguous memory, `.contiguous()` must be called first — this rearranges data into a new memory block following the new dimension order, and is where the actual data copy happens.
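The following sketch reproduces these numbers directly in PyTorch, using the same `(1, 4, 2, 3)` shape as above:

```python
import torch

x = torch.arange(24).reshape(1, 4, 2, 3)      # contiguous tensor
print(x.stride(), x.is_contiguous())          # (24, 6, 3, 1) True

y = x.transpose(1, 2)                         # metadata-only swap of dims 1 and 2
print(y.shape)                                # torch.Size([1, 2, 4, 3])
print(y.stride(), y.is_contiguous())          # (24, 3, 6, 1) False
print(y.data_ptr() == x.data_ptr())           # True: same memory, no copy yet

z = y.contiguous()                            # the actual data copy happens here
print(z.stride(), z.data_ptr() == x.data_ptr())   # (24, 12, 3, 1) False
```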
The full mechanism of Multi-Head Attention will be covered in detail in a subsequent article. For now, the key takeaway is: QKV projection is the foundational step of Multi-Head Attention.
Summary
This article introduced the core concepts of Q, K, V in Transformers:
- Intuition: Query is the “question,” Key is the “tag,” Value is the “content” — similar to a library search
- Mathematical essence: Three independent linear projections $Q = XW_Q$, $K = XW_K$, $V = XW_V$
- Dimension changes: Input $(S, H)$ is projected to $(S, d_k)$
- Why three sets: Different projections let the same input play different roles in different subspaces, giving the model greater expressiveness
- Multi-head extension: In practice, a single projection produces QKV for all heads at once, then split via reshape
Next, we will dive into Attention Computation in Detail, examining how Q, K, V complete information aggregation through $QK^\top$, softmax, and weighted summation.