Attention Computation in Detail
Updated 2026-04-06
Introduction: Attention Is the Core of the Transformer
In the previous article, we learned how the Q, K, V matrices are obtained through linear projection. This article dissects the computation of Attention in detail, starting from Q, K, V and deriving the final output step by step.
The essence of the Attention mechanism is a differentiable soft retrieval: Queries are matched against all Keys, and a weighted average of the Values is computed based on the degree of matching. Each token's output is no longer fixed but is dynamically aggregated based on context.
The Complete Formula: Scaled Dot-Product Attention
The form of Attention used in Transformers is called Scaled Dot-Product Attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where:
- $Q \in \mathbb{R}^{S \times d_k}$: Query matrix
- $K \in \mathbb{R}^{S \times d_k}$: Key matrix
- $V \in \mathbb{R}^{S \times d_v}$: Value matrix (typically $d_v = d_k$)
- $d_k$: dimension of each attention head
- $\sqrt{d_k}$: scaling factor to prevent dot products from growing too large
This formula appears concise but contains five key steps. Let us break them down one by one.
(The input is the hidden representation of the sequence; for simplicity, the shapes below omit the batch (B) and multi-head (h) dimensions.)
Step-by-Step Breakdown: The Mathematical Meaning of Each Step
Step 1: $QK^\top$ (Computing Raw Attention Scores)
This is a matrix multiplication: $Q$ has shape $(S, d_k)$, $K^\top$ has shape $(d_k, S)$, and the result is $(S, S)$.
Intuition: Entry $(i, j)$ in the result matrix is the dot product of Query vector $q_i$ and Key vector $k_j$:

$$\text{score}_{ij} = q_i \cdot k_j = \sum_{m=1}^{d_k} q_{im}\, k_{jm}$$

The dot product measures the "similarity" between two vectors: the larger the value, the more attention token $i$ pays to token $j$.
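As a quick sanity check of this step, here is a minimal PyTorch sketch (the toy sizes and tensor names are illustrative): entry $(i, j)$ of `Q @ K.T` equals the explicit dot product of query $i$ with key $j$.

```python
import torch

S, d_k = 4, 3                      # toy sizes: 4 tokens, head dimension 3
Q = torch.randn(S, d_k)            # one query vector per row
K = torch.randn(S, d_k)            # one key vector per row

scores = Q @ K.T                   # raw attention scores, shape (S, S)

# Entry (i, j) is exactly the dot product of query i with key j.
i, j = 1, 2
assert torch.allclose(scores[i, j], Q[i] @ K[j])
```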
Step 2: Divide by $\sqrt{d_k}$ (Scaling)
Why is scaling needed? This is not an arbitrary design choice but is based on rigorous statistical analysis. See the "Necessity of Scaling" section below for details.
Step 3: Add Mask $M$ (Optional)
In the Decoder's self-attention, token $i$ cannot see tokens after position $i$ (because those tokens do not yet exist during autoregressive generation). By setting the upper triangle of the score matrix to $-\infty$, the corresponding weights become 0 after softmax. See the "Causal Mask" section below for details.
Step 4: Softmax (Row Normalization)
Softmax is applied independently to each row of the score matrix, converting raw scores into a probability distribution (non-negative and summing to 1). We denote the resulting attention weight matrix by $A$.
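A small sketch (with illustrative random values) confirming that each row of the softmax output is a valid probability distribution:

```python
import torch

scores = torch.randn(4, 4)                  # scaled (and possibly masked) scores
A = torch.softmax(scores, dim=-1)           # normalize each row independently

assert (A >= 0).all()                                 # weights are non-negative
assert torch.allclose(A.sum(dim=-1), torch.ones(4))   # each row sums to 1
```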
Step 5: Multiply by $V$ (Weighted Summation)
The weight matrix $A$ multiplied by the Value matrix $V$ produces the final output $O = AV$ with shape $(S, d_v)$. Each token's output is a weighted average of all Value vectors:

$$o_i = \sum_{j=1}^{S} A_{ij}\, v_j$$
Worked Example: The Full Attention Computation Process
Below is a small example ($S = 4$, $d_k = 3$) demonstrating the complete computation across all five steps; see the sketch after this paragraph.
From the previous linear projection step, we have obtained Q and K matrices, both with shape $(S=4, d_k=3)$. Next we compute attention scores between them.
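Here is a minimal PyTorch sketch of the full five-step computation for these toy shapes. The function name and random inputs are illustrative, not a reference implementation:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, causal=False):
    """All five steps for a single head; batch and head dimensions omitted."""
    d_k = Q.shape[-1]
    scores = Q @ K.T                                 # Step 1: raw scores, (S, S)
    scores = scores / math.sqrt(d_k)                 # Step 2: scale
    if causal:                                       # Step 3: optional causal mask
        S = scores.shape[0]
        mask = torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    A = torch.softmax(scores, dim=-1)                # Step 4: row-wise normalization
    return A @ V                                     # Step 5: weighted sum of Values

S, d_k = 4, 3
Q, K, V = torch.randn(S, d_k), torch.randn(S, d_k), torch.randn(S, d_k)
out = scaled_dot_product_attention(Q, K, V, causal=True)
print(out.shape)  # torch.Size([4, 3])
```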
Necessity of Scaling: Why Divide by $\sqrt{d_k}$
This is one of the most frequently asked questions in interviews and coursework. The original paper (Vaswani et al., 2017) provides a clear explanation:
Statistical Analysis
Assume each component of $q$ and $k$ is an independent random variable with mean 0 and variance 1. Then their dot product

$$q \cdot k = \sum_{m=1}^{d_k} q_m k_m$$

has the following statistical properties:

$$\mathbb{E}[q \cdot k] = 0, \qquad \text{Var}(q \cdot k) = d_k$$

Variance derivation: The variance of each term $q_m k_m$ is 1 (since the variance of the product of independent zero-mean random variables equals the product of their variances), and summing $d_k$ independent terms gives a total variance of $d_k$.
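A quick empirical check of this derivation (the sample count is arbitrary):

```python
import torch

d_k, n = 128, 10_000
q = torch.randn(n, d_k)                  # components are i.i.d. N(0, 1)
k = torch.randn(n, d_k)

dots = (q * k).sum(dim=-1)               # n independent dot products
print(dots.mean().item())                # close to 0
print(dots.var().item())                 # close to d_k = 128
print((dots / d_k ** 0.5).var().item())  # close to 1 after scaling
```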
The Problem
When $d_k$ is large (e.g., GPT-3 uses $d_k = 128$), the standard deviation of the dot product is approximately $\sqrt{d_k} \approx 11.3$. This means the softmax inputs become very large, leading to:
- Softmax output approaches one-hot: almost all weight concentrates on the position with the largest score
- Gradients nearly vanish: In the saturated region of softmax, gradients approach 0, preventing the model from learning effectively
The Solution
After dividing by $\sqrt{d_k}$, the variance of the dot product is restored to 1:

$$\text{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{(\sqrt{d_k})^2} = 1$$
This keeps the softmax inputs within a reasonable range, ensuring smooth gradient flow and stable training.
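The effect is easy to see numerically. In this illustrative sketch, softmax over unscaled scores (std $\approx 11.3$ for $d_k = 128$) collapses toward one-hot, while the scaled version stays smooth:

```python
import torch

torch.manual_seed(0)
d_k, S = 128, 8
q = torch.randn(S, d_k)
k = torch.randn(S, d_k)

raw = q @ k.T                            # entries have std ~ sqrt(128) ~ 11.3
scaled = raw / d_k ** 0.5                # entries have std ~ 1

print(torch.softmax(raw, dim=-1)[0])     # first row: nearly one-hot
print(torch.softmax(scaled, dim=-1)[0])  # first row: smooth distribution
```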
Original paper quote: "We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients."
Causal Mask: The Decoder's Causal Masking
Why Masking Is Needed
In autoregressive language models, when computing the representation of token $i$, only tokens $1, \dots, i$ are visible; tokens $i+1, \dots, S$ cannot be seen (because they have not been generated yet).
During training, for parallelization, we feed the entire sequence at once but need to simulate the โcannot see the futureโ constraint through masking.
The Mask Matrix
The mask $M$ is added to the scaled scores:

$$\text{scores}_{\text{masked}} = \frac{QK^\top}{\sqrt{d_k}} + M, \qquad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$

Since $e^{-\infty} = 0$, the weights at masked positions become 0 after softmax.
Shape of the Mask
For sequence length $S = 4$:

$$M = \begin{pmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{pmatrix}$$

The kept positions form a lower triangular pattern: row $i$ retains only the scores for the first $i$ positions.
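In PyTorch, one common way to build this mask is with `torch.triu`; a minimal sketch:

```python
import torch

S = 4
# Keep -inf strictly above the diagonal; everything else becomes 0.
M = torch.triu(torch.full((S, S), float("-inf")), diagonal=1)
print(M)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```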
Masking Strategies for Different Scenarios
| Scenario | Mask Type | Description |
|---|---|---|
| Encoder Self-Attention | No mask or padding mask | Bidirectional attention, can see the entire sequence |
| Decoder Self-Attention | Causal mask | Can only see current and preceding tokens |
| Cross-Attention | Padding mask | Decoder queries Encoder output, no causal constraint |
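For completeness, here is one common way to apply a padding mask; the pad token id and shapes are assumptions for illustration:

```python
import torch

# Batch of 2 sequences padded to length 5; assume id 0 is the padding token.
token_ids = torch.tensor([[5, 7, 2, 0, 0],
                          [3, 9, 4, 8, 1]])
is_pad = token_ids == 0                           # True at padded key positions

scores = torch.randn(2, 5, 5)                     # (batch, query, key) scores
scores = scores.masked_fill(is_pad[:, None, :], float("-inf"))
A = torch.softmax(scores, dim=-1)
print(A[0, 0])                                    # last two weights are exactly 0
```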
Numerical Stability of Softmax
The Overflow Problem
A naive softmax implementation computes:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

When $x_i$ is very large (in float32, anything above roughly 88, since $e^{89} > 3.4 \times 10^{38}$), $e^{x_i}$ exceeds the representable range of floating-point numbers, causing numerical overflow (producing Inf or NaN).
The Standard Trick: Subtract the Maximum

$$\text{softmax}(x_i) = \frac{e^{x_i - x_{\max}}}{\sum_j e^{x_j - x_{\max}}}, \qquad x_{\max} = \max_j x_j$$

This is mathematically equivalent (multiplying both numerator and denominator by $e^{-x_{\max}}$), but ensures the exponent inputs are $\le 0$, so $e^{x_i - x_{\max}} \le 1$, preventing overflow.
Proof of equivalence:

$$\frac{e^{x_i - c}}{\sum_j e^{x_j - c}} = \frac{e^{x_i}\, e^{-c}}{e^{-c} \sum_j e^{x_j}} = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

where $c = x_{\max}$.
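A short demonstration of both variants (the input values are chosen to force float32 overflow):

```python
import torch

x = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive: exp() overflows float32, and inf / inf yields NaN.
naive = torch.exp(x) / torch.exp(x).sum()
print(naive)                           # tensor([nan, nan, nan])

# Stable: subtract the maximum first, so exponent inputs are <= 0.
shifted = x - x.max()
stable = torch.exp(shifted) / torch.exp(shifted).sum()
print(stable)                          # tensor([0.0900, 0.2447, 0.6652])
```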
Implementation in Practice
All major deep learning frameworks (PyTorch, JAX, TensorFlow) have this trick built into their softmax implementations. In optimized implementations like Flash Attention, maintaining numerical stability during tiled computation is a more complex problem that we will discuss in subsequent articles.
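For example, PyTorch 2.x exposes a fused version as `torch.nn.functional.scaled_dot_product_attention`; a quick sanity check that it matches the manual five-step computation:

```python
import math
import torch
import torch.nn.functional as F

Q = torch.randn(1, 1, 4, 8)            # (batch, heads, seq_len, head_dim)
K = torch.randn(1, 1, 4, 8)
V = torch.randn(1, 1, 4, 8)

# Manual five-step computation.
A = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(8), dim=-1)
manual = A @ V

# PyTorch's built-in fused implementation (available since PyTorch 2.0).
fused = F.scaled_dot_product_attention(Q, K, V)

assert torch.allclose(manual, fused, atol=1e-6)
```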
Summary
The computation of Scaled Dot-Product Attention can be decomposed into five clear steps:
| Step | Operation | Output Shape | Purpose |
|---|---|---|---|
| 1 | $QK^\top$ | $(S, S)$ | Compute similarity between all token pairs |
| 2 | $\div\ \sqrt{d_k}$ | $(S, S)$ | Prevent large dot products from causing vanishing gradients |
| 3 | $+\ M$ (mask) | $(S, S)$ | Mask positions that should not be attended to |
| 4 | Softmax | $(S, S)$ | Normalize scores into a probability distribution |
| 5 | $\times\ V$ | $(S, d_v)$ | Aggregate Values by attention weights |
Core intuition: Attention is essentially a "soft addressing" mechanism in which each token uses its Query to match against all Keys and extracts information from all Values based on match quality. Scaling ensures training stability, and masking ensures causality.
The next article will introduce Multi-Head Attention: how multiple attention heads operate in parallel and combine to further enhance the model's expressiveness.