
Attention Computation in Detail

Updated 2026-04-06

Introduction: Attention Is the Core of the Transformer

In the previous article, we learned how the Q, K, V matrices are obtained through linear projection. This article will dissect the computation process of Attention in detail, starting from Q, K, V and deriving the final output step by step.

The essence of the Attention mechanism is a differentiable soft retrieval: using Queries to match all Keys, then computing a weighted average of Values based on the degree of matching. Each token's output is no longer fixed but dynamically aggregated based on context.

The Complete Formula: Scaled Dot-Product Attention

The form of Attention used in Transformers is called Scaled Dot-Product Attention:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Where:

  • Q \in \mathbb{R}^{S \times d_k}: Query matrix
  • K \in \mathbb{R}^{S \times d_k}: Key matrix
  • V \in \mathbb{R}^{S \times d_v}: Value matrix (typically d_v = d_k)
  • d_k: dimension of each attention head
  • \sqrt{d_k}: scaling factor to prevent dot products from growing too large

This formula appears concise but contains five key steps. Let us break them down one by one.
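The whole formula fits in a few lines of code. Below is a minimal PyTorch sketch (function and variable names are ours, not from the original; a plain reference implementation, not an optimized one):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Q, K: (S, d_k); V: (S, d_v); mask: optional (S, S) additive mask (0 or -inf).
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (S, S) similarities, scaled
    if mask is not None:
        scores = scores + mask                         # -inf blocks forbidden positions
    weights = torch.softmax(scores, dim=-1)            # each row becomes a distribution
    return weights @ V                                 # (S, d_v) weighted average of Values

# Toy usage with random inputs
S, d_k, d_v = 4, 3, 3
Q, K, V = torch.randn(S, d_k), torch.randn(S, d_k), torch.randn(S, d_v)
out = scaled_dot_product_attention(Q, K, V)            # shape (4, 3)
```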

[Diagram: Input X (S, H) → Q = X·Wq, K = X·Wk, V = X·Wv (S, H) → reshape (S, d_k) → Q·Kᵀ (S, S) → ÷ √d_k (S, S) → + mask (S, S) → softmax (S, S) → × V (S, d_v) → concat (S, H) → × Wo (S, H). Input X is the hidden representation of the input sequence; simple mode omits the batch (B) and multi-head (h) dimensions.]

Step-by-Step Breakdown: The Mathematical Meaning of Each Step

Step 1: QK^T - Computing Raw Attention Scores

\text{Scores} = QK^T \in \mathbb{R}^{S \times S}

This is a matrix multiplication: Q has shape (S, d_k), K^T has shape (d_k, S), and the result is (S, S).

Intuition: Entry (i, j) in the result matrix is the dot product of Query vector q_i and Key vector k_j:

\text{Scores}_{ij} = q_i \cdot k_j = \sum_{l=1}^{d_k} q_{il} \cdot k_{jl}

The dot product measures the "similarity" between two vectors: the larger the value, the more attention token i pays to token j.
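As a quick sanity check of this per-entry view (a small sketch with random tensors; the names are illustrative), entry (i, j) of Q @ K.T is exactly the dot product of row i of Q with row j of K:

```python
import torch

Q, K = torch.randn(4, 3), torch.randn(4, 3)   # (S, d_k)
scores = Q @ K.T                               # (S, S): one score per (query, key) pair

i, j = 2, 1
assert torch.allclose(scores[i, j], torch.dot(Q[i], K[j]))
```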

Step 2: Divide by \sqrt{d_k} - Scaling

\text{Scaled} = \frac{QK^T}{\sqrt{d_k}}

Why is scaling needed? This is not an arbitrary design choice but is based on rigorous statistical analysis. See the "Necessity of Scaling" section below for details.

Step 3: Mask - Masking (Optional)

\text{Masked}_{ij} = \begin{cases} \text{Scaled}_{ij} & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}

In the Decoder's self-attention, token i cannot see tokens after position i (because those tokens do not yet exist during autoregressive generation). By setting the upper triangle to -\infty, the corresponding weights become 0 after softmax. See the "Causal Mask" section below for details.

Step 4: Softmax - Row Normalization

\text{Weights}_{ij} = \text{softmax}(\text{Masked}_i)_j = \frac{e^{\text{Masked}_{ij}}}{\sum_{j'} e^{\text{Masked}_{ij'}}}

Softmax is applied independently to each row of the score matrix, converting raw scores into a probability distribution (non-negative and summing to 1).
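In code this is simply a softmax over the last dimension; a small illustrative check that every row of weights is then a valid distribution:

```python
import torch

scores = torch.randn(4, 4)                   # scaled (and possibly masked) scores
weights = torch.softmax(scores, dim=-1)      # normalize each row independently

assert torch.allclose(weights.sum(dim=-1), torch.ones(4))   # each row sums to 1
assert (weights >= 0).all()                                  # and is non-negative
```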

Step 5: Multiply by V - Weighted Summation

\text{Output} = \text{Weights} \cdot V \in \mathbb{R}^{S \times d_v}

The weight matrix (S, S) multiplied by the Value matrix (S, d_v) produces the final output (S, d_v). Each token's output is a weighted average of all Value vectors:

\text{Output}_i = \sum_{j=1}^{S} \text{Weights}_{ij} \cdot v_j
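The matrix product Weights · V performs this weighted sum for every row at once; a short illustrative check for a single row i:

```python
import torch

weights = torch.softmax(torch.randn(4, 4), dim=-1)     # (S, S) attention weights
V = torch.randn(4, 3)                                   # (S, d_v)

output = weights @ V                                    # (S, d_v)
i = 0
manual = sum(weights[i, j] * V[j] for j in range(4))    # explicit weighted average
assert torch.allclose(output[i], manual)
```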

Interactive Animation: The Full Attention Computation Process

Below is a small example (S=4, d_k=3) demonstrating the complete computation across all five steps.

Q and K Matrices

From the previous linear projection step, we have obtained Q and K matrices, both with shape (S=4, d_k=3). Next we compute attention scores between them.

Q = \begin{pmatrix} 0.11 & 0.77 & 0.08 \\ -0.55 & -0.28 & 0.24 \\ -0.79 & 0.97 & -0.74 \\ -0.72 & 0.95 & 0.36 \end{pmatrix} \in \mathbb{R}^{4 \times 3}, \quad
K = \begin{pmatrix} -0.89 & -0.34 & -0.17 \\ -0.17 & 0.81 & -0.10 \\ 0.86 & -0.99 & 0.30 \\ 0.10 & -0.91 & 0.92 \end{pmatrix} \in \mathbb{R}^{4 \times 3}

(Rows correspond to tokens t₁ through t₄, columns to dimensions d₁ through d₃.)
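Since the interactive walkthrough cannot be reproduced here, the following sketch runs all five steps on these exact Q and K values (V is not shown above, so a random placeholder V is used; variable names are ours):

```python
import math
import torch

Q = torch.tensor([[ 0.11,  0.77,  0.08],
                  [-0.55, -0.28,  0.24],
                  [-0.79,  0.97, -0.74],
                  [-0.72,  0.95,  0.36]])
K = torch.tensor([[-0.89, -0.34, -0.17],
                  [-0.17,  0.81, -0.10],
                  [ 0.86, -0.99,  0.30],
                  [ 0.10, -0.91,  0.92]])
V = torch.randn(4, 3)                                   # placeholder: V is not listed above

scores = Q @ K.T / math.sqrt(3)                         # steps 1-2: dot products, scaled
mask = torch.triu(torch.full((4, 4), float("-inf")), diagonal=1)
weights = torch.softmax(scores + mask, dim=-1)          # steps 3-4: causal mask + softmax
output = weights @ V                                    # step 5: weighted sum of Values
print(weights)                                          # row i attends only to positions <= i
```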

Necessity of Scaling: Why Divide by \sqrt{d_k}

This is one of the most frequently asked questions in interviews and study. The original paper (Vaswani et al., 2017) provides a clear explanation:

Statistical Analysis

Assume each component of q and k is an independent random variable with mean 0 and variance 1. Then their dot product:

q \cdot k = \sum_{l=1}^{d_k} q_l \cdot k_l

has the following statistical properties:

\mathbb{E}[q \cdot k] = 0, \quad \text{Var}(q \cdot k) = d_k

Variance derivation: The variance of each q_l \cdot k_l is \text{Var}(q_l) \cdot \text{Var}(k_l) = 1 (for independent zero-mean random variables, the variance of the product equals the product of the variances), and summing d_k independent terms gives a total variance of d_k.

The Problem

When d_k is large (e.g., GPT-3 uses d_k = 128 per head), the typical magnitude (standard deviation) of the dot product is about \sqrt{128} \approx 11.3. This means the softmax inputs become very large, leading to:

  1. Softmax output approaches one-hot: \text{softmax}([10, 1, 1]) \approx [0.9999, 0.0001, 0.0001]
  2. Gradients nearly vanish: In the saturated region of softmax, gradients approach 0, preventing the model from learning effectively

The Solution

After dividing by \sqrt{d_k}, the variance of the dot product is restored to 1:

\text{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1

This keeps the softmax inputs within a reasonable range, ensuring smooth gradient flow and stable training.
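A quick Monte-Carlo check of this argument (the choice of d_k = 128 and the sample count are arbitrary): the variance of raw dot products is close to d_k, and dividing by \sqrt{d_k} brings it back to roughly 1.

```python
import torch

d_k, n = 128, 100_000
q = torch.randn(n, d_k)                  # components: mean 0, variance 1
k = torch.randn(n, d_k)

dots = (q * k).sum(dim=-1)               # n independent dot products q . k
print(dots.var().item())                 # ~128, i.e. ~d_k
print((dots / d_k ** 0.5).var().item())  # ~1 after scaling by sqrt(d_k)
```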

[Visualization: unscaled scores QKᵀ have variance 5.14 and softmax entropy 2.056 bits; scaled scores QKᵀ/√d_k have variance 0.08 and softmax entropy 2.947 bits. Observation: larger d means higher variance in the unscaled scores, pushing the softmax output toward one-hot (entropy → 0); dividing by √d restores the variance to roughly 1, so the softmax output stays close to uniform (entropy ≈ 3.0 bits).]

Original paper quote: "We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients."

Causal Mask: The Decoder's Causal Masking

Why Masking Is Needed

In autoregressive language models, when generating token i, only tokens 1, 2, \ldots, i are visible; tokens i+1, i+2, \ldots cannot be seen (because they have not been generated yet).

During training, for parallelization, we feed the entire sequence at once but need to simulate the "cannot see the future" constraint through masking.

The Mask Matrix

M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}

The mask is added to the scaled scores: \text{Masked} = \text{Scaled} + M

Since e^{-\infty} = 0, the weights at masked positions become 0 after softmax.

Shape of the Mask

For sequence length S = 4:

M = \begin{pmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{pmatrix}

This is a lower triangular mask: row i retains only the scores for the first i positions.
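In code, the same effect is usually obtained by building a boolean upper-triangular mask and filling those positions with -inf (a sketch; equivalent to adding the additive matrix M above):

```python
import torch

S = 4
# True strictly above the diagonal = "future" positions that must be hidden
future = torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=1)

scores = torch.randn(S, S)                          # scaled scores
masked = scores.masked_fill(future, float("-inf"))  # same effect as Scaled + M
weights = torch.softmax(masked, dim=-1)             # upper-triangle weights are now 0
```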

Raw QKᵀ Score Matrix

Raw dot product scores for the example sentence "The cat sat on it": each cell shows token i's relevance to token j.

Scores = QKᵀ/√d_k:

|     | The   | cat   | sat   | on    | it    |
|-----|-------|-------|-------|-------|-------|
| The | 1.50  | 2.73  | -0.75 | 2.43  | -0.21 |
| cat | 0.09  | 1.68  | 1.65  | 0.06  | 0.24  |
| sat | -0.06 | -1.26 | -0.33 | -0.15 | -2.16 |
| on  | -0.66 | -1.20 | -0.63 | -0.54 | 0.18  |
| it  | 0.87  | 2.07  | -2.76 | 1.56  | 0.66  |

Masking Strategies for Different Scenarios

| Scenario | Mask Type | Description |
|----------|-----------|-------------|
| Encoder Self-Attention | No mask or padding mask | Bidirectional attention, can see the entire sequence |
| Decoder Self-Attention | Causal mask | Can only see current and preceding tokens |
| Cross-Attention | Padding mask | Decoder queries Encoder output, no causal constraint |
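As a sketch of how the two mask types differ in code (the sequence length and padding length here are made-up examples): a padding mask hides certain key positions for every query, while a causal mask depends on the query position; a Decoder's self-attention can combine both.

```python
import torch

S, valid_len = 4, 3                                  # example: the last token is padding

# Padding mask: hide padded key positions for all queries
key_is_pad = torch.arange(S) >= valid_len            # (S,) boolean over key positions
padding_mask = key_is_pad.expand(S, S)               # (S, S): same row repeated per query

# Causal mask: hide future key positions, depends on the query index
causal_mask = torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=1)

scores = torch.randn(S, S)
masked = scores.masked_fill(padding_mask | causal_mask, float("-inf"))
```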

Numerical Stability of Softmax

The Overflow Problem

A naive softmax implementation:

\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}

When x_i is very large (e.g., x_i = 1000), e^{1000} exceeds the representable range of floating-point numbers, causing numerical overflow (producing Inf or NaN).

The Standard Trick: Subtract the Maximum

\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}

This is mathematically equivalent (multiplying both numerator and denominator by e^{-\max(x)}), but it ensures the exponent inputs are \leq 0, so e^{x_i - \max(x)} \leq 1, preventing overflow.

Proof of equivalence:

\frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{x_i} \cdot e^{-m}}{\sum_j e^{x_j} \cdot e^{-m}} = \frac{e^{x_i}}{\sum_j e^{x_j}}

Where m = \max(x).
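A direct translation of this trick (an illustrative sketch; in practice you would simply call the framework's built-in softmax):

```python
import torch

def stable_softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    m = x.max(dim=dim, keepdim=True).values   # m = max(x) along the softmax axis
    z = torch.exp(x - m)                      # exponents <= 0, so every term <= 1
    return z / z.sum(dim=dim, keepdim=True)

x = torch.tensor([1000.0, 999.0, 998.0])
print(stable_softmax(x))                       # finite probabilities that sum to 1
print(torch.exp(x) / torch.exp(x).sum())       # naive version overflows to inf/inf = nan
```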

Implementation in Practice

All major deep learning frameworks (PyTorch, JAX, TensorFlow) have this trick built into their softmax implementations. In optimized implementations like Flash Attention, maintaining numerical stability during tiled computation is a more complex problem that we will discuss in subsequent articles.
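For reference, recent PyTorch releases (2.0 and later, to the best of our knowledge) expose the whole computation, including scaling, optional causal masking, softmax, and the weighted sum, as a single fused call; a minimal usage sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

B, h, S, d_k = 1, 1, 4, 3                       # batch, heads, sequence length, head dim
Q = torch.randn(B, h, S, d_k)
K = torch.randn(B, h, S, d_k)
V = torch.randn(B, h, S, d_k)

# Fused scaled dot-product attention with a causal mask applied internally
out = F.scaled_dot_product_attention(Q, K, V, is_causal=True)   # (B, h, S, d_k)
```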

Summary

The computation of Scaled Dot-Product Attention can be decomposed into five clear steps:

| Step | Operation | Output Shape | Purpose |
|------|-----------|--------------|---------|
| 1 | QKᵀ | (S, S) | Compute similarity between all token pairs |
| 2 | ÷ √d_k | (S, S) | Prevent large dot products from causing vanishing gradients |
| 3 | + Mask | (S, S) | Mask positions that should not be attended to |
| 4 | Softmax | (S, S) | Normalize into a probability distribution |
| 5 | × V | (S, d_v) | Aggregate Values by attention weights |

Core intuition: Attention is essentially a "soft addressing" mechanism: each token uses its Query to match against all Keys and extracts information from all Values based on match quality. Scaling ensures training stability, and masking ensures causality.

The next article will introduce Multi-Head Attention: how multiple attention heads operate in parallel and combine to further enhance the model's expressiveness.