
Flash Attention Tiling Principles

Updated 2026-04-06

Introduction: Why Memory Is the Bottleneck in Standard Attention

In previous articles, we learned about the computation process of Scaled Dot-Product Attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

The standard implementation requires three steps:

  1. Compute $S = QK^T \in \mathbb{R}^{N \times N}$ — store to HBM
  2. Compute $P = \text{softmax}(S) \in \mathbb{R}^{N \times N}$ — store to HBM
  3. Compute $O = PV \in \mathbb{R}^{N \times d}$ — store to HBM

The problem lies in the intermediate matrices $S$ and $P$, both of size $N \times N$. When the sequence length $N$ is large (e.g., $N = 4096$), these two matrices require $2 \times 4096^2 \times 2\,\text{bytes} \approx 64\,\text{MB}$ of memory in fp16. More critically, these matrices must be repeatedly read from and written to the GPU's HBM (High Bandwidth Memory), whose bandwidth is far lower than the GPU's compute throughput.

Flash Attention (Dao et al., 2022) introduced a key insight: through tiled computation and Online Softmax, we can completely avoid storing the $N \times N$ intermediate matrices, reducing memory from $O(N^2)$ to $O(N)$ while dramatically reducing the number of HBM accesses.
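For concreteness, the three steps look like this in a minimal NumPy sketch (illustrative only; the naive version below materializes the full N×N matrices S and P, which is exactly the problem):

```python
import numpy as np

def standard_attention(Q, K, V):
    """Naive scaled dot-product attention; materializes the N x N matrices S and P."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                 # (N, N) score matrix, written to HBM
    S_max = S.max(axis=-1, keepdims=True)    # numerically stable softmax
    P = np.exp(S - S_max)
    P = P / P.sum(axis=-1, keepdims=True)    # (N, N) probabilities, written to HBM
    return P @ V                             # (N, d) output
```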

GPU Memory Hierarchy: SRAM vs HBM

To understand Flash Attention’s design motivation, we must first understand the GPU’s memory hierarchy.

Two-Level Storage

| Storage Level | Type | A100 Capacity | Bandwidth | Characteristics |
| --- | --- | --- | --- | --- |
| SRAM (on-chip cache) | Registers + Shared Memory | ~20 MB (192 KB per SM) | ~19 TB/s | Extremely fast, but very small capacity |
| HBM (High Bandwidth Memory) | GPU memory | 40-80 GB | ~1.5-2.0 TB/s | Large capacity, but limited bandwidth |

Key data: SRAM bandwidth is ~10x that of HBM, but its capacity is only ~1/2000 of HBM.

Diagram: GPU Memory Hierarchy & Data Transfer Comparison. SRAM (~20 MB, ~19 TB/s) vs HBM (80 GB, ~2 TB/s); the HBM link is the bandwidth bottleneck.

Standard Attention (6 HBM transfers):

  • Step 1: HBM → SRAM, read Q, K
  • Step 1: SRAM → HBM, write S = QKᵀ
  • Step 2: HBM → SRAM, read S
  • Step 2: SRAM → HBM, write P = softmax(S)
  • Step 3: HBM → SRAM, read P, V
  • Step 3: SRAM → HBM, write O = PV

Flash Attention (2 HBM transfers):

  • Load: HBM → SRAM, read Q, K, V blocks
  • Compute: in SRAM, QKᵀ → scale → mask → softmax → ×V (all in SRAM)
  • Write: SRAM → HBM, write the final O only

Standard Attention requires 6 HBM transfers (3 reads + 3 writes); Flash Attention needs only 2 (1 read phase + 1 write phase).

Standard Attention’s Memory Access Pattern

The problem with the standard implementation is not the computation (FLOPs), but the memory access volume (IO):

Step 1: Read Q, K from HBM → Compute S = QK^T → Write S to HBM     (read 2Nd, write N²)
Step 2: Read S from HBM     → Compute P = softmax(S) → Write P to HBM (read N², write N²)
Step 3: Read P, V from HBM  → Compute O = PV → Write O to HBM        (read N²+Nd, write Nd)

Total HBM accesses: $\Theta(Nd + N^2)$. When $N \gg d$, the $N^2$ term dominates.

Flash Attention's goal: through tiled computation, keep all intermediate results in SRAM, reducing HBM accesses to $\Theta(N^2 d^2 M^{-1})$, where $M$ is the SRAM size.

Tiling Strategy: Tiling Q, K, V

Flash Attention’s first technique is Tiling: splitting Q, K, V into appropriately-sized blocks so that each block fits entirely in SRAM.

Block Size Selection

Given SRAM size MM and head dimension dd:

$$B_c = \left\lceil \frac{M}{4d} \right\rceil, \quad B_r = \min\!\left(\left\lceil \frac{M}{4d} \right\rceil,\; d\right)$$

This ensures that a $B_r \times d$ Q block, a $B_c \times d$ K block, a $B_c \times d$ V block, and a $B_r \times B_c$ local score matrix all fit in SRAM.
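A quick sketch of this arithmetic (the helper name and the assumption that M is counted in matrix elements are mine, not from the paper):

```python
import math

def flash_block_sizes(M_elems, d, N):
    """Block sizes and block counts from SRAM capacity M (in elements), head dim d, seq len N."""
    Bc = math.ceil(M_elems / (4 * d))   # K/V block rows
    Br = min(Bc, d)                     # Q block rows
    Tc = math.ceil(N / Bc)              # number of K/V blocks
    Tr = math.ceil(N / Br)              # number of Q blocks
    return Bc, Br, Tc, Tr

# Example matching the calculator below: M = 102400 elements, d = 64, N = 512
print(flash_block_sizes(102400, 64, 512))  # (400, 64, 2, 8)
```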

Block Size Calculator (interactive widget)

Example values: with d = 64 and M ≈ 100K elements of SRAM, Bc = ⌈M/(4d)⌉ = 400 and Br = min(Bc, d) = 64. For N = 512, Q (512×64) is split into Tr = 8 blocks of 64×64, and K, V (512×64) into Tc = 2 blocks of up to 400×64.

Larger SRAM → larger blocks → fewer outer loops → less HBM access (here, Tr × Tc = 16 block computations).

Nested Loop Structure

Flash Attention uses nested loops:

Outer loop (j = 1 to T_c):     // iterate over K, V blocks
  Load K_j, V_j from HBM to SRAM
  Inner loop (i = 1 to T_r):   // iterate over Q blocks
    Load Q_i, O_i, l_i, m_i from HBM to SRAM
    Compute local attention in SRAM
    Update O_i, l_i, m_i
    Write back to HBM

Where $T_c = \lceil N / B_c \rceil$ is the number of K/V blocks and $T_r = \lceil N / B_r \rceil$ is the number of Q blocks.

Key: the $N \times N$ attention matrix is never fully materialized. Only a $B_r \times B_c$ tile is computed at a time, then discarded.
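Putting the loop structure together with the Online Softmax update derived in the next section, here is a self-contained NumPy sketch of a FlashAttention-1-style forward pass (no masking or dropout; an illustrative re-implementation, not the actual CUDA kernel):

```python
import numpy as np

def flash_attention_v1(Q, K, V, Br, Bc):
    """FlashAttention-1 style forward: outer loop over K/V blocks, inner loop over Q blocks."""
    N, d = Q.shape
    O = np.zeros((N, d))
    l = np.zeros(N)                  # running row sums of exp(S - m)
    m = np.full(N, -np.inf)          # running row maxima
    scale = 1.0 / np.sqrt(d)

    for j in range(0, N, Bc):                        # outer: K/V blocks (kept in "SRAM")
        Kj, Vj = K[j:j + Bc], V[j:j + Bc]
        for i in range(0, N, Br):                    # inner: Q blocks
            Qi = Q[i:i + Br]
            S = Qi @ Kj.T * scale                    # (Br, Bc) local scores
            m_tilde = S.max(axis=1)                  # local row max
            P = np.exp(S - m_tilde[:, None])         # local un-normalized probabilities
            l_tilde = P.sum(axis=1)                  # local row sum

            m_new = np.maximum(m[i:i + Br], m_tilde)
            alpha = np.exp(m[i:i + Br] - m_new)      # corrects the old statistics
            beta = np.exp(m_tilde - m_new)           # corrects the new block
            l_new = alpha * l[i:i + Br] + beta * l_tilde

            O[i:i + Br] = (alpha[:, None] * l[i:i + Br, None] * O[i:i + Br]
                           + beta[:, None] * (P @ Vj)) / l_new[:, None]
            m[i:i + Br], l[i:i + Br] = m_new, l_new
    return O
```

On small inputs this matches the naive implementation from the introduction up to floating-point error; at any moment only one Br×Bc tile of scores exists.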

Online Softmax: Detailed Derivation of the Core Innovation

The biggest challenge with tiled computation is that softmax needs to see an entire row of data to normalize. If you only see part of the columns, how can you compute the correct softmax?

This is exactly what Online Softmax solves.

Standard Softmax Review

For a row vector $x \in \mathbb{R}^B$, the numerically stable softmax is:

$$m(x) = \max_i x_i, \quad f(x) = \begin{bmatrix} e^{x_1 - m(x)} & \cdots & e^{x_B - m(x)} \end{bmatrix}, \quad \ell(x) = \sum_i f(x)_i, \quad \text{softmax}(x) = \frac{f(x)}{\ell(x)}$$

Where $m(x)$ is the maximum value (for numerical stability), $f(x)$ is the shifted exponential vector, and $\ell(x)$ is the normalization constant.

Splitting into Two Blocks

Suppose the vector $x$ is split into two parts $x = [x^{(1)}, x^{(2)}]$, where $x^{(1)}, x^{(2)} \in \mathbb{R}^B$. We want to show that the global softmax can be derived from the local statistics of the two parts.

The global maximum can be obtained from local maxima:

$$m(x) = \max\!\big(m(x^{(1)}),\, m(x^{(2)})\big)$$

The global shifted exponential vector:

$$f(x) = \begin{bmatrix} e^{m(x^{(1)}) - m(x)} f(x^{(1)}) & e^{m(x^{(2)}) - m(x)} f(x^{(2)}) \end{bmatrix}$$

The global normalization constant:

$$\ell(x) = e^{m(x^{(1)}) - m(x)} \ell(x^{(1)}) + e^{m(x^{(2)}) - m(x)} \ell(x^{(2)})$$

Key insight: the exponential correction factor $e^{m(x^{(1)}) - m(x)}$ compensates for the difference between the local max and the global max. If the new block has a larger maximum ($m(x^{(2)}) > m(x^{(1)})$), all previous $e^{x_i - m_{\text{old}}}$ values need to be multiplied by $e^{m_{\text{old}} - m_{\text{new}}}$ to correct them.
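The two-block case can be checked numerically in a few lines (a sketch; the variable names mirror the formulas above):

```python
import numpy as np

x = np.array([2.1, 3.2, 4.1, 1.5])
x1, x2 = x[:2], x[2:]                       # split into two blocks

# Local statistics per block
m1, m2 = x1.max(), x2.max()
f1, f2 = np.exp(x1 - m1), np.exp(x2 - m2)
l1, l2 = f1.sum(), f2.sum()

# Merge: correct each block's statistics to the global max
m = max(m1, m2)
l = np.exp(m1 - m) * l1 + np.exp(m2 - m) * l2
f = np.concatenate([np.exp(m1 - m) * f1, np.exp(m2 - m) * f2])

ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(np.allclose(f / l, ref))  # True: block-wise merge equals the global softmax
```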

Recurrence Algorithm

This decomposition can be applied recursively to any number of blocks. Let $m_j$, $\ell_j$, $O_j$ be the statistics after processing the $j$-th block. When the $(j+1)$-th block arrives:

Step 1: Compute local scores

$$\tilde{S} = Q_i K_{j+1}^T / \sqrt{d}$$

Step 2: Compute local statistics

$$\tilde{m} = \text{rowmax}(\tilde{S}), \quad \tilde{P} = \exp(\tilde{S} - \tilde{m}), \quad \tilde{\ell} = \text{rowsum}(\tilde{P})$$

Step 3: Update global statistics

$$m^{\text{new}} = \max(m_j, \tilde{m}), \quad \ell^{\text{new}} = e^{m_j - m^{\text{new}}} \cdot \ell_j + e^{\tilde{m} - m^{\text{new}}} \cdot \tilde{\ell}$$

Step 4: Correct and update the output

$$O^{\text{new}} = \text{diag}(\ell^{\text{new}})^{-1} \!\left( \text{diag}(\ell_j) \cdot e^{m_j - m^{\text{new}}} \cdot O_j + e^{\tilde{m} - m^{\text{new}}} \cdot \tilde{P} \cdot V_{j+1} \right)$$

What this formula means:

  • $\text{diag}(\ell_j) \cdot O_j$: "un-normalizes" the previous output back to the state before dividing by $\ell$
  • $e^{m_j - m^{\text{new}}}$: correction factor compensating for the difference between old max and new max
  • $e^{\tilde{m} - m^{\text{new}}} \cdot \tilde{P} \cdot V_{j+1}$: the new block's contribution (also corrected to the new max)
  • $\text{diag}(\ell^{\text{new}})^{-1}$: re-normalizes with the new normalization constant

Why Is It Exact?

Online Softmax is not an approximation — it is mathematically exactly equivalent to standard softmax. The entire derivation is based on a simple algebraic identity:

$$\frac{e^{x_i - m_{\text{old}}}}{e^{m_{\text{new}} - m_{\text{old}}}} = e^{x_i - m_{\text{new}}}$$

Regardless of how many blocks the data is split into or what order they are processed in, the final result is exactly the same.
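As a quick check of this exactness, the recurrence can be run over an arbitrary split of a row (a sketch, using the same block values as the demo below); every split yields the identical softmax:

```python
import numpy as np

def online_softmax(blocks):
    """Online softmax over a list of 1-D blocks; returns the concatenated softmax."""
    m, l = -np.inf, 0.0
    for b in blocks:
        m_new = max(m, b.max())
        l = np.exp(m - m_new) * l + np.exp(b - m_new).sum()  # correct old sum, add new block
        m = m_new
    # Second pass only to print the full distribution; Flash Attention never needs it
    return np.concatenate([np.exp(b - m) for b in blocks]) / l

x = np.array([2.1, 3.2, 4.1, 1.5, 2.8, 3.0])
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(np.allclose(online_softmax([x[:2], x[2:4], x[4:]]), ref))  # True
print(np.allclose(online_softmax([x[:3], x[3:]]), ref))          # True
```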

Interactive demo (three blocks: B₁ = [2.1, 3.2], B₂ = [4.1, 1.5], B₃ = [2.8, 3.0]).

Block 1 (initialize): s₁ = [2.1, 3.2] → m₁ = max(2.1, 3.2) = 3.2 → exp(s₁ − m₁) = [0.3329, 1.0000] → ℓ₁ = 1.3329

Running state after block 1: m = 3.2, ℓ = 1.3329, O = [0.7251, 0.1499]

No need to store the full N×N matrix; only the m, ℓ, O accumulators are maintained.

Interactive Demo: Flash Attention Tiled Computation

Below is a small example with $N = 4$, $d = 3$, $B = 2$, demonstrating step by step how Flash Attention processes the first Q block ($t_1, t_2$), interacting with two K/V blocks sequentially, and using the Online Softmax correction to obtain exact results.

Q, K, V matrices and blocking

Standard Attention stores the full N×N attention matrix in HBM, with O(N²) memory. Flash Attention's core idea: split Q, K, V into blocks, compute blockwise in SRAM, and never store the full N×N matrix.

Q ∈ ℝ^(4×3) (rows t₁–t₄, columns d₁–d₃):
[[ 0.05,  0.11,  0.42],
 [ 0.03,  0.89,  0.59],
 [ 0.63,  0.06,  0.25],
 [-0.56,  0.56,  0.76]]

K ∈ ℝ^(4×3):
[[ 0.99, -0.13,  0.51],
 [-0.54, -0.85,  0.13],
 [ 0.17, -0.34,  0.28],
 [ 0.42, -0.63, -0.28]]

V ∈ ℝ^(4×3):
[[-0.07,  0.10,  0.13],
 [ 0.89, -0.59,  0.14],
 [-0.29,  0.79,  0.78],
 [-0.13,  0.65,  0.68]]
Blocking: block size Br = Bc = 2. The first block is rows (t₁, t₂), the second block is rows (t₃, t₄). The demo processes both K/V blocks using Q's first block as the example.
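This toy example can also be checked end to end with the standard_attention and flash_attention_v1 sketches from earlier in this article (again with Br = Bc = 2):

```python
import numpy as np
# Assumes standard_attention and flash_attention_v1 from the earlier sketches are defined.

Q = np.array([[0.05, 0.11, 0.42], [0.03, 0.89, 0.59], [0.63, 0.06, 0.25], [-0.56, 0.56, 0.76]])
K = np.array([[0.99, -0.13, 0.51], [-0.54, -0.85, 0.13], [0.17, -0.34, 0.28], [0.42, -0.63, -0.28]])
V = np.array([[-0.07, 0.10, 0.13], [0.89, -0.59, 0.14], [-0.29, 0.79, 0.78], [-0.13, 0.65, 0.68]])

# Tiled result equals the exact attention output for the 4x3 toy example
print(np.allclose(flash_attention_v1(Q, K, V, Br=2, Bc=2), standard_attention(Q, K, V)))  # True
```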

Memory Reduction from $O(N^2)$ to $O(N)$: Derivation

Standard Attention Memory

The standard implementation needs to store the intermediate matrices $S$ and $P$:

$$\text{Memory} = \underbrace{Nd}_{Q} + \underbrace{Nd}_{K} + \underbrace{Nd}_{V} + \underbrace{N^2}_{S} + \underbrace{N^2}_{P} + \underbrace{Nd}_{O} = \Theta(Nd + N^2)$$

When $N \gg d$, the $O(N^2)$ term dominates.

Flash Attention Memory

Flash Attention only needs to store inputs, outputs, and auxiliary statistics:

$$\text{Memory} = \underbrace{Nd}_{Q} + \underbrace{Nd}_{K} + \underbrace{Nd}_{V} + \underbrace{Nd}_{O} + \underbrace{N}_{\ell} + \underbrace{N}_{m} = \Theta(Nd)$$

No $N^2$ terms at all! The local $B_r \times B_c$ score matrix exists only temporarily in SRAM, never in HBM.
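A back-of-the-envelope comparison in fp16 for a single attention head (the numbers follow directly from the two formulas above):

```python
N, d, bytes_per_elem = 4096, 64, 2

standard = (3 * N * d + 2 * N * N + N * d) * bytes_per_elem   # Q, K, V, S, P, O
flash    = (4 * N * d + 2 * N) * bytes_per_elem               # Q, K, V, O, l, m

print(f"standard: {standard / 2**20:.1f} MiB")  # ~66.0 MiB, dominated by S and P
print(f"flash:    {flash / 2**20:.1f} MiB")     # ~2.0 MiB
```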

Theorem 1 (Dao et al., 2022): The Flash Attention algorithm returns $O = \text{softmax}(QK^T)V$, uses $O(N^2 d)$ FLOPs, and requires only $O(N)$ additional memory.

IO Complexity Analysis: Why It’s Faster

Flash Attention not only saves memory but also saves time, because the bottleneck of Attention on GPUs is not computation but memory access.

Standard Attention IO Complexity

$$\text{HBM accesses} = \Theta(Nd + N^2)$$

Flash Attention IO Complexity

Theorem 2 (Dao et al., 2022): Let $N$ be the sequence length, $d$ the head dimension, and $M$ the SRAM size ($d \leq M \leq Nd$). Standard Attention requires $\Theta(Nd + N^2)$ HBM accesses; Flash Attention requires $\Theta(N^2 d^2 M^{-1})$.

Intuitive understanding:

  • The outer loop iterates over $T_c = N/B_c$ K/V blocks, each loading $\Theta(B_c d) = \Theta(M)$ data
  • The inner loop iterates over $T_r = N/B_r$ Q blocks, each loading and writing back $\Theta(B_r d)$ data
  • Total accesses: $T_c \times (M + T_r \times B_r d) = \Theta(Nd) + \frac{N}{B_c} \times \frac{N}{B_r} \times B_r d = \Theta(Nd) + \frac{N^2 d}{B_c}$, where the second term dominates
  • Since $B_c = \Theta(M/d)$, we get $\frac{N^2 d}{B_c} = \Theta(N^2 d^2 / M)$

For typical parameters ($d = 64\text{-}128$, $M \approx 100\,\text{KB}$), $d^2$ is smaller than $M$, so $N^2 d^2 / M \ll N^2$. In experiments, Flash Attention is 2-4x faster than the standard implementation.
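Plugging these numbers into the two asymptotic expressions (ignoring constant factors, and assuming M is counted in fp16 elements, so 100 KB ≈ 51200 elements):

```python
N, d, M = 4096, 64, 51200   # M in elements (~100 KB of fp16)

standard_io = N * d + N * N            # Theta(Nd + N^2)
flash_io = N * N * d * d // M          # Theta(N^2 d^2 / M)

print(standard_io / flash_io)          # ~12.7x fewer HBM accesses in this setting
```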

Figure: IO complexity comparison, HBM accesses (log scale) as a function of sequence length N (256 to 64K), for Standard Attention $\Theta(Nd + N^2)$ and Flash Attention $\Theta(N^2 d^2 / M)$ (v1 and v2 share the same asymptotic IO complexity). For long sequences, standard IO explodes while Flash Attention's access count grows with a much smaller constant.

Lower Bound

Proposition 3 (Dao et al., 2022): No exact Attention algorithm can achieve $o(N^2 d^2 M^{-1})$ HBM accesses for all $M \in [d, Nd]$.

This means Flash Attention is asymptotically optimal in terms of IO complexity.

Flash Attention v1 vs v2

In 2023, Tri Dao released Flash Attention v2, which further optimized GPU parallelism on top of v1.

| Comparison | Flash Attention v1 | Flash Attention v2 |
| --- | --- | --- |
| Outer loop | Iterates over K/V blocks | Iterates over Q blocks |
| Inner loop | Iterates over Q blocks | Iterates over K/V blocks |
| Inter-block parallelism | Parallel over heads and batch | Additional parallelism over the Q block dimension |
| Non-matmul FLOPs | More | Reduced; better Tensor Core utilization |
| Inter-warp communication | Via shared memory | Reduced inter-warp communication |
| A100 utilization | 25-40% of theoretical peak | 50-73% of theoretical peak |
| Relative speedup | Baseline | ~2x over v1 |

Key Improvements in v2

1. Swapped Loop Order

v1’s outer loop iterates over K/V blocks, inner loop over Q blocks. v2 reverses this: the outer loop iterates over Q blocks, the inner loop over K/V blocks. This way, each thread block is responsible for only one Q block’s output, reducing synchronization overhead, and allows parallelism across the Q block dimension by distributing to different thread blocks (streaming multiprocessors).

2. Reduced Non-matmul FLOPs

GPU Tensor Cores have extremely high throughput for matrix multiplication, but the rescaling, max, and sum operations in Online Softmax are non-matmul FLOPs. v2 reduces the proportion of these operations through algorithmic adjustments.

3. Better Work Partitioning Between Warps

v2 optimizes task partitioning between warps, reducing the number of synchronizations through shared memory and further improving parallel efficiency.

Summary

Flash Attention solves the memory and speed bottlenecks of standard Attention through three core techniques:

| Technique | Problem Solved | Effect |
| --- | --- | --- |
| Tiling | $N \times N$ matrix never stored in HBM | Memory $O(N^2) \to O(N)$ |
| Online Softmax | Correct normalization under tiled computation | Mathematically exact, zero approximation error |
| IO-aware design | Reduces HBM access count | 2-4x speed improvement |

Core formula quick reference:

$$m^{\text{new}} = \max(m^{\text{old}}, \tilde{m}), \quad \ell^{\text{new}} = e^{m^{\text{old}} - m^{\text{new}}} \ell^{\text{old}} + e^{\tilde{m} - m^{\text{new}}} \tilde{\ell}$$
$$O^{\text{new}} = \text{diag}(\ell^{\text{new}})^{-1}\!\left(e^{m^{\text{old}} - m^{\text{new}}} \text{diag}(\ell^{\text{old}}) O^{\text{old}} + e^{\tilde{m} - m^{\text{new}}} \tilde{P} V\right)$$

Flash Attention has become a standard component in modern large model training and inference. Since PyTorch 2.0, torch.nn.functional.scaled_dot_product_attention can automatically dispatch to a Flash Attention backend when the inputs allow it. Understanding its tiling principles is an important foundation for deeply understanding LLM system optimization.
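For reference, a minimal PyTorch usage sketch; the explicit backend selection at the end is optional, and the torch.nn.attention API requires a recent PyTorch version (2.3+) and a supported GPU:

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)  # (batch, heads, N, d)
k = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)

# PyTorch picks an efficient backend (Flash Attention when supported)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Optionally restrict dispatch to the Flash Attention backend
from torch.nn.attention import SDPBackend, sdpa_kernel
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```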