Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
Updated 2026-04-13
Introduction
The core computation of deep learning models consists of dense matrix operations — matmul, conv, layer norm. These operations share a common trait: they are essentially nested loops that iterate regularly over multi-dimensional arrays.
# Naive matrix multiplication: three nested loops
for i in range(M):
    for j in range(N):
        for k in range(K):
            C[i, j] += A[i, k] * B[k, j]
This code is extremely inefficient on modern hardware. The reason: modern CPU/GPU performance bottlenecks have shifted from computation to memory access. L1 cache access latency is 4 cycles, DRAM access latency is 200+ cycles. Naive loop implementations cause massive cache misses: each access to B[k,j] results in poor cache line utilization due to stride access in the j dimension.
Polyhedral Optimization is a compiler technique specialized for nested loops. Its core idea: model the loop’s iteration space and dependence relations as polyhedra, and use linear algebra and integer linear programming (ILP) tools to automatically find optimal loop transformations.
This theory matured in the 1980s-90s (work by Banerjee, Wolfe, Lam, etc.) but remained largely in academia and HPC. MLIR’s Affine Dialect brings polyhedral models into mainstream ML compiler infrastructure for the first time. This article deeply analyzes the mathematical foundations and engineering practices of this technique.
Why Loop Transformations: Cache Locality & Parallelism
Performance Issues of Naive Implementation
A simple matmul performance comparison (1024×1024 matrices, single-threaded, Intel i7-12700K):
| Implementation | Performance (GFLOPS) | Relative Speedup |
|---|---|---|
| Naive IJK loops | 2.3 | 1× baseline |
| Loop Interchange (IKJ) | 15.7 | 6.8× |
| Tiling (32×32) | 42.1 | 18.3× |
| Tiling + SIMD | 118.6 | 51.6× |
| cuBLAS (GPU) | 15,320 | 6,661× |
The naive implementation achieves only 0.1% of theoretical peak. Through loop transformations (interchange + tiling), we gain 18× speedup, reaching ~2% of peak. The remaining gap comes from SIMD vectorization, register blocking, software pipelining, and other lower-level optimizations.
Cache Miss Analysis
Why is naive IJK so inefficient? Analyze cache behavior with perf:
$ perf stat -e L1-dcache-loads,L1-dcache-load-misses ./naive_matmul
10,240,000,000 L1-dcache-loads # 3 accesses per element (read A, B, write C)
7,680,000,000 L1-dcache-load-misses # 75% L1 miss rate!
The problem is B[k][j]: with row-major storage (the C/C++ default), elements within a row are contiguous, but the IJK order varies k in the innermost loop, so successive accesses to B[k][j] are stride-N. Each access lands on a different cache line, and that line is usually evicted before it is needed again.
Change to IKJ order (swap middle and innermost loops):
for (int i = 0; i < M; i++)
    for (int k = 0; k < K; k++)
        for (int j = 0; j < N; j++)
            C[i][j] += A[i][k] * B[k][j]; // Now j loop is innermost, B[k][j] is contiguous
$ perf stat -e L1-dcache-loads,L1-dcache-load-misses ./ikj_matmul
10,240,000,000 L1-dcache-loads
512,000,000 L1-dcache-load-misses # 5% L1 miss rate — 15× improvement!
This is the power of loop interchange: by changing loop order, we transform stride-N access into stride-1 (contiguous) access, significantly improving cache locality.
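The locality effect can be made concrete with a toy cache model. The sketch below (all parameters are illustrative assumptions: a single direct-mapped cache level, 8 floats per line, 64 slots, row-major B) counts misses on the B[k][j] accesses under both loop orders:

```python
# Toy direct-mapped cache: count misses on B[k][j] for IJK vs IKJ order.
# Assumed parameters (illustrative only): 8 floats per line, 64 cache slots.
LINE = 8      # floats per cache line
NLINES = 64   # number of slots (direct-mapped)

def count_b_misses(order, n):
    cache = [None] * NLINES          # slot -> cached line number
    misses = 0
    for a in range(n):
        for b in range(n):
            for c in range(n):
                # Map the three loop counters onto (i, j, k) per loop order.
                i, j, k = {"ijk": (a, b, c), "ikj": (a, c, b)}[order]
                addr = k * n + j     # flat row-major index of B[k][j]
                line = addr // LINE
                slot = line % NLINES
                if cache[slot] != line:
                    misses += 1
                    cache[slot] = line
    return misses

ijk = count_b_misses("ijk", 64)
ikj = count_b_misses("ikj", 64)
print(ijk, ikj)   # IKJ walks B contiguously, so far fewer misses
```

Even this crude model reproduces the qualitative result: with k innermost, every B access misses; with j innermost, a miss is amortized over a full cache line.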
Tiling & Cache Blocking
But even with IKJ, when matrices are large (e.g., 8192×8192), the entire B matrix cannot fit in L1 cache (typical L1 size: 32-64 KB). The solution is Tiling (blocking): partition large matrices into small tiles that can fit in cache.
Tiling transforms the original three-level loops into six-level loops: outer three levels traverse tiles, inner three levels compute within tiles. By choosing appropriate tile sizes (typically 32×32 or 64×64), tiles can reside entirely in L1 cache, approaching theoretical peak performance.
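The six-level structure can be sketched in plain Python (illustrative only; a real implementation would be generated IR or C/assembly). `TILE` is an assumed tile size, and the `min` bounds handle matrices that are not tile-size multiples:

```python
# Tiled matmul: outer three loops walk tiles, inner three compute within a tile.
def matmul_tiled(A, B, M, N, K, TILE=32):
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, TILE):              # tile indices
        for j0 in range(0, N, TILE):
            for k0 in range(0, K, TILE):
                for i in range(i0, min(i0 + TILE, M)):    # intra-tile indices
                    for j in range(j0, min(j0 + TILE, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + TILE, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

def matmul_naive(A, B, M, N, K):
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]
```

Both versions perform exactly the same multiply-adds in the same order per element; only the traversal of the iteration space changes, which is why tiling is always legal for matmul.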
Polyhedral Model Fundamentals
Iteration Space
An n-level nested loop corresponds to an n-dimensional iteration space:
for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
        S: C[i][j] = A[i][j] + B[i][j];
The iteration space of this loop is a 2D polyhedron (a rectangle):

D = { (i, j) ∈ Z² | 0 ≤ i < M, 0 ≤ j < N }

Each point (i, j) corresponds to one execution of statement S (called a statement instance).
Affine Constraints
The polyhedral model only handles loops with affine constraints: loop bounds and array subscripts must be linear functions of iteration variables.
Legal affine loop:
for (int i = 0; i < N; i++)
    for (int j = i; j < 2*i + N; j++)
        A[i+j][2*i-j] = ...;
Iteration domain: D = { (i, j) | 0 ≤ i < N, i ≤ j < 2i + N }
Access function: f(i, j) = (i + j, 2i − j)
Illegal non-affine loop:
for (int i = 0; i < N; i++)
    for (int j = 0; j < i*i; j++) // i*i is quadratic, non-affine
        A[i*j] = ...; // i*j is quadratic, non-affine
The restriction seems strict, but in practice 95%+ of ML workloads satisfy affine constraints: core computations of matmul, conv, pooling, layer norm, softmax can all be expressed as affine loops.
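Affinity of a subscript function can be spot-checked numerically: an affine f satisfies f(x + y) − f(x) − f(y) + f(0) = 0 for all points x, y. The sketch below applies this identity over a sample grid (a heuristic illustration; real compilers check affinity symbolically on the expression tree):

```python
import itertools

# Numeric spot-check: an affine f satisfies
# f(x + y) - f(x) - f(y) + f(0) == 0 for all integer points x, y.
def looks_affine(f, dim, samples=range(-3, 4)):
    zero = (0,) * dim
    for x in itertools.product(samples, repeat=dim):
        for y in itertools.product(samples, repeat=dim):
            xy = tuple(a + b for a, b in zip(x, y))
            if f(*xy) - f(*x) - f(*y) + f(*zero) != 0:
                return False
    return True

print(looks_affine(lambda i, j: i + j, 2))        # True: affine
print(looks_affine(lambda i, j: 2*i - j + 5, 2))  # True: affine with constant
print(looks_affine(lambda i, j: i * j, 2))        # False: i*j is quadratic
```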
Dependence Relations
Dependence describes execution order constraints between two statement instances:
for (int i = 1; i < N; i++)
    A[i] = A[i-1] + 1; // S
Statement instance S(i) depends on instance S(i−1) because it reads A[i-1], which was written by the previous iteration. This is a flow dependence (RAW: Read After Write).
Represented as a dependence vector: d = (1), meaning “the current iteration depends on the previous iteration”.
In 2D iteration space:
for (int i = 1; i < N; i++)
    for (int j = 1; j < N; j++)
        A[i][j] = A[i-1][j-1] * 2; // S
Statement instance S(i, j) depends on instance S(i−1, j−1); the dependence vector is d = (1, 1) (a diagonal dependence).
Dependence types:
- Flow (RAW): write then read, must preserve order
- Anti (WAR): read then write, can be eliminated via register renaming
- Output (WAW): two writes, must preserve write order
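For constant-distance accesses, the dependence vector can be read off by brute force: enumerate iterations, record which iteration wrote each element, and subtract source from sink. A toy illustration for the diagonal example above (real compilers derive this symbolically):

```python
# Brute-force flow-dependence (RAW) vectors for A[i][j] = A[i-1][j-1] * 2.
def flow_dep_vectors(n):
    writer = {}   # array element -> iteration that last wrote it
    deps = set()
    for i in range(1, n):               # iterations in execution (lex) order
        for j in range(1, n):
            read = (i - 1, j - 1)       # element read this iteration
            if read in writer:
                wi, wj = writer[read]
                deps.add((i - wi, j - wj))  # sink iteration - source iteration
            writer[(i, j)] = (i, j)     # element (i,j) is written at iter (i,j)
    return deps

print(flow_dep_vectors(8))   # {(1, 1)}: the diagonal dependence
```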
Mathematical Representation of Loop Transformations
Affine Transformation Matrix
A loop transformation can be represented as an affine transformation T(i) = A·i + b,
where A is an integer matrix (a unimodular matrix, |det A| = 1, ensuring invertibility over the integers), and b is an offset vector.
Loop Interchange: swap loops 1 and 2 — A = [[0, 1], [1, 0]], mapping (i, j) → (j, i)
Skewing: tilt the iteration space along the diagonal, enabling wavefront parallelism — A = [[1, 0], [1, 1]], mapping (i, j) → (i, i + j)
Tiling: split into outer loops (tile indices) and inner loops (intra-tile indices), implemented via strip-mining + permutation:
(i, j) → (⌊i/T⌋, ⌊j/T⌋, i mod T, j mod T), where T is the tile size.
Transformation Legality
Legality criterion: a loop transformation is legal if and only if it preserves all dependence relations: for each dependence vector d, the transformed dependence vector A·d must be lexicographically non-negative.
Lexicographically non-negative: a vector v is lexicographically non-negative if v = 0 or its first non-zero component v_k satisfies v_k > 0.
Intuition: after transformation, the dependence source must still execute before the dependence target. If (A·d)_k > 0 and all earlier components are zero, the source lies in an earlier iteration of the k-th loop level, guaranteeing correct execution order.
Example: for dependence vector d = (1, 0) (a dependence along the i dimension):
- Loop interchange A = [[0, 1], [1, 0]]: A·d = (0, 1)
  Lexicographically non-negative (first component 0, second positive) → legal
- Skewing A = [[1, 0], [1, 1]]: A·d = (1, 1)
  Lexicographically non-negative (first component positive) → legal
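The legality check is entirely mechanical: multiply each dependence vector by the transformation matrix and test lexicographic non-negativity. A minimal sketch (loop reversal is added here as a hypothetical third transformation to show an illegal case):

```python
# A transformation is legal iff every transformed dependence vector
# A @ d is lexicographically non-negative.
def lex_nonneg(v):
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return True                  # all zero: same iteration, still legal

def transform_legal(A, deps):
    def apply(A, d):
        return [sum(a * x for a, x in zip(row, d)) for row in A]
    return all(lex_nonneg(apply(A, d)) for d in deps)

interchange = [[0, 1], [1, 0]]
skew        = [[1, 0], [1, 1]]
reverse_i   = [[-1, 0], [0, 1]]  # reversal of the i loop (hypothetical example)

print(transform_legal(interchange, [(1, 0)]))  # True:  (1,0) -> (0,1)
print(transform_legal(skew,        [(1, 0)]))  # True:  (1,0) -> (1,1)
print(transform_legal(reverse_i,   [(1, 0)]))  # False: (1,0) -> (-1,0)
```

Reversing the i loop would make the dependence sink run before its source, so the checker correctly rejects it.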
MLIR Affine Dialect
Core Concepts
MLIR’s Affine Dialect provides specialized IR for expressing affine loops and memory accesses.
Core operations:
- affine.for: affine loop
- affine.load / affine.store: memory access with affine subscripts
- affine.apply: apply an affine map
- affine.if: affine conditional branch
- affine.parallel: parallel loop
Affine Map: Affine maps are the core abstraction of the Affine Dialect, representing a set of affine expressions:
// (d0, d1)[s0] -> (d0 + s0, d1 * 2, d0 + d1)
// d0, d1 are dimension variables, s0 is a symbol
#map = affine_map<(d0, d1)[s0] -> (d0 + s0, d1 * 2, d0 + d1)>
Dimensions vs Symbols:
- Dimensions: correspond to loop iteration variables, treated as varying in dependence analysis
- Symbols: correspond to constant parameters (like matrix size N), treated as invariant in dependence analysis
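The semantics of applying an affine map are easy to mirror in Python: dimension values come from loop induction variables, symbol values from runtime-constant parameters. A sketch for the `#map` shown above:

```python
# Model of #map = affine_map<(d0, d1)[s0] -> (d0 + s0, d1 * 2, d0 + d1)>.
# Dims (d0, d1): loop induction variables; symbol s0: a runtime constant.
def apply_map(dims, syms):
    d0, d1 = dims
    (s0,) = syms
    return (d0 + s0, d1 * 2, d0 + d1)

print(apply_map((3, 4), (10,)))   # (13, 8, 7)
```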
Example: Matmul Affine IR
Naive matrix multiplication in Affine IR:
func.func @matmul(%A: memref<1024x1024xf32>,
                  %B: memref<1024x1024xf32>,
                  %C: memref<1024x1024xf32>) {
  %c0 = arith.constant 0.0 : f32
  affine.for %i = 0 to 1024 {
    affine.for %j = 0 to 1024 {
      affine.for %k = 0 to 1024 {
        %a = affine.load %A[%i, %k] : memref<1024x1024xf32>
        %b = affine.load %B[%k, %j] : memref<1024x1024xf32>
        %c = affine.load %C[%i, %j] : memref<1024x1024xf32>
        %prod = arith.mulf %a, %b : f32
        %sum = arith.addf %c, %prod : f32
        affine.store %sum, %C[%i, %j] : memref<1024x1024xf32>
      }
    }
  }
  return
}
Key information:
- All loop bounds are constants (0 to 1024), compiler can fully determine iteration space
- All memory access subscripts are affine expressions (%i, %k, %j)
- No side effects (other than memref accesses), facilitating analysis and transformation
Affine Pass Pipeline
MLIR provides a series of Affine optimization passes (in mlir-opt):
# Pipeline: fusion, tiling, unrolling, scratchpad copy, scalar replacement, cleanup
mlir-opt matmul.mlir \
  -affine-loop-fusion \
  -affine-loop-tile="tile-size=32" \
  -affine-loop-unroll="unroll-factor=4" \
  -affine-data-copy-generate \
  -affine-scalrep \
  -canonicalize -cse
-affine-loop-fusion: fuses adjacent affine loops, reducing memory round-trips
// Before
affine.for %i = 0 to N {
  %x = affine.load %A[%i]
  affine.store %x, %B[%i]
}
affine.for %i = 0 to N {
  %y = affine.load %B[%i]
  %z = arith.mulf %y, %c : f32
  affine.store %z, %C[%i]
}
// After
affine.for %i = 0 to N {
  %x = affine.load %A[%i]
  %z = arith.mulf %x, %c : f32
  affine.store %z, %C[%i]
}
// B's intermediate result is passed directly in a register; no memory write-back
-affine-loop-tile: performs tiling transformation
// Before
affine.for %i = 0 to 1024 {
  affine.for %j = 0 to 1024 {
    // body
  }
}
// After (tile-size=32)
affine.for %i0 = 0 to 1024 step 32 {
  affine.for %j0 = 0 to 1024 step 32 {
    affine.for %i1 = 0 to min(32, 1024-%i0) {
      affine.for %j1 = 0 to min(32, 1024-%j0) {
        // %i = %i0 + %i1, %j = %j0 + %j1
        // body
      }
    }
  }
}
-affine-data-copy-generate: generates explicit scratchpad buffers for tiles
// Before (tiled loop)
affine.for %i0 = 0 to 1024 step 32 {
  affine.for %k0 = 0 to 1024 step 32 {
    affine.for %i1 = 0 to 32 {
      affine.for %k1 = 0 to 32 {
        %a = affine.load %A[%i0+%i1, %k0+%k1] // Read from DRAM each time
        // ...
      }
    }
  }
}
// After
%A_tile = memref.alloc() : memref<32x32xf32> // L1 scratchpad
affine.for %i0 = 0 to 1024 step 32 {
  affine.for %k0 = 0 to 1024 step 32 {
    // DMA copy: A[i0:i0+32, k0:k0+32] -> A_tile
    affine.for %i1 = 0 to 32 {
      affine.for %k1 = 0 to 32 {
        %a = affine.load %A_tile[%i1, %k1] // Read from L1
        // ...
      }
    }
  }
}
This is crucial for explicitly managed memory hierarchies like GPU shared memory and NPU SRAM.
Dependence Analysis Infrastructure
MLIR Affine’s core value lies in its built-in dependence analysis framework.
The compiler solves the following problem with Presburger-arithmetic machinery (MLIR ships its own implementation, FlatAffineValueConstraints; classical polyhedral tools rely on the Integer Set Library, ISL):
Problem: given two memory accesses A[f₁(i)] and A[f₂(i′)], can they access the same address?
Mathematical formulation: solve the integer constraint system
f₁(i) = f₂(i′), with i ∈ D and i′ ∈ D
If a solution exists, there is a dependence; the dependence vector is d = i′ − i.
Example:
affine.for %i = 0 to N {
  affine.for %j = 0 to N {
    %1 = affine.load %A[%i + %j, %i - %j]
    %2 = affine.load %A[%j, %i]
    // Q: Can %1 and %2 access the same address?
  }
}
We need to solve (i + j, i − j) = (j, i), i.e., i + j = j and i − j = i
→ i = 0 and j = 0, so there is exactly one solution
→ a dependence exists, but only at the point (0, 0)
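The same conclusion falls out of brute-force enumeration over a bounded domain (what the solver establishes symbolically, for parametric N):

```python
# Enumerate the iteration domain and collect every point where the two
# accesses A[i+j, i-j] and A[j, i] hit the same element.
def alias_points(n):
    return [(i, j)
            for i in range(n)
            for j in range(n)
            if (i + j, i - j) == (j, i)]   # same subscript tuple?

print(alias_points(16))   # [(0, 0)]: only the origin aliases
```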
ISL uses parametric integer programming techniques to efficiently solve such problems. Dependence analysis results directly guide:
- Loop transformation legality checking: do all dependences satisfy lexicographic order constraints?
- Loop fusion determination: are there dependences preventing fusion?
- Vectorization determination: are there cross-iteration dependences in the innermost loop?
Practical Loop Transformations
1. Loop Interchange
Goal: improve cache locality, transform stride-N access to stride-1.
Applicable scenarios:
- Array access patterns don’t match memory layout (e.g., column-major access to row-major arrays)
- Vectorization needs (move vectorizable dimension to innermost loop)
MLIR example:
// Before
affine.for %i = 0 to M {
affine.for %j = 0 to N {
%a = affine.load %A[%j, %i] // stride-M access
// ...
}
}
// After (interchange)
affine.for %j = 0 to N {
affine.for %i = 0 to M {
%a = affine.load %A[%j, %i] // stride-1 access (assuming row-major)
// ...
}
}
Pass usage: via MLIR's affine loop utilities (e.g., permuteLoops) or manually with the Transform Dialect; upstream mlir-opt does not ship a dedicated interchange flag.
2. Loop Tiling
Goal: partition large iteration space into small blocks so each block’s data resides in cache.
Tile size selection principles:
- L1 tiling: tile size ~1/3 of L1 cache size (typically three arrays A, B, C)
- L2 tiling: tile outer loops again
- Register blocking: innermost micro-kernel tiling (e.g., 4×4 or 8×8)
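The L1 rule of thumb reduces to arithmetic: with three T×T f32 tiles (A, B, C) live at once, pick the largest power-of-two T whose footprint 3·T²·4 bytes fits the budget. A sketch (the cache sizes are assumptions, not probed values):

```python
# Largest power-of-two tile size T such that n_tiles T x T tiles of
# elem_bytes-wide elements fit the cache budget: n_tiles * T^2 * elem_bytes.
def pick_tile(cache_bytes, elem_bytes=4, n_tiles=3):
    t = 1
    while n_tiles * (2 * t) ** 2 * elem_bytes <= cache_bytes:
        t *= 2
    return t

print(pick_tile(32 * 1024))    # 32 KB L1 -> T = 32 (footprint 3*32^2*4 = 12 KB)
print(pick_tile(1024 * 1024))  # 1 MB L2  -> T = 256
```

Note the 32 KB case lands on a 12 KB footprint, roughly the "1/3 of L1" guideline above; leaving slack tolerates conflict misses and other resident data.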
MLIR Pass:
mlir-opt -affine-loop-tile="tile-sizes=32,32,32" matmul.mlir
Generated IR transforms three-level loops into six levels:
affine.for %i0 = 0 to 1024 step 32 {
  affine.for %j0 = 0 to 1024 step 32 {
    affine.for %k0 = 0 to 1024 step 32 {
      affine.for %i1 = ... {
        affine.for %j1 = ... {
          affine.for %k1 = ... {
            // body
          }
        }
      }
    }
  }
}
3. Loop Unrolling
Goal: reduce loop control overhead, expose instruction-level parallelism (ILP).
MLIR Pass:
mlir-opt -affine-loop-unroll="unroll-factor=4" matmul.mlir
// Before
affine.for %i = 0 to N {
  %x = affine.load %A[%i]
  affine.store %x, %B[%i]
}
// After (unroll-factor=4)
affine.for %i = 0 to N step 4 {
  %x0 = affine.load %A[%i]
  affine.store %x0, %B[%i]
  %x1 = affine.load %A[%i+1]
  affine.store %x1, %B[%i+1]
  %x2 = affine.load %A[%i+2]
  affine.store %x2, %B[%i+2]
  %x3 = affine.load %A[%i+3]
  affine.store %x3, %B[%i+3]
}
// + tail loop for N % 4
After unrolling, CPU can execute multiple independent loads in parallel, hiding memory latency.
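The unroll-plus-tail structure translates directly into any language; here is a Python sketch of the transformed copy loop (a model of the transformation, not how MLIR emits it):

```python
# Copy B[i] = A[i], unrolled by 4, with a tail loop for the remainder.
def copy_unrolled(A, n):
    B = [0] * n
    main = n - n % 4
    i = 0
    while i < main:          # unrolled body: 4 independent load/store pairs
        B[i]     = A[i]
        B[i + 1] = A[i + 1]
        B[i + 2] = A[i + 2]
        B[i + 3] = A[i + 3]
        i += 4
    while i < n:             # tail loop handles the n % 4 leftover iterations
        B[i] = A[i]
        i += 1
    return B

print(copy_unrolled(list(range(10)), 10))   # same contents as the input
```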
4. Loop Fusion
Goal: reduce intermediate array memory round-trips, improve temporal locality.
// Before
func.func @separate_loops(%A: memref<1024xf32>) {
  %B = memref.alloc() : memref<1024xf32>
  affine.for %i = 0 to 1024 {
    %a = affine.load %A[%i]
    %b = arith.mulf %a, %a : f32
    affine.store %b, %B[%i]
  }
  affine.for %i = 0 to 1024 {
    %b = affine.load %B[%i]
    %c = arith.addf %b, %b : f32
    affine.store %c, %A[%i]
  }
  return
}
// After (-affine-loop-fusion)
func.func @fused_loops(%A: memref<1024xf32>) {
  affine.for %i = 0 to 1024 {
    %a = affine.load %A[%i]
    %b = arith.mulf %a, %a : f32
    %c = arith.addf %b, %b : f32
    affine.store %c, %A[%i]
  }
  return
}
// %B completely eliminated; the intermediate result stays in a register
Fusion legality conditions:
- Both loops have identical iteration domains (or can be aligned)
- Second loop cannot depend on cross-iteration data written by first loop
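The semantic equivalence of the fused form can be modeled in Python: the unfused version materializes the intermediate B, the fused version keeps it in a local value (a "register"):

```python
# Unfused: square into a temporary B, then double B back out.
def unfused(A):
    B = [a * a for a in A]       # loop 1: B[i] = A[i] * A[i]
    return [b + b for b in B]    # loop 2: out[i] = B[i] + B[i]

# Fused: one loop; the intermediate never touches memory.
def fused(A):
    out = []
    for a in A:
        b = a * a                # former loop-1 body
        out.append(b + b)        # former loop-2 body; B never materialized
    return out

print(fused([1.0, 2.0, 3.0]))   # [2.0, 8.0, 18.0]
```

Fusion is legal here because loop 2 reads only B[i], written by the same iteration i of loop 1 — there is no cross-iteration dependence between the two loops.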
Polyhedral Applications & Limitations in ML
Application Scenarios
1. High-dimensional Tensor Operations
Conv, pooling, batch norm, etc., are high-dimensional affine loops where polyhedral optimization can automatically find optimal tiling strategies.
2. Tensor Contraction
Einstein summation (einsum) and generalized matrix multiplication (GEMM) are ideal application scenarios for polyhedral models. MLIR Linalg Dialect is designed based on this idea.
3. Automatic Scheduling
Projects like Tiramisu and Tensor Comprehensions attempt to use polyhedral models to automatically generate high-performance kernels.
4. NPU/Accelerator Codegen
Many custom hardware (TPU, Graphcore IPU, Cerebras WSE) have explicit memory hierarchies (SRAM, DRAM) and DMA control. Affine Dialect’s affine.dma_start / affine.dma_wait can directly model such hardware.
Limitations
1. Non-affine Code
Polyhedral model cannot handle:
- Indirect accesses (A[B[i]])
- Data-dependent control flow (if (A[i] > 0))
- Sparse matrix operations
These scenarios require other techniques (e.g., inspector-executor, sparse polyhedral model).
2. Compilation Time
ISL solving ILP problems can have high complexity (worst-case NP-hard). For ultra-large models (100+ layers), polyhedral analysis may become a compilation bottleneck.
3. Huge Heuristic Search Space
The combination space of tiling sizes, loop orders, unroll factors, etc., is exponential. While auto-tuning tools exist (e.g., Pluto), significant trial-and-error is still required.
4. Interaction with Other Optimizations
Polyhedral optimization interactions with backend optimizations (register allocation, instruction scheduling, vectorization) are complex. Over-unrolling may cause register spilling; over-tiling may increase loop control overhead.
MLIR Transform Dialect Preview
MLIR’s Transform Dialect provides a programmable transformation framework, allowing users to describe optimization strategies using MLIR IR itself.
// Transform script
transform.sequence failures(propagate) {
^bb0(%arg0: !transform.any_op):
  %matmul = transform.structured.match ops{["linalg.matmul"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
  // Tile the outer two levels (32x32)
  %tiled, %loops:2 = transform.structured.tile %matmul [32, 32, 0]
    : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op)
  // Vectorize the innermost level
  %vectorized = transform.structured.vectorize %tiled
    : (!transform.any_op) -> !transform.any_op
  transform.yield
}
This declaratively expresses matmul tiling + vectorization strategy as IR, enabling:
- Version management and reuse (save tuning strategies as .mlir files)
- Python bindings for autotuning
- Pattern matching to apply different strategies to different IR structures
Transform Dialect is a core innovation distinguishing MLIR from traditional compilers, making compilation strategies first-class citizens rather than hard-coded passes.
Summary
Polyhedral optimization is one of the most powerful compiler techniques, formalizing loop transformations as integer linear programming problems, enabling automated optimization.
Core ideas:
- Iteration space modeling: represent nested loops as polyhedra
- Dependence analysis: describe data dependences with affine constraints
- Transformation legality: ensure correctness via lexicographic ordering checks
- Optimization goals: optimize cache locality and parallelism through tiling, interchange, fusion, etc.
MLIR Affine Dialect value:
- Provides unified IR representation (affine.for, affine.load)
- Built-in dependence analysis framework (its own Presburger-arithmetic library)
- Composable pass pipeline (tiling → fusion → unroll → vectorize)
- Transform Dialect makes optimization strategies programmable
In ML compilers, polyhedral optimization is mainly applied to dense tensor operation kernel generation (matmul, conv, pooling). Together with other techniques (operator fusion, memory planning, instruction selection), it forms a complete compilation pipeline.
Further Reading
Tutorials:
- MLIR Affine Dialect Documentation
- MLIR: Scaling Compiler Infrastructure for Domain Specific Computation (paper)
Classic Papers:
- A Practical Automatic Polyhedral Parallelizer (Pluto)
- Optimizing Compilers for Modern Architectures (Allen & Kennedy classic textbook)
Advanced Topics:
- Polly - Polyhedral optimizations for LLVM
- ISL: An Integer Set Library for the Polyhedral Model
- MLIR Transform Dialect Tutorial
- Tiramisu: A Polyhedral Compiler for Dense and Sparse Deep Learning
Related Tools: