Graph Optimization Passes (Part 2): Polyhedral Optimization & Loop Transformations
Updated 2026-04-13
Introduction
The core computation of deep learning models consists of dense matrix operations — matmul, conv, layer norm. These operations share a common trait: they are essentially nested loops that iterate regularly over multi-dimensional arrays.
# Naive matrix multiplication: three nested loops
for i in range(M):
    for j in range(N):
        for k in range(K):
            C[i, j] += A[i, k] * B[k, j]
This code is extremely inefficient on modern hardware. The reason: modern CPU/GPU performance bottlenecks have shifted from computation to memory access. L1 cache access latency is 4 cycles, DRAM access latency is 200+ cycles. Naive loop implementations cause massive cache misses: each access to B[k,j] results in poor cache line utilization due to stride access in the j dimension.
Polyhedral Optimization is a compiler technique specialized for nested loops. Its core idea: model the loop’s iteration space and dependence relations as polyhedra, and use linear algebra and integer linear programming (ILP) tools to automatically find optimal loop transformations.
This theory matured in the 1980s-90s (work by Banerjee, Wolfe, Lam, etc.) but remained largely in academia and HPC. MLIR’s Affine Dialect brings polyhedral models into mainstream ML compiler infrastructure for the first time. This article deeply analyzes the mathematical foundations and engineering practices of this technique.
Why Loop Transformations: Cache Locality & Parallelism
Performance Issues of Naive Implementation
A simple matmul performance comparison (1024×1024 matrices, single-threaded, Intel i7-12700K):
| Implementation | Performance (GFLOPS) | Relative Speedup |
|---|---|---|
| Naive IJK loops | 2.3 | 1× baseline |
| Loop Interchange (IKJ) | 15.7 | 6.8× |
| Tiling (32×32) | 42.1 | 18.3× |
| Tiling + SIMD | 118.6 | 51.6× |
| cuBLAS (GPU) | 15,320 | 6,661× |
The naive implementation achieves only 0.1% of theoretical peak. Through loop transformations (interchange + tiling), we gain 18× speedup, reaching ~2% of peak. The remaining gap comes from SIMD vectorization, register blocking, software pipelining, and other lower-level optimizations.
Cache Miss Analysis
Why is naive IJK so inefficient? Analyze cache behavior with perf:
$ perf stat -e L1-dcache-loads,L1-dcache-load-misses ./naive_matmul
10,240,000,000 L1-dcache-loads # 3 accesses per element (read A, B, write C)
7,680,000,000 L1-dcache-load-misses # 75% L1 miss rate!
The problem is B[k][j]: with row-major storage (the C/C++ default), elements within a row are contiguous, but the IJK order varies k in the innermost loop, so successive accesses to B[k][j] are stride-N. Each access lands on a different cache line, and that line is usually evicted before it is needed again.
Change to IKJ order (swap middle and innermost loops):
for (int i = 0; i < M; i++)
    for (int k = 0; k < K; k++)
        for (int j = 0; j < N; j++)
            C[i][j] += A[i][k] * B[k][j]; // Now j loop is innermost, B[k][j] is contiguous
$ perf stat -e L1-dcache-loads,L1-dcache-load-misses ./ikj_matmul
10,240,000,000 L1-dcache-loads
512,000,000 L1-dcache-load-misses # 5% L1 miss rate — 15× improvement!
This is the power of loop interchange: by changing loop order, we transform stride-N access into stride-1 (contiguous) access, significantly improving cache locality.
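The locality effect can be made concrete with a toy cache model. The sketch below (all parameters are illustrative assumptions: a single direct-mapped cache level, 8 floats per line, 64 slots, row-major B) counts misses on the B[k][j] accesses under both loop orders:

```python
# Toy direct-mapped cache: count misses on B[k][j] for IJK vs IKJ order.
# Assumed parameters (illustrative only): 8 floats per line, 64 cache slots.
LINE = 8      # floats per cache line
NLINES = 64   # number of slots (direct-mapped)

def count_b_misses(order, n):
    cache = [None] * NLINES          # slot -> cached line number
    misses = 0
    for a in range(n):
        for b in range(n):
            for c in range(n):
                # Map the three loop counters onto (i, j, k) per loop order.
                i, j, k = {"ijk": (a, b, c), "ikj": (a, c, b)}[order]
                addr = k * n + j     # flat row-major index of B[k][j]
                line = addr // LINE
                slot = line % NLINES
                if cache[slot] != line:
                    misses += 1
                    cache[slot] = line
    return misses

ijk = count_b_misses("ijk", 64)
ikj = count_b_misses("ikj", 64)
print(ijk, ikj)   # IKJ walks B contiguously, so far fewer misses
```

Even this crude model reproduces the qualitative result: with k innermost, every B access misses; with j innermost, a miss is amortized over a full cache line.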
Tiling & Cache Blocking
But even with IKJ, when matrices are large (e.g., 8192×8192), the entire B matrix cannot fit in L1 cache (typical L1 size: 32-64 KB). The solution is Tiling (blocking): partition large matrices into small tiles that can fit in cache.
Tiling transforms the original three-level loops into six-level loops: outer three levels traverse tiles, inner three levels compute within tiles. By choosing appropriate tile sizes (typically 32×32 or 64×64), tiles can reside entirely in L1 cache, approaching theoretical peak performance.
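The six-level structure can be sketched in plain Python (illustrative only; a real implementation would be generated IR or C/assembly). `TILE` is an assumed tile size, and the `min` bounds handle matrices that are not tile-size multiples:

```python
# Tiled matmul: outer three loops walk tiles, inner three compute within a tile.
def matmul_tiled(A, B, M, N, K, TILE=32):
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, TILE):              # tile indices
        for j0 in range(0, N, TILE):
            for k0 in range(0, K, TILE):
                for i in range(i0, min(i0 + TILE, M)):    # intra-tile indices
                    for j in range(j0, min(j0 + TILE, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + TILE, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

def matmul_naive(A, B, M, N, K):
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]
```

Both versions perform exactly the same multiply-adds in the same order per element; only the traversal of the iteration space changes, which is why tiling is always legal for matmul.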
Polyhedral Model Fundamentals
Iteration Space
An n-level nested loop corresponds to an n-dimensional iteration space:
for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
        S: C[i][j] = A[i][j] + B[i][j];
The iteration space of this loop is a 2D polyhedron (a rectangle):

D = { (i, j) ∈ Z² | 0 ≤ i < M, 0 ≤ j < N }

Each point (i, j) corresponds to one execution of statement S (called a statement instance).
Affine Constraints
The polyhedral model only handles loops with affine constraints: loop bounds and array subscripts must be linear functions of iteration variables.
Legal affine loop:
for (int i = 0; i < N; i++)
    for (int j = i; j < 2*i + N; j++)
        A[i+j][2*i-j] = ...;
Iteration domain: D = { (i, j) | 0 ≤ i < N, i ≤ j < 2i + N }
Access function: f(i, j) = (i + j, 2i − j)
Illegal non-affine loop:
for (int i = 0; i < N; i++)
    for (int j = 0; j < i*i; j++) // i*i is quadratic, non-affine
        A[i*j] = ...; // i*j is quadratic, non-affine
The restriction seems strict, but in practice 95%+ of ML workloads satisfy affine constraints: core computations of matmul, conv, pooling, layer norm, softmax can all be expressed as affine loops.
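Affinity of a subscript function can be spot-checked numerically: an affine f satisfies f(x + y) − f(x) − f(y) + f(0) = 0 for all points x, y. The sketch below applies this identity over a sample grid (a heuristic illustration; real compilers check affinity symbolically on the expression tree):

```python
import itertools

# Numeric spot-check: an affine f satisfies
# f(x + y) - f(x) - f(y) + f(0) == 0 for all integer points x, y.
def looks_affine(f, dim, samples=range(-3, 4)):
    zero = (0,) * dim
    for x in itertools.product(samples, repeat=dim):
        for y in itertools.product(samples, repeat=dim):
            xy = tuple(a + b for a, b in zip(x, y))
            if f(*xy) - f(*x) - f(*y) + f(*zero) != 0:
                return False
    return True

print(looks_affine(lambda i, j: i + j, 2))        # True: affine
print(looks_affine(lambda i, j: 2*i - j + 5, 2))  # True: affine with constant
print(looks_affine(lambda i, j: i * j, 2))        # False: i*j is quadratic
```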
Dependence Relations
Dependence describes execution order constraints between two statement instances:
for (int i = 1; i < N; i++)
    A[i] = A[i-1] + 1; // S
Statement instance S(i) depends on instance S(i−1) because it reads A[i-1], which was written by the previous iteration. This is a flow dependence (RAW: Read After Write).
Represented as a dependence vector: d = (1), meaning “the current iteration depends on the previous iteration”.
In 2D iteration space:
for (int i = 1; i < N; i++)
    for (int j = 1; j < N; j++)
        A[i][j] = A[i-1][j-1] * 2; // S
Statement instance S(i, j) depends on instance S(i−1, j−1); the dependence vector is d = (1, 1) (a diagonal dependence).
Dependence types:
- Flow (RAW): write then read, must preserve order
- Anti (WAR): read then write, can be eliminated via register renaming
- Output (WAW): two writes, must preserve write order
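For constant-distance accesses, the dependence vector can be read off by brute force: enumerate iterations, record which iteration wrote each element, and subtract source from sink. A toy illustration for the diagonal example above (real compilers derive this symbolically):

```python
# Brute-force flow-dependence (RAW) vectors for A[i][j] = A[i-1][j-1] * 2.
def flow_dep_vectors(n):
    writer = {}   # array element -> iteration that last wrote it
    deps = set()
    for i in range(1, n):               # iterations in execution (lex) order
        for j in range(1, n):
            read = (i - 1, j - 1)       # element read this iteration
            if read in writer:
                wi, wj = writer[read]
                deps.add((i - wi, j - wj))  # sink iteration - source iteration
            writer[(i, j)] = (i, j)     # element (i,j) is written at iter (i,j)
    return deps

print(flow_dep_vectors(8))   # {(1, 1)}: the diagonal dependence
```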
Mathematical Representation of Loop Transformations
Affine Transformation Matrix
A loop transformation can be represented as an affine transformation T(i) = A·i + b,
where A is an integer matrix (a unimodular matrix, |det A| = 1, ensuring invertibility over the integers), and b is an offset vector.
Loop Interchange: swap loops 1 and 2 — A = [[0, 1], [1, 0]], mapping (i, j) → (j, i)
Skewing: tilt the iteration space along the diagonal, enabling wavefront parallelism — A = [[1, 0], [1, 1]], mapping (i, j) → (i, i + j)
Tiling: split into outer loops (tile indices) and inner loops (intra-tile indices), implemented via strip-mining + permutation:
(i, j) → (⌊i/T⌋, ⌊j/T⌋, i mod T, j mod T), where T is the tile size.
Transformation Legality
Legality criterion: a loop transformation is legal if and only if it preserves all dependence relations: for each dependence vector d, the transformed dependence vector A·d must be lexicographically non-negative.
Lexicographically non-negative: a vector v is lexicographically non-negative if v = 0 or its first non-zero component v_k satisfies v_k > 0.
Intuition: after transformation, the dependence source must still execute before the dependence target. If (A·d)_k > 0 and all earlier components are zero, the source lies in an earlier iteration of the k-th loop level, guaranteeing correct execution order.
Example: for dependence vector d = (1, 0) (a dependence along the i dimension):
- Loop interchange A = [[0, 1], [1, 0]]: A·d = (0, 1)
  Lexicographically non-negative (first component 0, second positive) → legal
- Skewing A = [[1, 0], [1, 1]]: A·d = (1, 1)
  Lexicographically non-negative (first component positive) → legal
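The legality check is entirely mechanical: multiply each dependence vector by the transformation matrix and test lexicographic non-negativity. A minimal sketch (loop reversal is added here as a hypothetical third transformation to show an illegal case):

```python
# A transformation is legal iff every transformed dependence vector
# A @ d is lexicographically non-negative.
def lex_nonneg(v):
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return True                  # all zero: same iteration, still legal

def transform_legal(A, deps):
    def apply(A, d):
        return [sum(a * x for a, x in zip(row, d)) for row in A]
    return all(lex_nonneg(apply(A, d)) for d in deps)

interchange = [[0, 1], [1, 0]]
skew        = [[1, 0], [1, 1]]
reverse_i   = [[-1, 0], [0, 1]]  # reversal of the i loop (hypothetical example)

print(transform_legal(interchange, [(1, 0)]))  # True:  (1,0) -> (0,1)
print(transform_legal(skew,        [(1, 0)]))  # True:  (1,0) -> (1,1)
print(transform_legal(reverse_i,   [(1, 0)]))  # False: (1,0) -> (-1,0)
```

Reversing the i loop would make the dependence sink run before its source, so the checker correctly rejects it.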
MLIR Affine Dialect
Core Concepts
MLIR’s Affine Dialect provides specialized IR for expressing affine loops and memory accesses.
Core operations:
- affine.for: affine loop
- affine.load / affine.store: memory access with affine subscripts
- affine.apply: apply an affine map
- affine.if: affine conditional branch
- affine.parallel: parallel loop
Affine Map: Affine maps are the core abstraction of the Affine Dialect, representing a set of affine expressions:
// (d0, d1)[s0] -> (d0 + s0, d1 * 2, d0 + d1)
// d0, d1 are dimension variables, s0 is a symbol
#map = affine_map<(d0, d1)[s0] -> (d0 + s0, d1 * 2, d0 + d1)>
Dimensions vs Symbols:
- Dimensions: correspond to loop iteration variables, treated as varying in dependence analysis
- Symbols: correspond to constant parameters (like matrix size N), treated as invariant in dependence analysis
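The semantics of applying an affine map are easy to mirror in Python: dimension values come from loop induction variables, symbol values from runtime-constant parameters. A sketch for the `#map` shown above:

```python
# Model of #map = affine_map<(d0, d1)[s0] -> (d0 + s0, d1 * 2, d0 + d1)>.
# Dims (d0, d1): loop induction variables; symbol s0: a runtime constant.
def apply_map(dims, syms):
    d0, d1 = dims
    (s0,) = syms
    return (d0 + s0, d1 * 2, d0 + d1)

print(apply_map((3, 4), (10,)))   # (13, 8, 7)
```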
Example: Matmul Affine IR
Naive matrix multiplication in Affine IR:
func.func @matmul(%A: memref<1024x1024xf32>,
                  %B: memref<1024x1024xf32>,
                  %C: memref<1024x1024xf32>) {
  %c0 = arith.constant 0.0 : f32
  affine.for %i = 0 to 1024 {
    affine.for %j = 0 to 1024 {
      affine.for %k = 0 to 1024 {
        %a = affine.load %A[%i, %k] : memref<1024x1024xf32>
        %b = affine.load %B[%k, %j] : memref<1024x1024xf32>
        %c = affine.load %C[%i, %j] : memref<1024x1024xf32>
        %prod = arith.mulf %a, %b : f32
        %sum = arith.addf %c, %prod : f32
        affine.store %sum, %C[%i, %j] : memref<1024x1024xf32>
      }
    }
  }
  return
}
Key information:
- All loop bounds are constants (0 to 1024), compiler can fully determine iteration space
- All memory access subscripts are affine expressions (%i, %k, %j)
- No side effects (other than memref accesses), facilitating analysis and transformation
Affine Pass Pipeline
MLIR provides a series of Affine optimization passes (in mlir-opt):
# Pipeline: fusion, tiling, unrolling, scratchpad copy, scalar replacement, cleanup
mlir-opt matmul.mlir \
  -affine-loop-fusion \
  -affine-loop-tile="tile-size=32" \
  -affine-loop-unroll="unroll-factor=4" \
  -affine-data-copy-generate \
  -affine-scalrep \
  -canonicalize -cse
-affine-loop-fusion: fuses adjacent affine loops, reducing memory round-trips
// Before
affine.for %i = 0 to N {
  %x = affine.load %A[%i]
  affine.store %x, %B[%i]
}
affine.for %i = 0 to N {
  %y = affine.load %B[%i]
  %z = arith.mulf %y, %c : f32
  affine.store %z, %C[%i]
}
// After
affine.for %i = 0 to N {
  %x = affine.load %A[%i]
  %z = arith.mulf %x, %c : f32
  affine.store %z, %C[%i]
}
// B's intermediate result is passed directly in a register; no memory write-back
-affine-loop-tile: performs tiling transformation
// Before
affine.for %i = 0 to 1024 {
  affine.for %j = 0 to 1024 {
    // body
  }
}
// After (tile-size=32)
affine.for %i0 = 0 to 1024 step 32 {
  affine.for %j0 = 0 to 1024 step 32 {
    affine.for %i1 = 0 to min(32, 1024-%i0) {
      affine.for %j1 = 0 to min(32, 1024-%j0) {
        // %i = %i0 + %i1, %j = %j0 + %j1
        // body
      }
    }
  }
}
-affine-data-copy-generate: generates explicit scratchpad buffers for tiles
// Before (tiled loop)
affine.for %i0 = 0 to 1024 step 32 {
  affine.for %k0 = 0 to 1024 step 32 {
    affine.for %i1 = 0 to 32 {
      affine.for %k1 = 0 to 32 {
        %a = affine.load %A[%i0+%i1, %k0+%k1] // Read from DRAM each time
        // ...
      }
    }
  }
}
// After
%A_tile = memref.alloc() : memref<32x32xf32> // L1 scratchpad
affine.for %i0 = 0 to 1024 step 32 {
  affine.for %k0 = 0 to 1024 step 32 {
    // DMA copy: A[i0:i0+32, k0:k0+32] -> A_tile
    affine.for %i1 = 0 to 32 {
      affine.for %k1 = 0 to 32 {
        %a = affine.load %A_tile[%i1, %k1] // Read from L1
        // ...
      }
    }
  }
}
This is crucial for explicitly managed memory hierarchies like GPU shared memory and NPU SRAM.
Dependence Analysis Infrastructure
MLIR Affine’s core value lies in its built-in dependence analysis framework.
The compiler solves the following problem with Presburger-arithmetic machinery (MLIR ships its own implementation, FlatAffineValueConstraints; classical polyhedral tools rely on the Integer Set Library, ISL):
Problem: given two memory accesses A[f₁(i)] and A[f₂(i′)], can they access the same address?
Mathematical formulation: solve the integer constraint system
f₁(i) = f₂(i′), with i ∈ D and i′ ∈ D
If a solution exists, there is a dependence; the dependence vector is d = i′ − i.
Example:
affine.for %i = 0 to N {
  affine.for %j = 0 to N {
    %1 = affine.load %A[%i + %j, %i - %j]
    %2 = affine.load %A[%j, %i]
    // Q: Can %1 and %2 access the same address?
  }
}
We need to solve (i + j, i − j) = (j, i), i.e., i + j = j and i − j = i
→ i = 0 and j = 0, so there is exactly one solution
→ a dependence exists, but only at the point (0, 0)
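The same conclusion falls out of brute-force enumeration over a bounded domain (what the solver establishes symbolically, for parametric N):

```python
# Enumerate the iteration domain and collect every point where the two
# accesses A[i+j, i-j] and A[j, i] hit the same element.
def alias_points(n):
    return [(i, j)
            for i in range(n)
            for j in range(n)
            if (i + j, i - j) == (j, i)]   # same subscript tuple?

print(alias_points(16))   # [(0, 0)]: only the origin aliases
```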
ISL uses parametric integer programming techniques to efficiently solve such problems. Dependence analysis results directly guide:
- Loop transformation legality checking: do all dependences satisfy lexicographic order constraints?
- Loop fusion determination: are there dependences preventing fusion?
- Vectorization determination: are there cross-iteration dependences in the innermost loop?
Practical Loop Transformations
1. Loop Interchange
Goal: improve cache locality, transform stride-N access to stride-1.
Applicable scenarios:
- Array access patterns don’t match memory layout (e.g., column-major access to row-major arrays)
- Vectorization needs (move vectorizable dimension to innermost loop)
MLIR example:
// Before
affine.for %i = 0 to M {
affine.for %j = 0 to N {
%a = affine.load %A[%j, %i] // stride-M access
// ...
}
}
// After (interchange)
affine.for %j = 0 to N {
affine.for %i = 0 to M {
%a = affine.load %A[%j, %i] // stride-1 access (assuming row-major)
// ...
}
}
Pass usage: via MLIR's affine loop utilities (e.g., permuteLoops) or manually with the Transform Dialect; upstream mlir-opt does not ship a dedicated interchange flag.
2. Loop Tiling
Goal: partition large iteration space into small blocks so each block’s data resides in cache.
Tile size selection principles:
- L1 tiling: tile size ~1/3 of L1 cache size (typically three arrays A, B, C)
- L2 tiling: tile outer loops again
- Register blocking: innermost micro-kernel tiling (e.g., 4×4 or 8×8)
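The L1 rule of thumb reduces to arithmetic: with three T×T f32 tiles (A, B, C) live at once, pick the largest power-of-two T whose footprint 3·T²·4 bytes fits the budget. A sketch (the cache sizes are assumptions, not probed values):

```python
# Largest power-of-two tile size T such that n_tiles T x T tiles of
# elem_bytes-wide elements fit the cache budget: n_tiles * T^2 * elem_bytes.
def pick_tile(cache_bytes, elem_bytes=4, n_tiles=3):
    t = 1
    while n_tiles * (2 * t) ** 2 * elem_bytes <= cache_bytes:
        t *= 2
    return t

print(pick_tile(32 * 1024))    # 32 KB L1 -> T = 32 (footprint 3*32^2*4 = 12 KB)
print(pick_tile(1024 * 1024))  # 1 MB L2  -> T = 256
```

Note the 32 KB case lands on a 12 KB footprint, roughly the "1/3 of L1" guideline above; leaving slack tolerates conflict misses and other resident data.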
MLIR Pass:
mlir-opt -affine-loop-tile="tile-sizes=32,32,32" matmul.mlir
Generated IR transforms three-level loops into six levels:
affine.for %i0 = 0 to 1024 step 32 {
  affine.for %j0 = 0 to 1024 step 32 {
    affine.for %k0 = 0 to 1024 step 32 {
      affine.for %i1 = ... {
        affine.for %j1 = ... {
          affine.for %k1 = ... {
            // body
          }
        }
      }
    }
  }
}
3. Loop Unrolling
Goal: reduce loop control overhead, expose instruction-level parallelism (ILP).
MLIR Pass:
mlir-opt -affine-loop-unroll="unroll-factor=4" matmul.mlir
// Before
affine.for %i = 0 to N {
  %x = affine.load %A[%i]
  affine.store %x, %B[%i]
}
// After (unroll-factor=4)
affine.for %i = 0 to N step 4 {
  %x0 = affine.load %A[%i]
  affine.store %x0, %B[%i]
  %x1 = affine.load %A[%i+1]
  affine.store %x1, %B[%i+1]
  %x2 = affine.load %A[%i+2]
  affine.store %x2, %B[%i+2]
  %x3 = affine.load %A[%i+3]
  affine.store %x3, %B[%i+3]
}
// + tail loop for N % 4
After unrolling, CPU can execute multiple independent loads in parallel, hiding memory latency.
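The unroll-plus-tail structure translates directly into any language; here is a Python sketch of the transformed copy loop (a model of the transformation, not how MLIR emits it):

```python
# Copy B[i] = A[i], unrolled by 4, with a tail loop for the remainder.
def copy_unrolled(A, n):
    B = [0] * n
    main = n - n % 4
    i = 0
    while i < main:          # unrolled body: 4 independent load/store pairs
        B[i]     = A[i]
        B[i + 1] = A[i + 1]
        B[i + 2] = A[i + 2]
        B[i + 3] = A[i + 3]
        i += 4
    while i < n:             # tail loop handles the n % 4 leftover iterations
        B[i] = A[i]
        i += 1
    return B

print(copy_unrolled(list(range(10)), 10))   # same contents as the input
```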
4. Loop Fusion
Goal: reduce intermediate array memory round-trips, improve temporal locality.
// Before
func.func @separate_loops(%A: memref<1024xf32>) {
  %B = memref.alloc() : memref<1024xf32>
  affine.for %i = 0 to 1024 {
    %a = affine.load %A[%i]
    %b = arith.mulf %a, %a : f32
    affine.store %b, %B[%i]
  }
  affine.for %i = 0 to 1024 {
    %b = affine.load %B[%i]
    %c = arith.addf %b, %b : f32
    affine.store %c, %A[%i]
  }
  return
}
// After (-affine-loop-fusion)
func.func @fused_loops(%A: memref<1024xf32>) {
  affine.for %i = 0 to 1024 {
    %a = affine.load %A[%i]
    %b = arith.mulf %a, %a : f32
    %c = arith.addf %b, %b : f32
    affine.store %c, %A[%i]
  }
  return
}
// %B completely eliminated; the intermediate result stays in a register
Fusion legality conditions:
- Both loops have identical iteration domains (or can be aligned)
- Second loop cannot depend on cross-iteration data written by first loop
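The semantic equivalence of the fused form can be modeled in Python: the unfused version materializes the intermediate B, the fused version keeps it in a local value (a "register"):

```python
# Unfused: square into a temporary B, then double B back out.
def unfused(A):
    B = [a * a for a in A]       # loop 1: B[i] = A[i] * A[i]
    return [b + b for b in B]    # loop 2: out[i] = B[i] + B[i]

# Fused: one loop; the intermediate never touches memory.
def fused(A):
    out = []
    for a in A:
        b = a * a                # former loop-1 body
        out.append(b + b)        # former loop-2 body; B never materialized
    return out

print(fused([1.0, 2.0, 3.0]))   # [2.0, 8.0, 18.0]
```

Fusion is legal here because loop 2 reads only B[i], written by the same iteration i of loop 1 — there is no cross-iteration dependence between the two loops.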
Polyhedral Applications & Limitations in ML
Application Scenarios
1. High-dimensional Tensor Operations
Conv, pooling, batch norm, etc., are high-dimensional affine loops where polyhedral optimization can automatically find optimal tiling strategies.
2. Tensor Contraction
Einstein summation (einsum) and generalized matrix multiplication (GEMM) are ideal application scenarios for polyhedral models. MLIR Linalg Dialect is designed based on this idea.
3. Automatic Scheduling
Projects like Tiramisu and Tensor Comprehensions attempt to use polyhedral models to automatically generate high-performance kernels.
4. NPU/Accelerator Codegen
Many custom hardware (TPU, Graphcore IPU, Cerebras WSE) have explicit memory hierarchies (SRAM, DRAM) and DMA control. Affine Dialect’s affine.dma_start / affine.dma_wait can directly model such hardware.
Limitations
1. Non-affine Code
Polyhedral model cannot handle:
- Indirect accesses (A[B[i]])
- Data-dependent control flow (if (A[i] > 0))
- Sparse matrix operations
These scenarios require other techniques (e.g., inspector-executor, sparse polyhedral model).
2. Compilation Time
ISL solving ILP problems can have high complexity (worst-case NP-hard). For ultra-large models (100+ layers), polyhedral analysis may become a compilation bottleneck.
3. Huge Heuristic Search Space
The combination space of tiling sizes, loop orders, unroll factors, etc., is exponential. While auto-tuning tools exist (e.g., Pluto), significant trial-and-error is still required.
4. Interaction with Other Optimizations
Polyhedral optimization interactions with backend optimizations (register allocation, instruction scheduling, vectorization) are complex. Over-unrolling may cause register spilling; over-tiling may increase loop control overhead.
MLIR Transform Dialect Preview
MLIR’s Transform Dialect provides a programmable transformation framework, allowing users to describe optimization strategies using MLIR IR itself.
// Transform script
transform.sequence failures(propagate) {
^bb0(%arg0: !transform.any_op):
  %matmul = transform.structured.match ops{["linalg.matmul"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
  // Tile the outer two levels (32x32)
  %tiled, %loops:2 = transform.structured.tile %matmul [32, 32, 0]
    : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op)
  // Vectorize the innermost level
  %vectorized = transform.structured.vectorize %tiled
    : (!transform.any_op) -> !transform.any_op
  transform.yield
}
This declaratively expresses matmul tiling + vectorization strategy as IR, enabling:
- Version management and reuse (save tuning strategies as .mlir files)
- Python bindings for autotuning
- Pattern matching to apply different strategies to different IR structures
Transform Dialect is a core innovation distinguishing MLIR from traditional compilers, making compilation strategies first-class citizens rather than hard-coded passes.
Summary
Polyhedral optimization is one of the most powerful compiler techniques, formalizing loop transformations as integer linear programming problems, enabling automated optimization.
Core ideas:
- Iteration space modeling: represent nested loops as polyhedra
- Dependence analysis: describe data dependences with affine constraints
- Transformation legality: ensure correctness via lexicographic ordering checks
- Optimization goals: optimize cache locality and parallelism through tiling, interchange, fusion, etc.
MLIR Affine Dialect value:
- Provides unified IR representation (affine.for, affine.load)
- Built-in dependence analysis framework (its own Presburger-arithmetic library)
- Composable pass pipeline (tiling → fusion → unroll → vectorize)
- Transform Dialect makes optimization strategies programmable
In ML compilers, polyhedral optimization is mainly applied to dense tensor operation kernel generation (matmul, conv, pooling). Together with other techniques (operator fusion, memory planning, instruction selection), it forms a complete compilation pipeline.
Further Reading
Tutorials:
- MLIR Affine Dialect Documentation
- MLIR: Scaling Compiler Infrastructure for Domain Specific Computation (paper)
Classic Papers:
- A Practical Automatic Polyhedral Parallelizer (Pluto)
- Optimizing Compilers for Modern Architectures (Allen & Kennedy classic textbook)
Advanced Topics:
- Polly - Polyhedral optimizations for LLVM
- ISL: An Integer Set Library for the Polyhedral Model
- MLIR Transform Dialect Tutorial
- Tiramisu: A Polyhedral Compiler for Dense and Sparse Deep Learning
Related Tools: