
Autotuning and End-to-End Practice

Updated 2026-04-13


Introduction

Over the course of the previous 16 articles, we have traversed the complete ML compiler stack: from graph capture to IR design, from optimization passes to operator fusion, from tiling and memory optimization to code generation, and finally to scheduling and execution optimization. Each step addresses a core question: how to make GPUs execute deep learning computations more efficiently.

But after all these optimizations are in place, one ultimate challenge remains: How does the compiler know which parameter combinations are optimal?

A Triton matmul kernel has 5-8 tunable parameters (BLOCK_M, BLOCK_N, BLOCK_K, num_warps, num_stages, etc.), each with 3-6 reasonable values. Combined, these may produce thousands of configurations. On an A100, the optimal configuration might be BLOCK_M=128, BLOCK_N=128, num_warps=4; but on an H100, with its larger SMEM and different Tensor Core architecture, the optimum might shift to BLOCK_M=256, BLOCK_N=128, num_stages=5. Static cost models cannot fully capture these hardware differences — SMEM bank conflicts, L2 cache behavior, and TLB misses are microarchitectural details that are extremely difficult to model precisely.

Autotuning addresses this by trial-running candidate configurations on actual hardware, using real benchmark results to make decisions. While this might seem brute-force, it is the most reliable engineering approach in practice. Triton’s @triton.autotune, TVM’s AutoScheduler (Ansor), and MLIR’s Transform Dialect are all different implementations of this philosophy.

As the capstone article of the entire Graph Compilation and Optimization learning path, this article will dive deep into the principles and practice of autotuning, introduce MLIR Transform Dialect’s programmable scheduling paradigm, share practical debugging techniques for torch.compile, and finally link all 17 articles together through a complete end-to-end case study.

Why Autotuning Is Necessary

The Combinatorial Explosion Problem

Let us quantify the scale of the search space. For a typical matrix multiplication kernel, the tunable parameters include:

Parameter    Meaning                  Typical Values      Options
BLOCK_M      M-dimension tile size    32, 64, 128, 256    4
BLOCK_N      N-dimension tile size    32, 64, 128, 256    4
BLOCK_K      K-dimension tile size    16, 32, 64          3
num_warps    Number of warps          2, 4, 8             3
num_stages   Pipeline stages          2, 3, 4, 5          4

These 5 parameters alone produce 4 × 4 × 3 × 3 × 4 = 576 combinations. Adding parameters like SPLIT_K (2-8) and GROUP_M (1-8) easily pushes the search space beyond 10,000. A complete model may contain dozens of kernel shapes, each requiring independent tuning — this is a classic combinatorial explosion problem.
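To make the combinatorics concrete, here is a small Python sketch that enumerates the space; the value lists simply mirror the table above.

from itertools import product

# Candidate values, mirroring the table above.
space = {
    "BLOCK_M": [32, 64, 128, 256],
    "BLOCK_N": [32, 64, 128, 256],
    "BLOCK_K": [16, 32, 64],
    "num_warps": [2, 4, 8],
    "num_stages": [2, 3, 4, 5],
}

configs = [dict(zip(space, values)) for values in product(*space.values())]
print(len(configs))  # 4 * 4 * 3 * 3 * 4 = 576

# Adding SPLIT_K and GROUP_M (4 values each here) multiplies the space by 16x.
space["SPLIT_K"] = [1, 2, 4, 8]
space["GROUP_M"] = [1, 2, 4, 8]
print(len(list(product(*space.values()))))  # 9216; a few more values per knob pushes past 10,000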

Hardware Differences Cannot Be Ignored

Even within the same GPU generation, microarchitectural differences between models significantly affect optimal configurations:

  • A100 (80GB): 192 KB combined L1/SMEM per SM (up to 164 KB configurable as SMEM), 4 warp schedulers, 108 SMs
  • H100 (80GB): 228 KB SMEM per SM, 4 warp schedulers + TMA engine, 132 SMs
  • MI300X: 64 KB LDS per CU, different wavefront scheduling, 304 CUs

SMEM size directly determines the upper bound on tile sizes: larger SMEM allows larger tiles, reducing global memory access frequency. H100’s TMA (Tensor Memory Accelerator) allows more pipeline stages to benefit, since TMA asynchronous prefetching does not consume warp resources. On AMD MI300X, the LDS (Local Data Share, equivalent to SMEM) size and bank structure are entirely different, requiring independent tuning.
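A quick back-of-the-envelope check shows why SMEM capacity caps tile sizes. The footprint model below is a deliberate simplification (one A tile plus one B tile held per pipeline stage, ignoring padding and epilogue buffers), not a precise accounting.

# Rough SMEM-footprint check for a matmul tile (illustrative model only):
# each pipeline stage holds one A tile (BLOCK_M x BLOCK_K) and one B tile
# (BLOCK_K x BLOCK_N) in shared memory.
def smem_bytes(block_m, block_n, block_k, num_stages, dtype_bytes=2):
    per_stage = (block_m * block_k + block_k * block_n) * dtype_bytes
    return per_stage * num_stages

# 128x128x32, fp16, 4 stages -> 64 KB: fits comfortably on A100 and H100.
print(smem_bytes(128, 128, 32, 4) // 1024)   # 64
# 256x256x64, fp16, 4 stages -> 256 KB: exceeds even H100's 228 KB SMEM.
print(smem_bytes(256, 256, 64, 4) // 1024)   # 256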

Limitations of Cost Models

In theory, we could build an analytical cost model to predict kernel performance without actual execution. In practice, this approach faces three major challenges:

  1. Cache behavior is hard to model: L2 cache hit rates depend on tile traversal order, interference from concurrent kernels, and hardware prefetcher behavior — these factors interact in complex ways.

  2. Instruction-level pipeline effects: The compiler backend’s instruction scheduling produces different pipeline stall patterns for different tile sizes. For example, a seemingly larger tile might cause register spilling due to register pressure, and spilling to local memory has far higher latency than register access.

  3. Bank conflicts and SMEM padding: SMEM bank conflicts depend on the precise alignment of tile layout and access patterns. A BLOCK_K=32 configuration might have zero bank conflicts, while BLOCK_K=64 might suffer 4-way conflicts due to strided access — yielding up to 30% performance difference.

Therefore, the most effective approach in practice is: use cost models for initial pruning to narrow the search space to a reasonable range, then use autotuning for the final selection.
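As an illustration of this prune-then-measure flow, the sketch below filters the configs list from the enumeration sketch above using the smem_bytes helper and a deliberately crude ranking heuristic (not Triton's or Inductor's actual cost model); only the survivors would be benchmarked on real hardware.

# Sketch: prune with a crude analytic filter, then benchmark the survivors.
def prune(configs, smem_limit=164 * 1024, top_k=20):
    # Keep only configurations whose tiles fit in shared memory (A100 limit assumed).
    feasible = [c for c in configs
                if smem_bytes(c["BLOCK_M"], c["BLOCK_N"], c["BLOCK_K"],
                              c["num_stages"]) <= smem_limit]
    # Arbitrary heuristic: prefer large tiles per warp (illustrative only).
    score = lambda c: c["BLOCK_M"] * c["BLOCK_N"] / (c["num_warps"] * 32)
    return sorted(feasible, key=score, reverse=True)[:top_k]

candidates = prune(configs)                      # cheap, no GPU needed
# best = min(candidates, key=benchmark_on_gpu)   # expensive step, now only ~20 runs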

Triton’s Autotune Mechanism

The @triton.autotune Decorator

Triton provides an elegant autotuning API. Developers simply declare candidate configurations with the @triton.autotune decorator:

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 32},
                      num_warps=4, num_stages=3),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 32},
                      num_warps=8, num_stages=3),
        triton.Config({'BLOCK_M': 256, 'BLOCK_N': 128, 'BLOCK_K': 32},
                      num_warps=8, num_stages=3),
        triton.Config({'BLOCK_M': 256, 'BLOCK_N': 256, 'BLOCK_K': 64},
                      num_warps=8, num_stages=4),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 32},
                      num_warps=4, num_stages=5),
    ],
    key=['M', 'N', 'K'],  # Re-tune when these dimensions change
)
@triton.jit
def matmul_kernel(
    A, B, C,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr,
):
    # kernel implementation...

The key parameter is critical: it specifies which runtime parameters affect the optimal configuration. When M, N, K change (for example, switching from training’s large batches to inference’s small batches), Triton re-runs autotuning. The winning configuration is cached per key tuple for the lifetime of the process, while the compiled kernel binaries are cached on disk (by default under ~/.triton/cache/), so later runs avoid recompilation.
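For completeness, here is a hedged host-side wrapper showing how such an autotuned kernel is typically launched. The wrapper name and grid computation are illustrative, not from the kernel above; the point is that the meta-parameters never appear at the call site.

# Hypothetical host-side wrapper for the kernel above. Note that BLOCK_M /
# BLOCK_N / BLOCK_K / num_warps / num_stages are NOT passed here: the
# autotuner injects the winning Config, and the grid lambda reads it via META.
import torch
import triton

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=a.dtype)
    grid = lambda META: (triton.cdiv(M, META["BLOCK_M"]) *
                         triton.cdiv(N, META["BLOCK_N"]),)
    matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
    )
    return c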

Warmup and Benchmark Flow

The actual execution flow of Triton autotune is as follows:

  1. Compile all candidate kernels: Each Config is fully compiled to PTX then cubin. This can be expensive — 5 configs x 2 seconds/config = 10 seconds of compile time.
  2. Warmup runs: Each candidate kernel is warmed up (by default for roughly 25 ms of repeated launches) to stabilize GPU state (fill caches, ramp clock frequencies).
  3. Benchmark runs: Each candidate is then timed (by default for roughly 100 ms of repeated launches), and the measurements are reduced to a single representative time.
  4. Select best: Compare the measured times across all configurations and select the fastest.
  5. Cache result: Record the (key, best_config) mapping so the same shape is not re-benchmarked in this process; the compiled binaries persist in the on-disk kernel cache.

On subsequent calls with the same shape, the optimal configuration is read directly from cache, skipping all compilation and benchmarking overhead.
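Conceptually, the flow above boils down to a loop like the following sketch. It uses triton.testing.do_bench for timing; run_with(cfg) is an assumed helper that launches the kernel with a given configuration, and the real autotuner is more elaborate than this.

# Conceptual re-implementation of steps 1-5 above (not Triton's actual code).
import triton.testing

def autotune_once(candidate_configs, run_with):
    results = []
    for cfg in candidate_configs:
        # do_bench handles warmup and timed repetitions internally (returns ms)
        ms = triton.testing.do_bench(lambda: run_with(cfg))
        results.append((ms, cfg))
    best_ms, best_cfg = min(results, key=lambda r: r[0])
    return best_cfg, best_ms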

Compilation Overhead and Caching Strategies

A major pain point of autotuning is first-time compilation. For a model containing 50 different shapes:

  • Each shape x 5 candidate configs = 250 compilations
  • Each compilation takes ~1-3 seconds (Triton -> TTIR -> TTGIR -> LLVM IR -> PTX -> cubin)
  • Total: 250-750 seconds (4-12 minutes) of first-time compilation

This is unacceptable in production environments. Practical solutions include:

  • Ahead-of-time (AOT) compilation: Pre-compile all candidate configurations before deployment
  • Cache warming: Distribute autotuning and compilation caches as part of the model artifact (see the sketch after this list)
  • Configuration inheritance: Reuse optimal configs for similar shapes, only re-tuning for significantly different cases
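One way to realize the cache-warming idea above is sketched here. It reuses the matmul wrapper from the earlier launch sketch; TRITON_CACHE_DIR and the artifact path are assumptions to verify against your Triton version and deployment setup, and re-tuning may still run once per process, but it is far cheaper when the compiled binaries are already on disk.

# Sketch of cache warming at image-build time (assumed deployment flow).
import os
os.environ["TRITON_CACHE_DIR"] = "/artifacts/triton_cache"  # assumed path; set before Triton compiles anything

import torch

# Expected serving shapes (assumed); invoke each once at build time so that
# compilation (the dominant cost) happens here instead of at serving time.
shapes = [(1, 4096, 4096), (8, 4096, 4096), (32, 4096, 4096)]
for M, N, K in shapes:
    a = torch.randn(M, K, device="cuda", dtype=torch.float16)
    b = torch.randn(K, N, device="cuda", dtype=torch.float16)
    matmul(a, b)  # triggers compile + autotune; binaries land in the cache dir
torch.cuda.synchronize()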

The interactive component below lets you explore how different parameter combinations affect performance:

[Interactive component: Autotune Search Space Explorer. It shows a TFLOPS heatmap over BLOCK_M x BLOCK_N (other parameters fixed) for a MatMul (M=N=K=4096) or Attention (B=32, S=512, D=64) kernel, and lets you compare Grid Search, Random Search, and Bayesian Optimization over the 576-configuration space. In the sample data, the best configuration (~312 TFLOPS) sits at BLOCK_M=128, BLOCK_N=128.]

In this heatmap, you can observe several key patterns:

  1. The sweet spot is typically at medium tile sizes (e.g., 128x128) — too small leads to low Tensor Core utilization (insufficient compute per warp), too large reduces occupancy (excessive SMEM usage)
  2. num_warps and tile size should match: Large tiles need more warps for parallel processing; too many warps for small tiles wastes resources
  3. num_stages matters more for memory-bound kernels: Pipeline prefetching hides latency while waiting for global memory

Search Strategies

Grid Search

The simplest strategy is to exhaustively evaluate all candidate configurations. The advantage is a guaranteed global optimum; the disadvantage is that the time cost grows linearly with the size of the search space. For 576 configurations at 0.5 seconds per benchmark, the total is about 5 minutes. This is acceptable during development but too slow for deployment across multiple hardware targets.

Triton’s @triton.autotune is essentially Grid Search — it iterates through all manually specified Config entries. Developers typically pre-filter 5-15 “reasonable” configurations based on experience, rather than enumerating all permutations.

Random Search

The classic paper by Bergstra & Bengio (2012) demonstrated that for high-dimensional parameter spaces, random search is typically more efficient than grid search. The reason is that parameter influence is uneven: perhaps only BLOCK_M and BLOCK_N decisively affect performance, while num_stages has only a minor impact. Grid Search wastes many sample points on unimportant dimensions, while Random Search achieves denser coverage along the important ones.

In practice, random sampling is remarkably effective: 30 samples hit a top-5% configuration with probability 1 - 0.95^30 ≈ 79%, and roughly 60 samples push that above 95%. Early versions of TVM (AutoTVM) relied heavily on this strategy.
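A minimal random-search tuner looks like the sketch below; benchmark(cfg) is an assumed callable that returns a runtime in milliseconds, and space is a dictionary of candidate values like the one enumerated earlier.

# Minimal random-search tuner (sketch): sample 30 configurations from the
# full space and benchmark only those.
import random
from itertools import product

def random_search(space, benchmark, n_samples=30, seed=0):
    rng = random.Random(seed)
    all_configs = [dict(zip(space, v)) for v in product(*space.values())]
    sampled = rng.sample(all_configs, min(n_samples, len(all_configs)))
    timed = [(benchmark(cfg), cfg) for cfg in sampled]
    return min(timed, key=lambda t: t[0])   # (best_ms, best_config)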

Bayesian Optimization

Bayesian Optimization (BO) is a smarter sequential search strategy:

  1. Initial sampling: Randomly select 5-10 configurations for benchmarking
  2. Build surrogate model: Fit a Gaussian Process (GP) or Tree-structured Parzen Estimator (TPE) to observed (configuration, performance) data
  3. Acquisition function: Use the surrogate model’s predicted mean and uncertainty to select the next configuration most likely to improve results — balancing “exploit” (search near known good regions) and “explore” (investigate unknown regions)
  4. Iterate: Benchmark the new configuration, update the surrogate model, repeat until budget is exhausted

BO’s advantage is sample efficiency: typically only 15-30 benchmarks are needed to find a near-optimal configuration, while Grid Search may require hundreds. The overhead lies in training the surrogate model and solving the acquisition function — for low-dimensional spaces (5-8 parameters), this cost is negligible.
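The loop above can be sketched in a few dozen lines. Here configs is a list of parameter dictionaries (as enumerated earlier), benchmark(cfg) is an assumed timing callable, and the Gaussian Process surrogate from scikit-learn with a lower-confidence-bound acquisition is one reasonable choice among many, not any framework's actual tuner.

# Sketch of Bayesian optimization over a discrete configuration list.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt(configs, benchmark, n_init=5, n_iter=20):
    encode = lambda c: [c["BLOCK_M"], c["BLOCK_N"], c["BLOCK_K"],
                        c["num_warps"], c["num_stages"]]
    X_all = np.array([encode(c) for c in configs], dtype=float)

    rng = np.random.default_rng(0)
    tried = list(rng.choice(len(configs), size=n_init, replace=False))
    times = [benchmark(configs[i]) for i in tried]          # initial random samples

    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X_all[tried], times)
        mu, sigma = gp.predict(X_all, return_std=True)
        # Lower confidence bound: prefer low predicted time and high uncertainty.
        acq = mu - 1.5 * sigma
        acq[tried] = np.inf                                  # never re-test a configuration
        nxt = int(np.argmin(acq))
        tried.append(nxt)
        times.append(benchmark(configs[nxt]))

    best = int(np.argmin(times))
    return configs[tried[best]], times[best]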

Cost-Model-Guided Search and Transfer Learning

TVM’s AutoScheduler (Ansor) uses a hybrid strategy combining cost-model-guided search with evolutionary algorithms. Ansor first uses a learned cost model (based on XGBoost) to predict performance, generating many candidate schedules, then only benchmarks the top-k. This approach finds near-optimal solutions among 5000+ possible schedules while benchmarking only about 100.

More advanced methods leverage transfer learning to accelerate tuning. The core observation is that while optimal configurations differ across hardware, the relative ranking of configurations is correlated. If configuration A is 20% faster than B on A100, then A is likely also faster than B on H100 (though the magnitude may differ).

TVM’s MetaSchedule exploits this property: after tuning on one GPU, it transfers the cost model to a new GPU as initialization, requiring only a few benchmarks to adapt. This reduces cross-hardware tuning time from hours to minutes.

MLIR Transform Dialect

The Programmable Scheduling Paradigm

The autotuning strategies discussed above (Grid Search, BO, etc.) all tune a set of fixed-dimensional numerical parameters (tile size, warp count, etc.). But many compiler optimization decisions are structural: should two operations be fused? Tile first or vectorize first? Which loop permutation to choose?

MLIR’s Transform Dialect offers a fundamentally different approach: expressing optimization strategies as IR. Developers write a “schedule script” using Transform Dialect operations to declare optimization steps. The compiler mechanically executes transformations according to this schedule, without relying on heuristic rules.

This paradigm is directly inspired by Halide’s schedule language, but Transform Dialect, as part of MLIR, has several unique advantages:

  1. Type safety: Each transform op has a precise type signature; the compiler can verify schedule legality before execution
  2. Composability: Multiple transforms compose freely, with explicit preconditions and postconditions for each
  3. Debuggability: Every step of schedule execution is traceable; on failure, the exact transform op that failed can be pinpointed

Core Transform Operations

Key Transform Dialect operations include:

  • transform.structured.match: Matches target operations (e.g., linalg.matmul) in the IR, returning a handle
  • transform.structured.tile_using_for: Tiles the matched operation, generating scf.for loops
  • transform.structured.fuse_into_containing_op: Fuses one operation into another’s loop body — this is key to epilogue fusion
  • transform.structured.vectorize: Converts scalar operations to vector operations (e.g., linalg.matmul to vector.contract)
  • transform.bufferization.one_shot_bufferize: Converts tensor-semantic IR to buffer (memref) semantics — the critical step from mathematical abstraction to actual memory operations

The interactive component below demonstrates how three different schedules progressively optimize the same matmul + relu computation:

[Interactive demo: MLIR Transform Dialect. Three schedules (Tile Only, Tile + Fuse, Tile + Fuse + Vectorize) are applied to a matmul + relu function on 512x512 tensors. With the Tile Only schedule, the matmul is tiled at 128x128x32 but the relu remains a separate operation, so C is written to HBM and re-read by the relu, wasting roughly 2x bandwidth and reaching only ~60% of estimated peak performance.]

Complementarity with Polyhedral

Transform Dialect and polyhedral compilation are complementary optimization approaches:

  • Polyhedral: Automatically analyzes loop dependencies to find optimal tile/permute/parallelize strategies. Advantage: fully automatic. Disadvantage: high analysis complexity, may not find optimal solutions
  • Transform Dialect: Developer explicitly specifies optimization strategies. Advantage: controllable, debuggable. Disadvantage: requires human expertise

In practice, the most effective approach is: use Polyhedral analysis to suggest schedules, then use Transform Dialect to execute and fine-tune them. IREE (Google’s ML compiler) does exactly this: its codegen pipeline uses Transform Dialect to drive the entire lowering process from linalg-on-tensor to vector to GPU.

Practical Application in IREE

IREE’s use of Transform Dialect demonstrates this technology’s maturity in production environments. A typical IREE codegen pipeline includes:

  1. Tile to workgroups: Partition computation into GPU workgroup-level tiles
  2. Tile to threads: Further tile within workgroups to thread level
  3. Vectorize: Convert scalar loop bodies to vector operations
  4. Bufferize: Convert from tensor semantics to memref semantics
  5. Map to GPU: Map loops to GPU blockIdx/threadIdx

Each step is a Transform Dialect operation, making the entire process fully declarative and reproducible. This makes debugging and performance analysis very intuitive — you can dump IR between any two steps to inspect intermediate results.

Compilation Debugging in Practice

torch.compile Debugging Tools

When using torch.compile in practice, the most common problem is not “how to make it faster” but “why isn’t it reaching expected speed.” PyTorch provides a comprehensive debugging toolkit:

Environment variable debugging:

# View TorchDynamo graph capture logs
import logging
import torch._dynamo
torch._dynamo.config.log_level = logging.DEBUG  # older API; recent releases steer you toward TORCH_LOGS

# View TorchInductor code generation
# Set env var: TORCH_LOGS="output_code"

# View graph break reasons
# Set env var: TORCH_LOGS="graph_breaks"

# View complete compilation logs
# Set env var: TORCH_LOGS="+dynamo,+inductor"

The torch._dynamo.explain() utility:

import torch._dynamo

model = MyModel()
explanation = torch._dynamo.explain(model)(input_tensor)
print(explanation)
# Output: graph break locations, reasons, and op statistics for each subgraph

compiler.disable() for precise isolation:

@torch.compiler.disable
def problematic_function(x):
    # This function will not be compiled
    return x.numpy()  # Example: numpy conversion causing graph break

Common Pitfalls

1. Graph Breaks

Graph breaks are the most common performance killer in torch.compile. When TorchDynamo encounters Python operations it cannot trace, it splits the computation graph into multiple subgraphs, each compiled and executed independently. Common graph break triggers include:

  • print() calls (including debug prints)
  • .item() or .numpy() conversions
  • Data-dependent control flow (e.g., if x.sum() > 0)
  • Unsupported third-party library calls
  • Custom torch.autograd.Function implementations

2. Dynamic Shape Recompilation

When shapes change, torch.compile by default recompiles for each new shape. If batch sizes vary frequently (e.g., different inference requests), this can cause massive recompilation overhead. Solutions include using torch.compile(dynamic=True) to enable dynamic shape support, or using torch._dynamo.mark_dynamic() to mark specific dynamic dimensions.
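A short sketch of both options follows, using a plain Linear layer as a stand-in for a real model.

import torch

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for a real model

# Option 1: let torch.compile generalize over shapes automatically.
compiled_dyn = torch.compile(model, dynamic=True)

# Option 2: compile normally, but mark only the batch dimension as dynamic,
# so the remaining dimensions can still be specialized.
compiled = torch.compile(model)
x = torch.randn(8, 1024, device="cuda")
torch._dynamo.mark_dynamic(x, 0)   # dim 0 (batch) may vary between calls
out = compiled(x)                  # traced with a symbolic batch dimension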

3. Excessive Compilation Time

First-time compilation (including autotuning) can take several minutes. For production serving:

  • Use torch._inductor.config.max_autotune = False to disable exhaustive autotuning
  • Use torch.compile(mode="reduce-overhead") to balance compilation time and runtime performance
  • Pre-compile models and cache compilation results

4. Numerical Precision Issues

Compilation optimizations (especially operator fusion and instruction reordering) may change the order of floating-point operations, causing minor numerical differences. For most training scenarios this is not an issue, but for applications sensitive to numerical precision (e.g., RL reward shaping):

  • Use torch.compile(mode="default") rather than more aggressive modes such as "max-autotune", whose kernel selection is more likely to perturb numerics
  • Verify pre/post-compilation output consistency with torch.testing.assert_close(), as sketched below
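A minimal consistency check might look like this; the tolerances are illustrative, not prescriptive, and the Linear layer is a stand-in for a real model.

import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()   # stand-in for a real model
compiled = torch.compile(model)                      # default mode
x = torch.randn(8, 1024, device="cuda")

with torch.no_grad():
    ref = model(x)        # eager reference
    out = compiled(x)     # compiled result

# Tune rtol/atol for your dtype and workload.
torch.testing.assert_close(out, ref, rtol=1e-3, atol=1e-3)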

Performance Analysis Tools

Locating performance bottlenecks requires multi-level tools:

PyTorch Profiler: End-to-end trace analysis, showing kernel-level execution times:

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    with_stack=True,
) as prof:
    compiled_model(input)
print(prof.key_averages().table(sort_by="cuda_time_total"))

Triton Benchmark: Precisely measures individual kernel performance:

import triton.testing

@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=['M'],
        x_vals=[512 * i for i in range(1, 17)],
        line_arg='provider',
        line_vals=['triton', 'cublas'],
        line_names=['Triton', 'cuBLAS'],
        ylabel='TFLOPS',
    )
)
def benchmark(M, provider):
    ...

NVIDIA Nsight Compute: The lowest-level GPU performance analysis tool, providing warp occupancy, SMEM throughput, L2 cache hit rate, and other microarchitectural metrics. When Triton kernel performance falls short of cuBLAS, Nsight Compute is the key tool for identifying the gap.

End-to-End Practice: torch.compile on a Transformer Layer

Now let us thread all 17 articles together, tracing a Transformer layer’s complete journey from Python code to GPU execution.

Step 1: User Code (Articles 1-2)

import torch

class TransformerLayer(torch.nn.Module):
    def __init__(self, d_model=1024, nhead=16):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d_model, 4096),
            torch.nn.GELU(),
            torch.nn.Linear(4096, d_model),
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        x = x + self.ff(self.norm2(x))
        return x

model = TransformerLayer().cuda().half()
compiled = torch.compile(model, mode="max-autotune")

After calling torch.compile, TorchDynamo traces forward() execution through Python frame evaluation hooks (Article 2: Graph Capture & Dynamo). It produces an FX Graph — a directed acyclic graph containing all operations and their dependencies.

Steps 2-3: IR and Optimization Passes (Articles 3-7)

The FX Graph is passed to TorchInductor, which first applies a series of optimization passes (Articles 5-7):

  • Constant folding: Pre-computes weight matrix transposes (if column-major layout)
  • Dead Code Elimination: Removes the attn_weights returned by MultiheadAttention (since it is unused)
  • Layout optimization: Converts weight tensors from (out, in) to (in, out) or channels-last format to match Tensor Core access patterns
  • Pattern matching: Identifies LayerNorm + Residual Add patterns, merging them into a single fused kernel

In the MLIR framework (Articles 3-4), this corresponds to progressive lowering from linalg dialect to scf/vector dialect.

Steps 4-5: Operator Fusion and Tiling (Articles 8-11)

TorchInductor’s fusion engine identifies the following fusion opportunities (Articles 8-9):

  • Pointwise fusion: GELU activation fused onto Linear output
  • Reduction fusion: LayerNorm’s mean/variance computation merged with subsequent normalization
  • Epilogue fusion: Residual Add fused onto attention and FFN output MatMuls

Then Tiling is applied (Articles 10-11):

  • MatMul is tiled to 128x128x32 blocks, mapped to the GPU’s HBM -> SMEM -> Register memory hierarchy
  • For dynamic batch sizes, symbolic shapes generate parameterized tile boundaries

Step 6: Code Generation (Articles 12-13)

Fused and tiled operations are converted to Triton kernel code (Articles 12-13):

# Triton kernel generated by TorchInductor (simplified)
@triton.jit
def fused_attention_residual(
    Q, K, V, residual, output,
    stride_qm, stride_qk,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
):
    pid_m = tl.program_id(0)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    # Load Q tile from HBM to registers
    q = tl.load(Q + offs_m[:, None] * stride_qm)
    # Compute attention scores, softmax, weighted sum
    # ... (FlashAttention-style tiled computation)
    # Fused residual add (epilogue fusion!)
    res = tl.load(residual + offs_m)
    output_val = attn_out + res  # residual add in registers
    tl.store(output + offs_m, output_val)

The Triton compiler lowers this Python-like kernel through the full TTIR -> TTGIR -> LLVM IR -> PTX -> cubin pipeline.

Step 7: Advanced Optimizations (Articles 14-16)

If quantization is enabled (Article 14), MatMul uses INT8/FP8 Tensor Cores:

  • Weights stored in FP8, activations computed in FP8
  • The compiler automatically inserts scale/descale operations
  • Throughput improves ~2x (FP16 -> FP8)

For multi-GPU scenarios (Article 15), the compiler inserts communication operations:

  • Tensor Parallel: QKV projections are sharded; AllReduce executes after output projection
  • The compiler overlaps AllReduce with the next layer’s LayerNorm (communication-computation overlap)

The scheduler (Article 16) determines kernel execution order:

  • FFN Up and FFN Gate can execute in parallel on different CUDA Streams
  • CUDA Graphs eliminate kernel launch overhead

Step 8: Autotuning and Execution (This Article)

The final step is autotuning. For the fused attention kernel above, TorchInductor in max-autotune mode will:

  1. Generate multiple candidate configurations (different combinations of BLOCK_M, BLOCK_N, num_warps, num_stages)
  2. Simultaneously evaluate Triton-generated kernels and cuBLAS/cuDNN reference implementations (backend selection)
  3. Benchmark each configuration, selecting the fastest
  4. Cache the optimal result

Performance Data

After the complete compilation optimization pipeline, typical speedups on A100 80GB:

Model              Scenario            Compiled Speedup   Key Optimizations
GPT-2 (124M)       Training            ~1.5x              Fusion + CUDA Graph
LLaMA 7B           Inference (BS=1)    ~1.8x              Fusion + Autotune
LLaMA 7B           Inference (BS=32)   ~2.0x              Fusion + Tiling + Autotune
LLaMA 70B + INT8   Inference (TP=4)    ~2.5-3.0x          Quant + Fusion + Distributed

Note that these numbers vary with PyTorch version, GPU model, and workload characteristics. torch.compile optimization is most impactful in these scenarios:

  • Multiple pointwise operations (e.g., activation + bias + residual): fusion reduces memory bandwidth requirements by 3-5x
  • Small batch inference: CUDA Graph elimination of launch overhead has proportionally larger impact when kernel compute is small
  • Long sequence attention: FlashAttention-style tiling reduces O(N²) intermediate memory to O(N)

The interactive component below visualizes the entire 17-article journey:

[Interactive component: End-to-End Compilation Journey, a panorama of all 17 articles: user code and graph capture, IR representation and lowering, optimization passes, operator fusion, tiling and memory optimization, code generation, advanced optimizations (quantization, distributed, scheduling), and finally autotuning and GPU execution, with a typical end-to-end speedup of 1.5-3x.]

Summary and Future Directions

The 17-Article Journey in Review

Starting from ML Compiler Landscape, we progressively explored:

  1. Infrastructure layer: Graph capture (TorchDynamo), IR design (SSA/Dialect), progressive lowering
  2. Optimization layer: Graph optimization passes (DCE/CSE/Layout), polyhedral compilation, operator fusion and cost models
  3. Execution layer: Tiling and memory hierarchy, dynamic shapes, instruction selection, Triton backend
  4. System layer: Quantization compilation, distributed compilation, scheduling optimization
  5. Capstone layer: Autotuning and end-to-end practice (this article)

These 17 articles cover the complete path from torch.compile(model) to optimized kernel execution on GPUs. Each layer addresses a core question: how to eliminate inefficiency between computation and data movement.

Future Directions

ML compiler development is far from over. Several directions are worth watching:

1. LLM-Guided Search: Using large language models (like GPT-4) to generate and evaluate optimization schedules, rather than relying on handcrafted rules or traditional search algorithms. Preliminary experiments show LLMs can understand kernel code semantics and propose reasonable optimization suggestions.

2. Hardware-Software Co-design: Co-designing compilers and hardware. Google’s TPU with XLA exemplifies this approach — hardware provides a clear programming model (systolic array) and the compiler fully exploits hardware features. Future AI chips may expose richer compiler interfaces.

3. Unified IR/MLIR Ecosystem: As MLIR matures, different ML frameworks (PyTorch, JAX, TensorFlow) may converge on a unified compiler intermediate representation. This would enable optimization passes to be reused across frameworks, reducing duplicated engineering effort.

4. New Hardware Adaptation: The rise of AMD MI300, Intel Gaudi, and various AI ASICs (Cerebras, Groq, SambaNova) means compilers must support increasingly diverse backends. MLIR’s Dialect system and Transform Dialect’s programmable scheduling provide a solid framework for this.

5. End-to-End Optimization: Current compilers primarily optimize single computation graph execution. The future direction is extending the optimization scope to the entire inference pipeline — including tokenizers, preprocessing, multi-turn conversation management, and integration with serving systems.

If you have read through all 17 articles, I encourage you to return to ML Compiler Landscape for a re-read — with your understanding of every layer’s details, you will have a much deeper appreciation of the ML compiler’s overall architecture. It is like looking back at the path after reaching a summit: every step was for this moment’s panoramic view.

Further Reading