
GEMM Optimization — From Naive to Peak Performance

Updated 2026-04-06

GEMM (General Matrix Multiply) is the dominant compute operation in LLM training and inference. This article starts from the simplest naive implementation and progressively adds optimizations — tiling, thread tiling, vectorized memory access, double buffering, Tensor Core — until we approach cuBLAS-level theoretical peak performance.

1. Why GEMM Is the Core of LLMs

Every layer of a Transformer contains multiple matrix multiplications: QKV projections, attention scores, output projection, and the two linear layers of the FFN. For a typical LLM (e.g., hidden_dim=4096), over 90% of compute comes from these GEMM operations.

Figure: GEMM operations in a Transformer block (the matrix multiplies account for 90%+ of compute; S = seq_len, H = hidden_dim). Each layer contains six GEMMs: QKV projection (S×H · H×3H), attention scores (S×H · H×S), attention output (S×S · S×H), output projection (S×H · H×H), FFN up (S×H · H×4H), and FFN down (S×4H · 4H×H). These six operations determine the compute time of both inference and training.

GEMM costs 2MNK FLOPs (where A is M×K, B is K×N, and C is M×N). For large matrices, the arithmetic intensity sits well above the roofline's ridge point, making the operation compute-bound: in theory we can fully utilize the GPU's compute units.

Worked example from the roofline figure, C(4096×4096) = A(4096×4096) · B(4096×4096):

FLOPs = 2MNK = 137.4 GFLOP
Memory traffic = 4 × (MK + KN + MN) = 201.3 MB
Arithmetic intensity = FLOPs / bytes ≈ 682.7 FLOPs/byte, far above the H100 FP32 ridge point (≈20), so the kernel is compute-bound.

For a typical LLM, M = batch×seq and K = N = hidden_dim (4096+), so the arithmetic intensity is usually above 100 and GEMM lands in the compute-bound region. The optimization goal is to bring achieved FLOPS close to peak (FP32: 67 TFLOPS; FP16 Tensor Core: 990 TFLOPS).

2. Naive Implementation — Baseline

The simplest CUDA GEMM: each thread computes one element of the output matrix C. The inner loop performs multiply-accumulate along the K dimension:

C[row][col] = sum(A[row][k] * B[k][col]) for k = 0..K-1

Each output element requires reading one row of A (K elements) and one column of B (K elements) from global memory (HBM), totaling 2MNK memory accesses over the whole matrix.

Figure: Naive GEMM, one thread per output element of C = A × B. Each thread runs:

C[row][col] = 0;
for (k = 0; k < K; k++)
    C[row][col] += A[row][k] * B[k][col];  // 2 global reads per iteration

Each output element needs 2K global memory reads, 2MNK in total, so the kernel is severely memory-bound.

The problem is obvious: the same row of A is read again and again by every thread computing that row of C. The M × N threads each load their operands independently, producing massive redundant memory traffic.
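For reference, a minimal naive kernel sketch (assuming row-major float storage; the kernel name and launch configuration below are illustrative):

// Naive GEMM: one thread per element of C, row-major float matrices.
__global__ void gemm_naive(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];   // 2 global loads per FMA
    C[row * N + col] = acc;
}

// Illustrative launch:
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (M + 15) / 16);
// gemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);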

3. Optimization 1 — Tiling + Shared Memory

Core idea: divide the large matrices into BLOCK_SIZE × BLOCK_SIZE tiles, load one pair of tiles from HBM into shared memory at a time, and let all threads within the block share and reuse that data.

Global memory traffic drops from O(MNK) to O(MNK / BLOCK_SIZE), a BLOCK_SIZE-fold reduction.

Figure: tiling splits each matrix into BLOCK_SIZE × BLOCK_SIZE tiles; each block computes one tile of C by iterating over A/B tile pairs along the K dimension. The strategy:

1. Each CUDA block handles one tile of C (BLOCK_SIZE × BLOCK_SIZE threads).
2. Computing tile (tile_r, tile_c) of C requires all tiles in row tile_r of A and all tiles in column tile_c of B.
3. Outer loop: for t = 0 to K/BLOCK_SIZE, load one tile pair into shared memory per iteration.
4. Key point: each tile is loaded from HBM only once and then shared and reused by BLOCK_SIZE threads.

Two __syncthreads() calls per iteration ensure: (1) all threads finish loading the tile before computation begins; (2) computation on the current tile finishes before the next tile overwrites it.
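A minimal sketch of such a tiled kernel (assuming M, N, K are multiples of BLOCK_SIZE; names are illustrative):

#define BLOCK_SIZE 32

__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K / BLOCK_SIZE; ++t) {
        // Each thread loads one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * K + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * N + col];
        __syncthreads();                              // (1) tile fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                              // (2) done before the next load
    }
    C[row * N + col] = acc;
}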

4. Optimization 2 — Thread Tiling (Multiple Elements per Thread)

Tiling solves the HBM bandwidth problem, but shared memory bandwidth can also become a bottleneck. When each thread computes only 1 element, the inner loop reads shared memory twice and performs 1 FMA per step — the compute:load ratio is only 0.5.

Thread tiling: each thread handles TM × TN output elements (e.g., 4×4 = 16). After loading TM values from A and TN values from B into registers, each k step produces TM × TN FMAs, so the compute:load ratio improves to TM·TN / (TM + TN).
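The inner loop might look like the following sketch (TM = TN = 4; As/Bs are the shared-memory tiles from the previous step, and thread_row/thread_col are this thread's illustrative position inside the block; index setup is omitted):

const int TM = 4, TN = 4;
float c_reg[TM][TN] = {};                 // accumulators live in registers
float a_reg[TM], b_reg[TN];

for (int k = 0; k < BLOCK_SIZE; ++k) {
    for (int i = 0; i < TM; ++i) a_reg[i] = As[thread_row * TM + i][k];   // TM reads
    for (int j = 0; j < TN; ++j) b_reg[j] = Bs[k][thread_col * TN + j];   // TN reads
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            c_reg[i][j] += a_reg[i] * b_reg[j];                           // TM×TN FMAs
}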

Figure: 1×1 thread tile vs 4×4 thread tile.

1×1 tile (plain tiling): each inner-loop step reads As[ty][k] and Bs[k][tx] (2 shared-memory reads) and performs 1 FMA; compute:load = 0.5, so shared-memory bandwidth becomes the bottleneck.

4×4 tile: each k step reads TM + TN = 8 values from shared memory and performs TM × TN = 16 FMAs; compute:load = 16 / 8 = 2.0. Register usage per thread: C accumulators (4×4 = 16) + A fragment (4) + B fragment (4) = 24 registers. More registers allow a larger thread tile and a higher ratio, but occupancy may drop (a trade-off).

5. Optimization 3 — Vectorized Memory Access

GPU memory buses support 32/64/128-bit load instructions. Using float4 (128-bit) to load 4 floats at once cuts the instruction count, and with it the issue and scheduling overhead, to a quarter of what 4 scalar loads require.

Figure: scalar vs vectorized loads. Scalar: 4 instructions (LDG.32 at byte offsets 0/4/8/12). Vector: 1 instruction (LDG.128) reading 128 bits = 4 floats. The total data moved is identical (128 bits), but instruction issue drops from 4 scheduling slots to 1.

float4 tmp = *reinterpret_cast<float4*>(&A[row * K + k]);  // 128-bit aligned load

This requires 128-bit address alignment. In tiling, tile starting addresses are typically naturally aligned.
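In the tiled kernel, the tile load can be vectorized roughly like this (a sketch; a_col is this thread's illustrative column offset inside the tile, and both the global address and the shared-memory row are assumed 16-byte aligned):

// One LDG.128 from global memory, then one 128-bit store into the shared tile.
float4 v = *reinterpret_cast<const float4*>(&A[row * K + t * BLOCK_SIZE + a_col]);
*reinterpret_cast<float4*>(&As[threadIdx.y][a_col]) = v;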

6. Optimization 4 — Double Buffering Prefetch

While computing the current tile, simultaneously prefetch the next tile — overlapping memory access with compute latency.

Figure: without double buffering, load and compute are serialized. Each tile goes through load to shared memory → __syncthreads() → compute → __syncthreads() → next tile, so the timeline alternates Load T0, Compute T0, Load T1, Compute T1, ... and the total time is numTiles × (load + compute), with roughly half of it spent idle. The root cause is a single shared-memory buffer: loading and computing touch the same storage, so neither can start until the other finishes. The fix is to use two buffers, one being filled while the other is consumed (or register prefetching: read the next tile into registers first and write it to shared memory after the current tile's compute finishes).
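A sketch of the double-buffered main loop (load_tile and compute_tile are illustrative helpers; boundary handling is omitted). On Ampere/Hopper the prefetch is typically issued with cp.async or TMA so the copy genuinely overlaps compute; with plain loads the overlap comes from staging through registers:

__shared__ float As[2][BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[2][BLOCK_SIZE][BLOCK_SIZE];

int buf = 0;
load_tile(As[buf], Bs[buf], /*t=*/0);              // fill the first buffer
__syncthreads();

for (int t = 0; t < numTiles; ++t) {
    int next = buf ^ 1;
    if (t + 1 < numTiles)
        load_tile(As[next], Bs[next], t + 1);      // prefetch the next tile
    compute_tile(As[buf], Bs[buf]);                // compute on the current tile
    __syncthreads();                               // prefetch finished before swapping
    buf = next;
}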

7. Optimization 5 — Tensor Core GEMM

Switching from CUDA Core to Tensor Core: using the WMMA (Warp Matrix Multiply-Accumulate) API, a single warp-level instruction completes a 16×16×16 matrix block multiply-accumulate.

Block-level tiling (shared memory) is still needed — Tensor Core only replaces the innermost compute unit.

Figure: multi-level tiling for Tensor Core GEMM; from the grid down to the Tensor Core, each level's tile shrinks to fit its hardware level.

Level 1, grid tile: each CUDA block owns a BM × BN region of C (e.g., 128 × 128); this sets the per-block workload and shared-memory footprint.
Level 2, warp tile: each warp inside the block owns a WM × WN region (e.g., 32 × 64); this sets the division of work among warps and the register allocation.
Level 3, MMA instruction tile: the warp's inner loop uses wmma::mma_sync on 16×16×16 (or m16n8k16) blocks, executed by the Tensor Core hardware: one warp-level instruction per block multiply-accumulate, D(16×16) = A(16×16) × B(16×16) + C(16×16). The 16×16×16 half-precision wmma call maps to one or more PTX mma.sync instructions underneath.

Typical H100 HGEMM sizes: grid (block) tile 128 × 128 in shared memory, warp tile 32 × 64 in register fragments, MMA (WMMA) tile 16×16×16 per warp-level instruction.

The three-step WMMA flow: load_matrix_sync (shared memory → register fragment) → mma_sync (Tensor Core execution) → store_matrix_sync (write back).
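Put together, the per-warp flow looks roughly like this sketch (warp_m/warp_n locate the warp's tile, As/Bs are FP16 shared-memory tiles with leading dimensions lda/ldb, and BK is the shared tile's K extent; all of these names are illustrative):

#include <mma.h>
using namespace nvcuda;

wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

wmma::fill_fragment(c_frag, 0.0f);
for (int k = 0; k < BK; k += 16) {
    wmma::load_matrix_sync(a_frag, &As[warp_m * 16][k], lda);   // shared → fragment
    wmma::load_matrix_sync(b_frag, &Bs[k][warp_n * 16], ldb);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);             // runs on the Tensor Core
}
wmma::store_matrix_sync(&C[c_row * N + c_col], c_frag, N, wmma::mem_row_major);  // write back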

Figure: WMMA step 1, load_matrix_sync. The 32 threads of a warp cooperatively load a 16×16 block from shared memory (As/Bs) into their registers as a fragment (wmma::matrix_a and wmma::matrix_b in half, plus a float accumulator fragment):

// Declare a fragment (each thread holds part of the matrix)
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::load_matrix_sync(a_frag, &As[warp_row * 16], 16);  // shared → register

Fragment distribution: the 16×16 = 256 elements are spread across 32 threads, 8 elements per thread in registers; which thread holds which elements is determined by the hardware and opaque to the programmer (fragments are only accessible through the wmma API).

FP16 inputs with FP32 accumulation keep the precision loss manageable while delivering a 4-8x throughput improvement.

8. Performance Ladder Summary

Performance improvement from each optimization step (using 4096x4096 on H100 as reference):

Figure: GEMM optimization performance ladder (H100, 4096×4096; the first five stages are FP32 SGEMM, the last is FP16 HGEMM). GFLOPS and fraction of the FP32 peak at each step:

Naive: 400 (0.6%)
+ Block tiling: 8K (12%)
+ Thread tile: 25K (37%)
+ Vectorized loads: 35K (52%)
+ Double buffering: 45K (67%)
Tensor Core (FP16): 60K (~90%)
cuBLAS reference: ~65K (97%)

From the naive implementation at less than 1% utilization to Tensor Core approaching 90% — the core philosophy remains: reduce memory access → increase data reuse → leverage dedicated hardware.

9. GEMM on Intel iGPU

The GEMM optimization approach on Intel Xe2 (Lunar Lake / Panther Lake) is exactly the same as CUDA — only the terminology and APIs differ:

Figure: CUDA GEMM vs Intel GEMM tiling hierarchy; the idea is identical, only the APIs and hardware unit names differ.

Grid level: CUDA block tile BM × BN in shared memory ↔ Intel work-group tile BM × BN in SLM (Shared Local Memory).
Warp / sub-group level: CUDA warp tile WM × WN in register fragments ↔ Intel sub-group tile WM × WN in the GRF (registers).
Instruction level: wmma::mma_sync on the Tensor Core (16×16×16) ↔ joint_matrix_mad on the XMX systolic array.
Data types: FP16 in → FP32 acc (mma.sync.f32.f16) ↔ FP16/BF16 in → FP32 acc (dpas / XMX).
Programming API: CUDA wmma / mma.sync (PTX / CUTLASS) ↔ SYCL joint_matrix (or ESIMD intrinsics).

The core pipeline is the same: HBM → SLM/shared memory (block tile) → registers (warp / sub-group tile) → matrix hardware (Tensor Core / XMX).

Core mapping: shared memory → SLM, warp → sub-group, Tensor Core → XMX, wmma → joint_matrix. The essence of optimization doesn’t change: move data from far memory to near memory, and maximize reuse at the fastest storage level.