
CUDA Programming Model — From Code to Hardware


Updated 2026-04-06

In the GPU Architecture article, we learned about the internal structure of an SM — four Processing Blocks, Warp Schedulers, various compute units, and the memory hierarchy. In the Matrix Acceleration Units article, we saw warp-level cooperative operations on Tensor Core and XMX.

Now the question is: how do programmers control this hardware? This article starts from the CUDA programming model to understand the core abstractions of GPU programming — thread hierarchy, memory model, synchronization mechanisms, and how these abstractions map to physical hardware.

Section 1: SIMD vs SIMT — Two Parallel Execution Models

GPU parallel computing has two primary models. Understanding the difference between them is the starting point for understanding CUDA programming.

SIMD (Single Instruction Multiple Data): One instruction operates on a vector. The programmer must know the vector width (8/16/32) and use intrinsics or compiler vectorization to leverage the hardware. The EU inside Intel iGPU is SIMD-driven.

SIMT (Single Instruction Multiple Threads): The programmer writes scalar code that looks single-threaded, and the hardware automatically packs 32 threads into a warp that executes together. The programmer doesn’t need to know the vector width. NVIDIA GPUs use SIMT.

Basic Operation: a[i] = b[i] + c[i]
[Figure] The same operation under the two models. SIMD (Intel iGPU): one 8-wide vector instruction, vadd.8 a[0:7], b[0:7], c[0:7]. SIMT (NVIDIA): 32 threads each execute the scalar statement a[tid] = b[tid] + c[tid]. Same result, different contracts: the SIMD programmer (or compiler) must vectorize explicitly and know the vector width; SIMT hardware packs 32 scalar threads into a warp automatically.
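To make the contrast concrete in code, here is a sketch of the same addition in both styles — host-side AVX intrinsics for SIMD versus a CUDA kernel for SIMT (illustrative code; assumes AVX support and, for brevity, n a multiple of 8):

#include <immintrin.h>

// SIMD: the programmer steps by the vector width and issues explicit 8-wide ops
void add_simd(float* a, const float* b, const float* c, int n) {
    for (int i = 0; i < n; i += 8) {                     // vector width hard-coded: 8
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(a + i, _mm256_add_ps(vb, vc));  // one vector instruction
    }
}

// SIMT: plain scalar code; the hardware packs 32 such threads into a warp
__global__ void add_simt(float* a, const float* b, const float* c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) a[i] = b[i] + c[i];                       // no vector width anywhere
}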

The key difference is in branch handling: SIMT is more branch-friendly — warp divergence is only an efficiency loss, not something the programmer needs to manually manage with masks. SIMD branches require explicit mask or blend instructions.

Comparison of three parallel execution models:

| | SIMD (classic) | SIMT (NVIDIA) | Intel iGPU (hybrid) |
|---|---|---|---|
| Programming view | Explicit vector instructions (intrinsics / compiler vectorization) | Scalar code (hardware parallelizes automatically) | SYCL work-item scalar code (sub-groups expose SIMD) |
| Hardware execution | One instruction drives an N-wide vector register | Warp (32 threads) executes the same instruction in lockstep | EU threads drive 8/16-wide SIMD ALUs |
| Branch handling | Explicit mask or blend instructions | Automatic hardware masking (warp divergence) | Hardware masking (channel enable) |
| Vector width | Programmer must know it (8/16/32) | Transparent to the programmer (always a 32-wide warp) | Partially visible (sub-group size) |
| Typical hardware | CPU (SSE/AVX), Intel EU (at the lowest level) | NVIDIA SM (FP32 / INT32 Cores) | Intel Xe-Core (Vector Engine + XMX) |

Intel iGPU is an interesting hybrid: the underlying hardware is SIMD-driven (EU Threads execute 8/16-wide vector operations), but the SYCL/OpenCL programming layer provides a near-SIMT work-item abstraction. Sub-group operations expose the underlying SIMD width and are a critical tool for Intel GPU programming.


Section 2: Thread → Block → Grid

CUDA organizes parallel computation with a three-level thread hierarchy:

  • Thread: The smallest execution unit. Each thread executes the same kernel code but processes different data
  • Block: A group of threads that share Shared Memory and can synchronize via __syncthreads()
  • Grid: The collection of all Blocks, produced by a single kernel launch

Each thread determines its identity through threadIdx (position within the Block) and blockIdx (Block position within the Grid), combined with blockDim (Block size) to compute its global data position.

Grid: Collection of Blocks
[Figure] A Grid with gridDim = 4×3: twelve Blocks, each identified by blockIdx = (0..gridDim.x-1, 0..gridDim.y-1). Each Block is an independent thread group that can be assigned to any SM for execution; gridDim specifies the number of Blocks per dimension.

Blocks and Grids support 1D/2D/3D dimensions — 2D indexing maps more naturally to matrix operations (threadIdx.x corresponds to columns, threadIdx.y to rows).

Global Index Calculation

The most fundamental CUDA programming pattern: each thread computes the global position of the data it is responsible for.

[Figure] 1D Grid: 4 blocks × 8 threads = 32 threads. globalIdx = threadIdx.x + blockIdx.x * blockDim.x; for thread 3 of block 1, globalIdx = 3 + 1 × 8 = 11, so it processes element 11 (a[11] = b[11] + c[11]). threadIdx.x is the thread's position within its block, blockIdx.x the block's position within the grid, blockDim.x the number of threads per block.
// Vector addition kernel — the simplest CUDA program
__global__ void vecadd(float* a, float* b, float* c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) c[i] = a[i] + b[i];
}
// Launch: vecadd<<<(n+255)/256, 256>>>(d_a, d_b, d_c, n);
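For the 2D case mentioned above, the same pattern extends naturally — a sketch (hypothetical matadd kernel; 16×16 blocks assumed):

// 2D variant: each thread handles one element of a row-major matrix
__global__ void matadd(float* a, const float* b, const float* c, int rows, int cols) {
    int col = threadIdx.x + blockIdx.x * blockDim.x;   // x → columns
    int row = threadIdx.y + blockIdx.y * blockDim.y;   // y → rows
    if (row < rows && col < cols) {
        int i = row * cols + col;                      // row-major linearization
        a[i] = b[i] + c[i];
    }
}
// Launch: dim3 block(16, 16);
//         dim3 grid((cols + 15) / 16, (rows + 15) / 16);
//         matadd<<<grid, block>>>(d_a, d_b, d_c, rows, cols);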

Section 3: Logical to Physical Mapping

Thread → Block → Grid is a logical structure. The programmer defines it but does not control how it maps to physical hardware.

[Figure] Logical view: a Grid of 6 Blocks, each 128 threads = 4 warps. Physical view: 4 SMs, each able to hold up to 2 Blocks, all initially idle. Block-to-SM assignment is decided by the runtime; the order is non-deterministic and uncontrollable, so programs should not assume any assignment order.

Key points:

  • Block-to-SM assignment is decided by the runtime — the order is non-deterministic, and programs should not assume any execution order
  • A single SM can host multiple Blocks simultaneously — limited by register usage, shared memory usage, and warp count (these hardware limits can be queried at run time; see the sketch after this list)
  • Threads within a Block are packed into Warps by hardware — threads 0-31 = warp 0, threads 32-63 = warp 1, …
  • blockDim should be a multiple of 32 — otherwise the last warp has idle threads, wasting compute resources
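The physical limits behind these rules can be read from the runtime. A minimal host-side sketch using the standard CUDA runtime API (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Max threads per SM: %d (= %d warps)\n",
           prop.maxThreadsPerMultiProcessor, prop.maxThreadsPerMultiProcessor / 32);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Registers per SM: %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}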

Section 4: Shared Memory

Memory declared with __shared__ is Block-level fast storage (in the SM’s L1/SRAM), shared by all threads in the Block, and released when the Block finishes.

Uses:

  • Inter-thread communication: One thread writes, other threads read
  • Data preloading: Load once from HBM to shared memory, reuse across multiple threads within the block

Bank Structure

Shared memory is divided into 32 banks, with consecutive 4-byte words mapped to consecutive banks. When a warp’s 32 threads access different banks simultaneously, it completes in one cycle; accessing different addresses in the same bank causes a bank conflict, which must be serialized.

Stride=1: Conflict-Free
[Figure] 32 banks; consecutive 4-byte words map to consecutive banks: Bank(addr) = (addr / 4) % 32. Thread i accesses address i × 4 → bank i % 32, so with stride 1 each of the 32 threads hits a different bank and the hardware serves all requests in parallel in one cycle — no bank conflict.

The simplest way to avoid bank conflicts: stride-1 sequential access is naturally conflict-free. When the stride is a power of 2, conflicts are common; you can break the alignment with a padding trick (tile[32][33] instead of tile[32][32]).
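A minimal sketch of why the padding works, assuming the standard 32-bank, 4-byte-word layout (hypothetical kernel, launched with 32 threads — one warp):

__global__ void column_read_demo(float* out) {
    // Unpadded: bank(tile[r][c]) = (r*32 + c) % 32 = c % 32 — a whole column lives in one bank
    __shared__ float tile[32][32];
    // Padded: bank(tile_pad[r][c]) = (r*33 + c) % 32 = (r + c) % 32 — a column spreads over all 32 banks
    __shared__ float tile_pad[32][33];

    int t = threadIdx.x;
    tile[t][0]     = (float)t;     // each thread writes column 0
    tile_pad[t][0] = (float)t;
    __syncthreads();

    float a = tile[t][0];          // 32 threads hit bank 0 → 32-way conflict, serialized
    float b = tile_pad[t][0];      // 32 threads hit banks 0..31 → one cycle
    out[t] = a + b;                // keep the reads live
}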


Section 5: Memory Coalescing

Global memory (HBM) access operates in 32-byte or 128-byte transactions. When a warp’s 32 threads access contiguous addresses, the hardware can merge them into the minimum number of transactions — this is memory coalescing.

Coalesced: Sequential Access
[Figure] Thread i reads A[i]: 32 threads read 32 consecutive floats (128 bytes), which the memory controller coalesces into a single 128B transaction — 128 bytes transferred, 128 bytes useful, 100% bandwidth utilization.

Practical impact:

  • Reading a row of a row-major matrix: M[row][tid] — contiguous addresses, naturally coalesced
  • Reading a column of a row-major matrix: M[tid][col] — stride equals the row width, severely uncoalesced
  • This is why matrix multiplication tiles data into shared memory: load from HBM with coalesced access first, then read at arbitrary stride inside shared memory, where a bank conflict is far cheaper than an uncoalesced HBM access (see the transpose sketch below)
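Matrix transpose is the textbook case of this staging pattern: both the global read and the global write stay coalesced, and the Section 4 padding keeps the shared-memory column reads conflict-free. A sketch (hypothetical kernel; launch with dim3 block(32, 32) and a grid covering the matrix):

__global__ void transpose_tiled(const float* in, float* out, int width, int height) {
    __shared__ float tile[32][33];                            // +1 padding avoids bank conflicts

    int x = blockIdx.x * 32 + threadIdx.x;                    // column in input
    int y = blockIdx.y * 32 + threadIdx.y;                    // row in input
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;                        // column in output
    y = blockIdx.x * 32 + threadIdx.y;                        // row in output
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}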

Section 6: Synchronization and Barriers

__syncthreads() is a Block-level barrier: all threads must reach this point before any can continue execution.

[Figure] __syncthreads(), the barrier within a Block: four warps each write shared memory, wait until all threads have arrived at the barrier, then read. Without the barrier, warp 0 would read immediately after its own write while warp 3 may not have written yet — stale or uninitialized data, a race condition whose outcome depends on warp execution order.

Typical usage: write to shared memory → __syncthreads() → read from shared memory. Without the barrier, fast warps might read data that slow warps haven’t finished writing — a race condition. The reduction sketch after the notes below shows the full pattern.

Important notes:

  • All threads must reach the same __syncthreads() call — asymmetric calls in if/else branches cause deadlock
  • Threads within a warp traditionally execute in lockstep, giving implicit synchronization — but since Volta’s independent thread scheduling this is no longer guaranteed, so use __syncwarp() explicitly where warp-level ordering matters
  • __syncthreads() only synchronizes within a Block — there is no direct synchronization mechanism between Blocks (this is a core constraint of the GPU programming model)
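A block-wide sum reduction combines all three notes: repeated write → barrier → read rounds, every __syncthreads() placed outside the divergent if so all threads reach it, and each block emitting only its own partial sum because blocks cannot synchronize with each other. A minimal sketch (hypothetical kernel; assumes blockDim.x is a power of two ≤ 1024):

__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float smem[1024];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // every thread writes its own slot
    __syncthreads();                              // all writes visible before any read

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();                          // outside the if — every thread reaches it
    }
    if (threadIdx.x == 0) out[blockIdx.x] = smem[0];  // one partial sum per block
}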

Section 7: Occupancy

Occupancy = active warps on an SM / maximum warps the SM supports. The more active warps, the more candidates the scheduler has to switch to while others wait on memory — the better the latency hiding.

Occupancy is limited by three factors:

  1. Warp count: warps per block (blockDim / 32) × blocks per SM, capped by the SM’s maximum warp count
  2. Register usage: More registers per thread means fewer blocks the SM can accommodate
  3. Shared memory usage: More shared memory per block means fewer blocks the SM can accommodate
[Interactive figure] Occupancy calculator, H100 (Hopper SM): 8 active blocks × 8 warps/block = 64 of 64 warps → 100% occupancy. Per-resource limits on blocks per SM: warps allow 8 (the bottleneck), registers 8, shared memory 14, hardware maximum 32.

Higher occupancy isn’t always better — sometimes low occupancy + high data reuse (large tiles filling shared memory and registers) is actually faster. But occupancy is typically a good starting point for optimization.

At compile time, use --ptxas-options=-v to check a kernel’s register and shared memory usage.
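At run time, the occupancy API reports how many blocks of a given kernel fit on one SM; combined with the device properties this yields occupancy directly. A minimal sketch (real runtime API; device 0 and the Section 2 vecadd kernel assumed):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecadd(float* a, float* b, float* c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int blockSize = 256, numBlocks = 0;
    // How many 256-thread blocks of vecadd fit on one SM?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, vecadd, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = numBlocks * blockSize / 32;
    int maxWarps = prop.maxThreadsPerMultiProcessor / 32;
    printf("Occupancy: %d / %d warps = %.0f%%\n",
           activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    return 0;
}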


Section 8: Intel iGPU Programming Essentials

CUDA is NVIDIA-proprietary. Intel GPUs use SYCL / DPC++, based on standard C++, with concepts that map one-to-one with CUDA:

CUDA vs SYCL: the same vector-addition kernel — the core logic is identical, only the API differs.

CUDA C++:

// CUDA vector addition
__global__ void vecadd(float* a, float* b, float* c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) c[i] = a[i] + b[i];
}
// Launch
vecadd<<<gridDim, blockDim>>>(d_a, d_b, d_c, n);

SYCL (Intel DPC++):

// SYCL vector addition
q.parallel_for(
    nd_range<1>(N, block_size),
    [=](nd_item<1> item) {
        int i = item.get_local_id(0)
              + item.get_group(0) * item.get_local_range(0);
        if (i < n) c[i] = a[i] + b[i];
    });
// Launch is built into parallel_for; q is a sycl::queue,
// and the device is selected when the queue is created

Concept mapping:

| CUDA | SYCL | Meaning |
|---|---|---|
| threadIdx.x | get_local_id(0) | thread ID within a Block / Work-group |
| blockIdx.x | get_group(0) | Block / Work-group ID |
| blockDim.x | get_local_range(0) | Block / Work-group size |
| __global__ | parallel_for | kernel entry point |
| <<<grid, block>>> | sycl::queue | launch / device selection |
| __shared__ | local accessor | Block-level fast memory |
| __syncthreads() | group_barrier() | Block-level barrier |
| warp (32 threads) | sub-group | width 8/16/32, decided by hardware |

Core terminology mapping:

  • work-item ≈ thread — the smallest execution unit
  • work-group ≈ block — a group of threads sharing SLM
  • sub-group ≈ warp — but the width can be 8/16/32 (not fixed at 32)
  • SLM (Shared Local Memory) ≈ shared memory — similar usage

Sub-group is the key to Intel GPU programming — it directly exposes the underlying SIMD width. sub_group::shuffle and sub_group::reduce correspond to NVIDIA’s warp shuffle operations.
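For illustration, a minimal DPC++ sketch of a sub-group reduction using the standard SYCL 2020 reduce_over_group (sizes and buffer names are assumptions; every work-item receives the sum over its sub-group — the counterpart of a warp-level reduction):

#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    float* data = sycl::malloc_shared<float>(64, q);
    float* out  = sycl::malloc_shared<float>(64, q);
    for (int i = 0; i < 64; ++i) data[i] = 1.0f;

    q.parallel_for(sycl::nd_range<1>(64, 64), [=](sycl::nd_item<1> item) {
        auto sg = item.get_sub_group();   // hardware-width sub-group (8/16/32)
        out[item.get_global_id(0)] =
            sycl::reduce_over_group(sg, data[item.get_global_id(0)], sycl::plus<float>());
    }).wait();

    sycl::free(data, q);
    sycl::free(out, q);
    return 0;
}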

XMX matrix operations are accessed through the SYCL joint_matrix API or the low-level ESIMD dpas instruction, corresponding to NVIDIA’s wmma / mma.sync (see the Matrix Acceleration Units article for details).


Summary

Core abstractions of the CUDA programming model:

  1. SIMT execution model — Write scalar code, hardware parallelizes. Warp divergence is an efficiency problem, not a correctness problem
  2. Three-level thread hierarchy — Thread → Block → Grid, where Block is the fundamental unit of resource allocation and synchronization
  3. Memory hierarchy — Register (private) → Shared Memory (Block-shared, fast) → Global/HBM (global, slow)
  4. Coalescing + Banks — Global memory needs contiguous access, shared memory needs to avoid bank conflicts
  5. Occupancy — warp count × data reuse = performance; the tightest of three resources (warps / registers / shared memory) determines the upper bound

The next article will put these concepts into practice — GEMM optimization, from a naive implementation to Tensor Core GEMM, progressively pushing matrix multiplication performance toward the hardware’s theoretical peak.