GPU Architecture — From Transistors to Threads
Updated 2026-04-06
In the AI Compute Stack Overview article, we walked through the seven-layer software stack, from inference frameworks down to the hardware ISA. The bottommost layer in that article was “Hardware ISA — the only thing a GPU can execute.”
This article drills one level deeper: what GPU hardware itself looks like. Understanding hardware architecture is a prerequisite for writing high-performance GPU code — you need to know what physical structures instructions execute on before you can make correct optimization decisions.
Section 1: GPU vs CPU — Two Design Philosophies
CPUs and GPUs face two fundamentally different computing demands, and thus took entirely different design paths.
CPU design goal: Minimize the latency of a single task. To achieve this, CPUs invest a large transistor budget in:
- Large multi-level caches (L1/L2/L3, sometimes 50%+ of chip area) — keeping data as close to compute units as possible
- Complex control logic — branch prediction, out-of-order execution, register renaming, making each individual thread run as fast as possible
- A small number of high-performance cores (4-8) — each core a piece of “heavy infantry”
GPU design goal: Maximize total throughput. A completely different strategy:
- Massive numbers of simple ALUs — thousands of small compute units, each weak individually, but with extremely high aggregate throughput
- Minimized control logic — no complex branch prediction or out-of-order execution, control logic takes a minimal share
- Small caches, using thread switching to hide latency — instead of relying on cache hits to speed things up when encountering memory waits, the GPU immediately switches to another group of threads to continue execution
Latency Hiding: Two Radically Different Strategies
When a CPU encounters a cache miss (data not in cache), the thread stalls — doing nothing, waiting idly for data to return from memory. This waiting time is completely wasted. The CPU’s countermeasure is: use larger caches to reduce the probability of misses.
The GPU’s countermeasure is entirely different: don’t reduce the waiting, but do other work during the wait. A GPU has thousands of threads active simultaneously (grouped into warps). When one group of threads is waiting for memory, the hardware warp scheduler immediately switches to another group of threads that are ready. As long as there are enough concurrently active threads, the waiting time can be completely filled.
This is why GPUs need massive numbers of threads — not because there are that many independent tasks to do, but because enough threads are needed to mask memory latency. This concept is called latency hiding, and it is the most fundamental idea for understanding GPU architecture.
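How many threads is “enough”? A rough order-of-magnitude estimate via Little’s law, using the H100 bandwidth and latency figures quoted later in this article (the ~300 ns effective latency and the 128-byte transaction size are illustrative assumptions, not spec values):

```latex
\text{data in flight} \approx \text{bandwidth} \times \text{latency}
                      \approx 3.35\ \text{TB/s} \times 300\ \text{ns} \approx 1\ \text{MB}

\text{outstanding requests} \approx 1\ \text{MB} \,/\, 128\ \text{B} \approx 8000
```

Keeping thousands of memory requests in flight at all times is only possible when tens of thousands of threads each contribute their own independent loads — something a handful of CPU-style threads could never do.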
Section 2: NVIDIA GPU Global Structure
Using the NVIDIA H100 as an example, here is the hierarchical structure from chip level to the smallest compute unit:
GPU Chip → GPC (Graphics Processing Cluster) → TPC (Texture Processing Cluster) → SM (Streaming Multiprocessor)
- GPC is the largest logical grouping. The H100 has 8 GPCs.
- Each GPC contains several TPCs (Texture Processing Clusters). TPC is a legacy name (from graphics rendering); in compute scenarios, it’s mainly a container for SMs.
- Each TPC contains 2 SMs — the SM is the GPU’s core compute unit and the level we most need to understand.
- All SMs share a large L2 Cache (~50MB on H100).
- Beyond L2 is HBM (High Bandwidth Memory) — the GPU’s main memory.
The full GH100 die: 8 GPC × 9 TPC × 2 SM = 144 SMs, of which 132 are enabled on the H100 SXM part, alongside 50 MB of L2 and 80 GB of HBM3.
For comparison: the RTX 4090 has 128 SMs, 72 MB of L2, and 24 GB of GDDR6X. The SM-level design is closely related (Ada and Hopper are sibling architectures), but the scale and the memory technology differ.
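These numbers can be read off a given card at runtime. A minimal sketch using the CUDA runtime API (the fields are standard members of cudaDeviceProp; the printed values depend on the GPU you run it on):

```cuda
// Query the hierarchy described above: SM count, L2 size, memory, per-SM limits.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    printf("Name:               %s\n",  prop.name);
    printf("SM count:           %d\n",  prop.multiProcessorCount);
    printf("L2 cache:           %d MB\n", prop.l2CacheSize >> 20);
    printf("Global memory:      %zu GB\n", prop.totalGlobalMem >> 30);
    printf("Registers per SM:   %d x 32-bit = %d KB\n",
           prop.regsPerMultiprocessor, prop.regsPerMultiprocessor * 4 >> 10);
    printf("Shared mem per SM:  %zu KB\n", prop.sharedMemPerMultiprocessor >> 10);
    printf("Max threads per SM: %d\n",  prop.maxThreadsPerMultiProcessor);
    printf("Warp size:          %d\n",  prop.warpSize);
    return 0;
}
```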
Section 3: Inside the SM
The SM (Streaming Multiprocessor) is the unit within the GPU that actually executes computation. Understanding the SM’s internal structure means understanding the GPU’s execution model.
Four Processing Blocks
Each SM is divided into 4 Processing Blocks (also called Sub-partitions), each containing:
- 1 Warp Scheduler + 1 Dispatch Unit — selects and issues instructions
- FP32 CUDA Cores — perform floating-point multiply-add (32 per block on Hopper, 128 per SM total)
- INT32 Cores — integer operations
- Tensor Cores — matrix multiply-accumulate acceleration units (detailed in the next article)
- SFU (Special Function Unit) — computes transcendental functions like sin/cos/exp
- Load/Store Units — load/store data from/to memory
Processing Block Local Resources
Each Processing Block also has its own register file:
- Register File — 256 KB total per SM, physically distributed across the processing blocks (NVIDIA architecture whitepapers show each sub-partition with its own independent register file in SM diagrams). This is the fastest storage on the GPU; each thread can use up to 255 32-bit registers. A warp’s registers are stored in the local register file of its assigned processing block, and warps do not migrate between processing blocks once assigned. The register file is large because the GPU needs to simultaneously maintain the register state of thousands of threads (the key to achieving zero-overhead warp switching)
SM-Level Shared Resources
The 4 Processing Blocks share:
- Shared Memory / L1 Cache — 256 KB (Hopper) of on-chip SRAM, of which up to 228 KB can be configured as shared memory, with the remainder used as L1 cache; the split is configurable per kernel. It physically belongs to the entire SM but is logically partitioned by thread block — all threads within the same thread block (across all warps, across all processing blocks) access the same shared memory region, while different thread blocks cannot see each other’s shared memory
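From the programming side, the split shows up as two attributes on a kernel. A minimal sketch, assuming a hypothetical kernel that wants a large dynamic shared-memory allocation (the 200 KB figure is just an example below the ~228 KB limit; the carveout attribute is a hint to the driver, not a guarantee):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel using dynamically sized shared memory.
__global__ void my_kernel(float* out) {
    extern __shared__ float tile[];          // size set at launch time
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

void launch(float* out) {
    size_t smem_bytes = 200 * 1024;

    // Dynamic shared memory beyond 48 KB per block requires an explicit opt-in.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smem_bytes);

    // Hint that this kernel prefers shared memory over L1 in the split.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    my_kernel<<<dim3(132), dim3(256), smem_bytes>>>(out);
}
```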
Generational Evolution
| Resource | Ampere (A100) 2020 | Hopper (H100) 2022 | Blackwell (B200) 2024 |
|---|---|---|---|
| SM Count per GPU | 108 | 132↑ | 192↑ |
| FP32 Core / SM | 64 | 128↑ | 128 |
| INT32 Core / SM | 64 | 64 | 64 |
| Tensor Core / SM | 4 (3rd gen) | 4 (4th gen)↑ | 4 (5th gen)↑ |
| Tensor Core Precision | FP16/BF16/TF32/INT8/INT4 | FP16/BF16/TF32/FP8/INT8↑ | FP16/BF16/TF32/FP8/FP4/INT8↑ |
| Register File / SM | 256 KB | 256 KB | 256 KB |
| Shared Memory / SM | 164 KB | 228 KB↑ | 228 KB |
| L1 Cache / SM (unified with shared memory) | 192 KB | 256 KB↑ | 256 KB |
| Max Warps / SM | 64 | 64 | 64 |
| Max Threads / SM | 2048 | 2048 | 2048 |
Arrows (↑) mark changes from the previous generation.
Section 4: Warp — The GPU’s Minimum Execution Unit
What Is a Warp
A Warp = a group of 32 threads, the minimum unit of GPU scheduling and execution. The 32 threads in a warp execute the same instruction in lockstep at the hardware level (SIMT — Single Instruction, Multiple Threads).
Think of it like a platoon of soldiers marching in step — 32 people step left simultaneously, step right simultaneously. They execute the same “instruction” (cadence) but each on different “data” (ground position).
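In CUDA code the grouping is implicit but easy to compute. A minimal sketch, assuming a one-dimensional thread block (warpSize is a built-in variable, equal to 32 on all current NVIDIA GPUs):

```cuda
// Inside any kernel: which warp does this thread belong to, and where in it?
int lane_id       = threadIdx.x % warpSize;  // 0..31: position within the warp
int warp_in_block = threadIdx.x / warpSize;  // warp index within the thread block
```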
Each Processing Block’s Warp Scheduler selects one ready warp (a warp not waiting on memory) each cycle and issues one instruction for it. The 4 Processing Blocks can each select a warp simultaneously, so an SM can issue up to 4 instructions per cycle.
Warp Divergence
All 32 threads must execute the same instruction — so what happens if the code has an if/else?
This is Warp Divergence: when threads within a warp take different branch paths, the hardware cannot execute both paths simultaneously. The solution is serialization — first execute the if-path (threads not taking the if-path are masked out), then execute the else-path (threads not taking the else-path are masked out). The time for both paths is added together.
By contrast, when there is no branching, all 32 threads of a warp simply execute the same instruction on different data — for example, every thread runs z[i] = x[i] + y[i] on its own element, at full efficiency.
This is why GPU code should minimize branching — not that you can’t write if/else, but that all 32 threads in the same warp should ideally take the same path.
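A minimal illustration of the difference (both kernels are hypothetical; the second branches on a condition that is constant within each warp, so no warp diverges):

```cuda
// Divergent: odd and even lanes of the SAME warp take different paths,
// so the hardware runs the two paths one after the other with masking.
__global__ void divergent(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        x[i] = x[i] * 2.0f;   // lanes 0, 2, 4, ... active, the rest masked
    else
        x[i] = x[i] + 1.0f;   // lanes 1, 3, 5, ... active, the rest masked
}

// Divergence-free: threadIdx.x / 32 is identical for all 32 threads of a warp,
// so every warp takes exactly one of the two paths.
__global__ void uniform_branch(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}
```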
Warp Scheduler and Latency Hiding
The warp scheduler is the SM’s “scheduling brain.” Its job is simple: pick a ready warp and issue its next instruction.
When a warp executes a memory load instruction, it needs to wait hundreds of cycles for the data. The scheduler doesn’t wait — it immediately switches to another ready warp to continue execution. This switch has zero overhead (zero-overhead context switch), because all warps’ register states permanently reside in the SM’s register file, requiring no save/restore.
A typical sequence: the warp scheduler issues compute instructions for Warp 0; after 2 cycles Warp 0 hits a global memory access and must wait hundreds of cycles for the data, so the scheduler immediately issues instructions for Warp 1, then Warp 2, and so on — by the time it cycles back, Warp 0’s data has usually arrived and it is ready again.
Hardware Concepts vs Software Abstractions: How They Map
GPU programming has many concepts with “block” in the name, which can be confusing. The key is distinguishing hardware entities from software abstractions:
| Concept | Type | Description |
|---|---|---|
| SM | Hardware | GPU’s basic compute unit |
| Processing Block | Hardware | Sub-partition inside an SM (each has its own warp scheduler, CUDA cores, Tensor Core) |
| Grid | Software | All threads in a single kernel launch |
| Thread Block | Software | A programmer-defined group of threads (e.g., 256), sharing shared memory, synchronizable via __syncthreads() |
| Warp | Bridge | Every 32 consecutive threads within a thread block are automatically grouped into a warp, scheduled in lockstep by hardware |
Mapping:
- Thread Block → SM: A thread block is assigned as a whole to one SM, sharing that SM’s shared memory. One SM can run multiple thread blocks concurrently (depending on whether register and shared memory resources are sufficient — this is occupancy)
- Warp → Processing Block: Threads within a thread block are automatically divided into warps of 32, with each warp assigned to one of the SM’s processing block’s warp schedulers. Once assigned, a warp stays on that processing block and does not migrate — because the warp’s register state is stored in that processing block’s local register file, and migrating would mean moving up to 32 × 255 × 4 = ~32 KB of register data, which is prohibitively expensive. Switching between warps within the same processing block is zero-overhead — the scheduler simply switches to another warp’s register region with no save/restore needed. This contrasts with CPUs, where threads can migrate between cores (OS load balancing); GPU warps rely on static distribution and massive parallelism for load balancing
- Registers: Private to each thread. Threads within the same warp cannot directly access each other’s registers (but can exchange values through warp shuffle instructions like __shfl_sync)
Note: “Processing Block” and “Thread Block” are completely different — the former is a hardware partition inside an SM, the latter is a programmer-defined software grouping. The names are similar but there is no one-to-one correspondence.
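The occupancy mentioned in the mapping above can be asked of the runtime directly. A minimal sketch (the kernel and the 256-thread block size are illustrative; real kernels with heavy register or shared-memory use will report fewer resident blocks):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial stand-in kernel; occupancy depends on its register/shared-memory use.
__global__ void dummy_kernel(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main() {
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, dummy_kernel,
        /*blockSize=*/256, /*dynamicSMemSize=*/0);

    printf("Resident blocks per SM: %d\n", blocks_per_sm);
    printf("Resident warps per SM:  %d (hardware max: 64)\n",
           blocks_per_sm * 256 / 32);
    return 0;
}
```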
Why do we need both warp and thread block as two layers of abstraction?
- Warp (32 threads) solves execution efficiency: Under the SIMT model, 32 threads in a warp share a single instruction fetch/decode unit, with hardware dispatching instructions uniformly, while each thread maintains its own independent registers and PC. This way, only 1 set of control logic drives 32 ALUs, dramatically saving transistor area. If a 256-thread thread block didn’t split into warps, it would need 256 independent fetch/decode units — which is impractical
- Thread Block (e.g., 256 threads) solves cooperation needs: Many algorithms need more than 32 threads to share data (e.g., GEMM tiling requires many threads to cooperatively load data into shared memory). Thread blocks provide shared memory and __syncthreads() synchronization, enabling multiple warps to cooperate
In short: thread blocks manage cooperation, warps manage execution.
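A minimal sketch that uses both layers at once — a block-level sum reduction (the kernel is illustrative and assumes a 256-thread block): each warp reduces its own 32 values through register shuffles, and the thread block stitches the per-warp results together through shared memory and __syncthreads():

```cuda
__global__ void block_sum(const float* in, float* out) {
    __shared__ float partial[256 / 32];      // one slot per warp (assumes 256 threads)

    int tid  = threadIdx.x;
    int lane = tid % 32;
    int warp = tid / 32;

    float v = in[blockIdx.x * blockDim.x + tid];

    // Warp level (execution): register-to-register shuffles; the full mask
    // 0xffffffff names all 32 lanes of the warp as participants.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);

    // Block level (cooperation): one value per warp goes through shared memory.
    if (lane == 0) partial[warp] = v;
    __syncthreads();

    // The first warp reduces the eight per-warp partial sums.
    if (warp == 0) {
        v = (lane < blockDim.x / 32) ? partial[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) out[blockIdx.x] = v;
    }
}
```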
How Intel GPUs Differ
The above describes NVIDIA’s SIMT model. Intel Xe2 GPUs take a fundamentally different approach — compiler-driven SIMD. Here’s how the concepts map:
| NVIDIA (SIMT) | Intel Xe2 (SIMD) | Key Difference |
|---|---|---|
| Thread Block | Work-group | Similar: shared local memory, synchronizable |
| Warp (32 threads) | Sub-group (8/16/32 lanes) | NVIDIA groups at runtime in hardware; Intel vectorizes at compile time |
| SM | Xe-core | Similar: contains multiple execution units and shared memory |
| Processing Block | EU (Execution Unit) | NVIDIA has warp scheduler; Intel EU directly executes SIMD instructions |
| Shared Memory | SLM (Shared Local Memory) | Similar: shared within work-group |
| Branch handling: runtime hardware masking | Branch handling: compile-time predication | NVIDIA masks inactive threads in hardware at runtime; Intel’s compiler converts if/else to predicated instructions — all lanes execute both paths but a mask controls writeback |
The core difference: NVIDIA programmers write scalar code, and hardware automatically groups 32 threads into a warp for lockstep execution, dynamically masking on branches. Intel programmers also write scalar code, but the compiler is responsible for vectorizing it into SIMD instructions, with branches converted to predicated execution at compile time. NVIDIA offers more flexibility (hardware handles divergence); Intel’s approach relies more on the compiler but keeps hardware simpler.
For the detailed Intel Xe2 execution model, sub-group operations, and SYCL programming mappings, see Xe2 Execution Model and Programming Abstractions.
Section 5: Memory Hierarchy
The GPU’s memory hierarchy is the single biggest factor affecting performance. From fastest to slowest:
| Level | Capacity (H100) | Bandwidth | Latency | Scope |
|---|---|---|---|---|
| Register File | 256 KB / SM | Extremely high (on-chip) | 0 cycles | Thread-private |
| Shared Memory / L1 | 256 KB / SM (up to 228 KB configurable as shared memory) | Extremely high (on-chip) | ~20-30 cycles | Shared within Thread Block |
| L2 Cache | 50 MB | ~12 TB/s (theoretical) | ~200 cycles | Globally shared |
| HBM3 | 80 GB | 3.35 TB/s | ~400-600 cycles | Global |
Key numbers: the latency gap from register to HBM is several hundred times, and the bandwidth gap between on-chip memories and HBM is several-fold (by the table above, L2 alone offers roughly 3-4x HBM bandwidth). This means:
- Keeping data in registers and shared memory as long as possible is the core of optimization
- Reading data from HBM is expensive — once loaded, it should be reused as many times as possible
- This is the fundamental motivation behind tiling strategies, which we’ll cover in detail in the GEMM optimization article
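The shape of that tiling strategy, in miniature (a hypothetical kernel, assuming a 256-thread block; a real tiled GEMM is considerably more involved, as the later article will show): each block pays for one HBM load per element, then reuses the on-chip copy 256 times.

```cuda
__global__ void tile_reuse(const float* x, float* out) {
    __shared__ float tile[256];                // assumes blockDim.x == 256

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = x[i];                  // one HBM load per thread
    __syncthreads();

    float sum = 0.0f;
    for (int j = 0; j < 256; ++j)              // 256 reads, all served from SRAM
        sum += tile[j];

    out[i] = sum;                              // stand-in for a real tiled computation
}
```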
Flash Attention’s core innovation is precisely based on this memory hierarchy — tiling the Attention computation into SRAM (shared memory) to avoid writing large intermediate matrices back to HBM. See Flash Attention Tiling Principles for details.
Summary
The core design philosophy of GPU architecture can be summarized in three points:
- Throughput-first — Thousands of simple cores, sacrificing single-thread latency for total throughput
- Latency hiding — Don’t reduce wait time; instead, fill the waiting with massive numbers of threads (warps). This requires the register file to permanently hold all warp states
- Memory hierarchy — Latency differs by hundreds of times from register to HBM; optimization = keep data in fast storage as much as possible
The next article will dive deep into the GPU’s most important specialized acceleration units — Tensor Core and XMX, understanding why they can make matrix multiplication an order of magnitude faster.