GPU Architecture — From Transistors to Threads
Updated 2026-04-06
In the AI Compute Stack Overview article, we walked through the seven-layer software stack, from inference frameworks down to the hardware ISA. The bottommost layer in that article was “Hardware ISA — the only thing a GPU can execute.”
This article drills one level deeper: what GPU hardware itself looks like. Understanding hardware architecture is a prerequisite for writing high-performance GPU code — you need to know what physical structures instructions execute on before you can make correct optimization decisions.
Section 1: GPU vs CPU — Two Design Philosophies
CPUs and GPUs face two fundamentally different computing demands, and thus took entirely different design paths.
CPU design goal: Minimize the latency of a single task. To achieve this, CPUs invest a large transistor budget in:
- Large multi-level caches (L1/L2/L3, sometimes 50%+ of chip area) — keeping data as close to compute units as possible
- Complex control logic — branch prediction, out-of-order execution, register renaming, making each individual thread run as fast as possible
- A small number of high-performance cores (4-8) — each core a piece of “heavy infantry”
GPU design goal: Maximize total throughput. A completely different strategy:
- Massive numbers of simple ALUs — thousands of small compute units, each weak individually, but with extremely high aggregate throughput
- Minimized control logic — no complex branch prediction or out-of-order execution, control logic takes a minimal share
- Small caches, using thread switching to hide latency — instead of relying on cache hits to speed things up when encountering memory waits, the GPU immediately switches to another group of threads to continue execution
Latency Hiding: Two Radically Different Strategies
When a CPU encounters a cache miss (data not in cache), the thread stalls — doing nothing, waiting idly for data to return from memory. This waiting time is completely wasted. The CPU’s countermeasure is: use larger caches to reduce the probability of misses.
The GPU’s countermeasure is entirely different: don’t reduce the waiting, but do other work during the wait. A GPU has thousands of threads active simultaneously (grouped into warps). When one group of threads is waiting for memory, the hardware warp scheduler immediately switches to another group of threads that are ready. As long as there are enough concurrently active threads, the waiting time can be completely filled.
This is why GPUs need massive numbers of threads — not because there are that many independent tasks to do, but because enough threads are needed to mask memory latency. This concept is called latency hiding, and it is the most fundamental idea for understanding GPU architecture.
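How many threads is “enough”? A rough order-of-magnitude estimate via Little’s law, using the H100 bandwidth and latency figures quoted later in this article (the ~300 ns effective latency and the 128-byte transaction size are illustrative assumptions, not spec values):

```latex
\text{data in flight} \approx \text{bandwidth} \times \text{latency}
                      \approx 3.35\ \text{TB/s} \times 300\ \text{ns} \approx 1\ \text{MB}

\text{outstanding requests} \approx 1\ \text{MB} \,/\, 128\ \text{B} \approx 8000
```

Keeping thousands of memory requests in flight at all times is only possible when tens of thousands of threads each contribute their own independent loads — something a handful of CPU-style threads could never do.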
Section 2: NVIDIA GPU Global Structure
Using the NVIDIA H100 as an example, here is the hierarchical structure from chip level to the smallest compute unit:
GPU Chip → GPC (Graphics Processing Cluster) → TPC (Texture Processing Cluster) → SM (Streaming Multiprocessor)
- GPC is the largest logical grouping. The H100 has 8 GPCs.
- Each GPC contains several TPCs (Texture Processing Clusters). TPC is a legacy name (from graphics rendering); in compute scenarios, it’s mainly a container for SMs.
- Each TPC contains 2 SMs — the SM is the GPU’s core compute unit and the level we most need to understand.
- All SMs share a large L2 Cache (~50MB on H100).
- Beyond L2 is HBM (High Bandwidth Memory) — the GPU’s main memory.
The full GH100 die: 8 GPC × 9 TPC × 2 SM = 144 SMs, of which 132 are enabled on the H100 SXM part, alongside 50 MB of L2 and 80 GB of HBM3.
For comparison: the RTX 4090 has 128 SMs, 72 MB of L2, and 24 GB of GDDR6X. The SM-level design is closely related (Ada and Hopper are sibling architectures), but the scale and the memory technology differ.
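These numbers can be read off a given card at runtime. A minimal sketch using the CUDA runtime API (the fields are standard members of cudaDeviceProp; the printed values depend on the GPU you run it on):

```cuda
// Query the hierarchy described above: SM count, L2 size, memory, per-SM limits.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    printf("Name:               %s\n",  prop.name);
    printf("SM count:           %d\n",  prop.multiProcessorCount);
    printf("L2 cache:           %d MB\n", prop.l2CacheSize >> 20);
    printf("Global memory:      %zu GB\n", prop.totalGlobalMem >> 30);
    printf("Registers per SM:   %d x 32-bit = %d KB\n",
           prop.regsPerMultiprocessor, prop.regsPerMultiprocessor * 4 >> 10);
    printf("Shared mem per SM:  %zu KB\n", prop.sharedMemPerMultiprocessor >> 10);
    printf("Max threads per SM: %d\n",  prop.maxThreadsPerMultiProcessor);
    printf("Warp size:          %d\n",  prop.warpSize);
    return 0;
}
```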
Section 3: Inside the SM
The SM (Streaming Multiprocessor) is the unit within the GPU that actually executes computation. Understanding the SM’s internal structure means understanding the GPU’s execution model.
Four Processing Blocks
Each SM is divided into 4 Processing Blocks (also called Sub-partitions), each containing:
- 1 Warp Scheduler + 1 Dispatch Unit — selects and issues instructions
- FP32 CUDA Cores — perform floating-point multiply-add (32 per block on Hopper, 128 per SM total)
- INT32 Cores — integer operations
- Tensor Cores — matrix multiply-accumulate acceleration units (detailed in the next article)
- SFU (Special Function Unit) — computes transcendental functions like sin/cos/exp
- Load/Store Units — load/store data from/to memory
Processing Block Local Resources
Each Processing Block also has its own register file:
- Register File — 256 KB total per SM, physically distributed across the processing blocks (NVIDIA architecture whitepapers show each sub-partition with its own independent register file in SM diagrams). This is the fastest storage on the GPU; each thread can use up to 255 32-bit registers. A warp’s registers are stored in the local register file of its assigned processing block, and warps do not migrate between processing blocks once assigned. The register file is large because the GPU needs to simultaneously maintain the register state of thousands of threads (the key to achieving zero-overhead warp switching)
SM-Level Shared Resources
The 4 Processing Blocks share:
- Shared Memory / L1 Cache — 256 KB (Hopper) of on-chip SRAM, of which up to 228 KB can be configured as shared memory, with the remainder used as L1 cache; the split is configurable per kernel. It physically belongs to the entire SM but is logically partitioned by thread block — all threads within the same thread block (across all warps, across all processing blocks) access the same shared memory region, while different thread blocks cannot see each other’s shared memory
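From the programming side, the split shows up as two attributes on a kernel. A minimal sketch, assuming a hypothetical kernel that wants a large dynamic shared-memory allocation (the 200 KB figure is just an example below the ~228 KB limit; the carveout attribute is a hint to the driver, not a guarantee):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel using dynamically sized shared memory.
__global__ void my_kernel(float* out) {
    extern __shared__ float tile[];          // size set at launch time
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

void launch(float* out) {
    size_t smem_bytes = 200 * 1024;

    // Dynamic shared memory beyond 48 KB per block requires an explicit opt-in.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smem_bytes);

    // Hint that this kernel prefers shared memory over L1 in the split.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    my_kernel<<<dim3(132), dim3(256), smem_bytes>>>(out);
}
```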
Generational Evolution
| Resource | Ampere (A100) 2020 | Hopper (H100) 2022 | Blackwell (B200) 2024 |
|---|---|---|---|
| SM Count per GPU | 108 | 132↑ | 192↑ |
| FP32 Core / SM | 64 | 128↑ | 128 |
| INT32 Core / SM | 64 | 64 | 64 |
| Tensor Core / SM | 4 (3rd gen) | 4 (4th gen)↑ | 4 (5th gen)↑ |
| Tensor Core Precision | FP16/BF16/TF32/INT8/INT4 | FP16/BF16/TF32/FP8/INT8↑ | FP16/BF16/TF32/FP8/FP4/INT8↑ |
| Register File / SM | 256 KB | 256 KB | 256 KB |
| Shared Memory / SM | 164 KB | 228 KB↑ | 228 KB |
| L1 Cache / SM (unified with shared memory) | 192 KB | 256 KB↑ | 256 KB |
| Max Warps / SM | 64 | 64 | 64 |
| Max Threads / SM | 2048 | 2048 | 2048 |
Arrows (↑) mark changes from the previous generation.
Section 4: Warp — The GPU’s Minimum Execution Unit
What Is a Warp
A Warp = a group of 32 threads, the minimum unit of GPU scheduling and execution. The 32 threads in a warp execute the same instruction in lockstep at the hardware level (SIMT — Single Instruction, Multiple Threads).
Think of it like a platoon of soldiers marching in step — 32 people step left simultaneously, step right simultaneously. They execute the same “instruction” (cadence) but each on different “data” (ground position).
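In CUDA code the grouping is implicit but easy to compute. A minimal sketch, assuming a one-dimensional thread block (warpSize is a built-in variable, equal to 32 on all current NVIDIA GPUs):

```cuda
// Inside any kernel: which warp does this thread belong to, and where in it?
int lane_id       = threadIdx.x % warpSize;  // 0..31: position within the warp
int warp_in_block = threadIdx.x / warpSize;  // warp index within the thread block
```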
Each Processing Block’s Warp Scheduler selects one ready warp (a warp not waiting on memory) each cycle and issues one instruction for it. The 4 Processing Blocks can each select a warp simultaneously, so an SM can issue up to 4 instructions per cycle.
Warp Divergence
All 32 threads must execute the same instruction — so what happens if the code has an if/else?
This is Warp Divergence: when threads within a warp take different branch paths, the hardware cannot execute both paths simultaneously. The solution is serialization — first execute the if-path (threads not taking the if-path are masked out), then execute the else-path (threads not taking the else-path are masked out). The time for both paths is added together.
By contrast, when there is no branching, all 32 threads of a warp simply execute the same instruction on different data — for example, every thread runs z[i] = x[i] + y[i] on its own element, at full efficiency.
This is why GPU code should minimize branching — not that you can’t write if/else, but that all 32 threads in the same warp should ideally take the same path.
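A minimal illustration of the difference (both kernels are hypothetical; the second branches on a condition that is constant within each warp, so no warp diverges):

```cuda
// Divergent: odd and even lanes of the SAME warp take different paths,
// so the hardware runs the two paths one after the other with masking.
__global__ void divergent(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        x[i] = x[i] * 2.0f;   // lanes 0, 2, 4, ... active, the rest masked
    else
        x[i] = x[i] + 1.0f;   // lanes 1, 3, 5, ... active, the rest masked
}

// Divergence-free: threadIdx.x / 32 is identical for all 32 threads of a warp,
// so every warp takes exactly one of the two paths.
__global__ void uniform_branch(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;
}
```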
Warp Scheduler and Latency Hiding
The warp scheduler is the SM’s “scheduling brain.” Its job is simple: pick a ready warp and issue its next instruction.
When a warp executes a memory load instruction, it needs to wait hundreds of cycles for the data. The scheduler doesn’t wait — it immediately switches to another ready warp to continue execution. This switch has zero overhead (zero-overhead context switch), because all warps’ register states permanently reside in the SM’s register file, requiring no save/restore.
A typical sequence: the warp scheduler issues compute instructions for Warp 0; after 2 cycles Warp 0 hits a global memory access and must wait hundreds of cycles for the data, so the scheduler immediately issues instructions for Warp 1, then Warp 2, and so on — by the time it cycles back, Warp 0’s data has usually arrived and it is ready again.
Hardware Concepts vs Software Abstractions: How They Map
GPU programming has many concepts with “block” in the name, which can be confusing. The key is distinguishing hardware entities from software abstractions:
| Concept | Type | Description |
|---|---|---|
| SM | Hardware | GPU’s basic compute unit |
| Processing Block | Hardware | Sub-partition inside an SM (each has its own warp scheduler, CUDA cores, Tensor Core) |
| Grid | Software | All threads in a single kernel launch |
| Thread Block | Software | A programmer-defined group of threads (e.g., 256), sharing shared memory, synchronizable via __syncthreads() |
| Warp | Bridge | Every 32 consecutive threads within a thread block are automatically grouped into a warp, scheduled in lockstep by hardware |
Mapping:
- Thread Block → SM: A thread block is assigned as a whole to one SM, sharing that SM’s shared memory. One SM can run multiple thread blocks concurrently (depending on whether register and shared memory resources are sufficient — this is occupancy)
- Warp → Processing Block: Threads within a thread block are automatically divided into warps of 32, with each warp assigned to one of the SM’s processing block’s warp schedulers. Once assigned, a warp stays on that processing block and does not migrate — because the warp’s register state is stored in that processing block’s local register file, and migrating would mean moving up to 32 × 255 × 4 = ~32 KB of register data, which is prohibitively expensive. Switching between warps within the same processing block is zero-overhead — the scheduler simply switches to another warp’s register region with no save/restore needed. This contrasts with CPUs, where threads can migrate between cores (OS load balancing); GPU warps rely on static distribution and massive parallelism for load balancing
- Registers: Private to each thread. Threads within the same warp cannot directly access each other’s registers (but can exchange values through warp shuffle instructions like __shfl_sync)
Note: “Processing Block” and “Thread Block” are completely different — the former is a hardware partition inside an SM, the latter is a programmer-defined software grouping. The names are similar but there is no one-to-one correspondence.
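The occupancy mentioned in the mapping above can be asked of the runtime directly. A minimal sketch (the kernel and the 256-thread block size are illustrative; real kernels with heavy register or shared-memory use will report fewer resident blocks):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial stand-in kernel; occupancy depends on its register/shared-memory use.
__global__ void dummy_kernel(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main() {
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, dummy_kernel,
        /*blockSize=*/256, /*dynamicSMemSize=*/0);

    printf("Resident blocks per SM: %d\n", blocks_per_sm);
    printf("Resident warps per SM:  %d (hardware max: 64)\n",
           blocks_per_sm * 256 / 32);
    return 0;
}
```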
Why do we need both warp and thread block as two layers of abstraction?
- Warp (32 threads) solves execution efficiency: Under the SIMT model, 32 threads in a warp share a single instruction fetch/decode unit, with hardware dispatching instructions uniformly, while each thread maintains its own independent registers and PC. This way, only 1 set of control logic drives 32 ALUs, dramatically saving transistor area. If a 256-thread thread block didn’t split into warps, it would need 256 independent fetch/decode units — which is impractical
- Thread Block (e.g., 256 threads) solves cooperation needs: Many algorithms need more than 32 threads to share data (e.g., GEMM tiling requires many threads to cooperatively load data into shared memory). Thread blocks provide shared memory and __syncthreads() synchronization, enabling multiple warps to cooperate
In short: thread blocks manage cooperation, warps manage execution.
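A minimal sketch that uses both layers at once — a block-level sum reduction (the kernel is illustrative and assumes a 256-thread block): each warp reduces its own 32 values through register shuffles, and the thread block stitches the per-warp results together through shared memory and __syncthreads():

```cuda
__global__ void block_sum(const float* in, float* out) {
    __shared__ float partial[256 / 32];      // one slot per warp (assumes 256 threads)

    int tid  = threadIdx.x;
    int lane = tid % 32;
    int warp = tid / 32;

    float v = in[blockIdx.x * blockDim.x + tid];

    // Warp level (execution): register-to-register shuffles; the full mask
    // 0xffffffff names all 32 lanes of the warp as participants.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);

    // Block level (cooperation): one value per warp goes through shared memory.
    if (lane == 0) partial[warp] = v;
    __syncthreads();

    // The first warp reduces the eight per-warp partial sums.
    if (warp == 0) {
        v = (lane < blockDim.x / 32) ? partial[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) out[blockIdx.x] = v;
    }
}
```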
How Intel GPUs Differ
The above describes NVIDIA’s SIMT model. Intel Xe2 GPUs take a fundamentally different approach — compiler-driven SIMD. Here’s how the concepts map:
| NVIDIA (SIMT) | Intel Xe2 (SIMD) | Key Difference |
|---|---|---|
| Thread Block | Work-group | Similar: shared local memory, synchronizable |
| Warp (32 threads) | Sub-group (8/16/32 lanes) | NVIDIA groups at runtime in hardware; Intel vectorizes at compile time |
| SM | Xe-core | Similar: contains multiple execution units and shared memory |
| Processing Block | EU (Execution Unit) | NVIDIA has warp scheduler; Intel EU directly executes SIMD instructions |
| Shared Memory | SLM (Shared Local Memory) | Similar: shared within work-group |
| Branch handling: runtime hardware masking | Branch handling: compile-time predication | NVIDIA masks inactive threads in hardware at runtime; Intel’s compiler converts if/else to predicated instructions — all lanes execute both paths but a mask controls writeback |
The core difference: NVIDIA programmers write scalar code, and hardware automatically groups 32 threads into a warp for lockstep execution, dynamically masking on branches. Intel programmers also write scalar code, but the compiler is responsible for vectorizing it into SIMD instructions, with branches converted to predicated execution at compile time. NVIDIA offers more flexibility (hardware handles divergence); Intel’s approach relies more on the compiler but keeps hardware simpler.
For the detailed Intel Xe2 execution model, sub-group operations, and SYCL programming mappings, see Xe2 Execution Model and Programming Abstractions.
Section 5: Memory Hierarchy
The GPU’s memory hierarchy is the single biggest factor affecting performance. From fastest to slowest:
| Level | Capacity (H100) | Bandwidth | Latency | Scope |
|---|---|---|---|---|
| Register File | 256 KB / SM | Extremely high (on-chip) | 0 cycles | Thread-private |
| Shared Memory / L1 | 256 KB / SM (up to 228 KB configurable as shared memory) | Extremely high (on-chip) | ~20-30 cycles | Shared within Thread Block |
| L2 Cache | 50 MB | ~12 TB/s (theoretical) | ~200 cycles | Globally shared |
| HBM3 | 80 GB | 3.35 TB/s | ~400-600 cycles | Global |
Key numbers: the latency gap from register to HBM is several hundred times, and the bandwidth gap between on-chip memories and HBM is several-fold (by the table above, L2 alone offers roughly 3-4x HBM bandwidth). This means:
- Keeping data in registers and shared memory as long as possible is the core of optimization
- Reading data from HBM is expensive — once loaded, it should be reused as many times as possible
- This is the fundamental motivation behind tiling strategies, which we’ll cover in detail in the GEMM optimization article
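The shape of that tiling strategy, in miniature (a hypothetical kernel, assuming a 256-thread block; a real tiled GEMM is considerably more involved, as the later article will show): each block pays for one HBM load per element, then reuses the on-chip copy 256 times.

```cuda
__global__ void tile_reuse(const float* x, float* out) {
    __shared__ float tile[256];                // assumes blockDim.x == 256

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = x[i];                  // one HBM load per thread
    __syncthreads();

    float sum = 0.0f;
    for (int j = 0; j < 256; ++j)              // 256 reads, all served from SRAM
        sum += tile[j];

    out[i] = sum;                              // stand-in for a real tiled computation
}
```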
Flash Attention’s core innovation is precisely based on this memory hierarchy — tiling the Attention computation into SRAM (shared memory) to avoid writing large intermediate matrices back to HBM. See Flash Attention Tiling Principles for details.
Summary
The core design philosophy of GPU architecture can be summarized in three points:
- Throughput-first — Thousands of simple cores, sacrificing single-thread latency for total throughput
- Latency hiding — Don’t reduce wait time; instead, fill the waiting with massive numbers of threads (warps). This requires the register file to permanently hold all warp states
- Memory hierarchy — Latency differs by hundreds of times from register to HBM; optimization = keep data in fast storage as much as possible
The next article will dive deep into the GPU’s most important specialized acceleration units — Tensor Core and XMX, understanding why they can make matrix multiplication an order of magnitude faster.