
GPU Architecture — From Transistors to Threads


Updated 2026-04-06

In the AI Compute Stack Overview article, we walked through the seven-layer structure from inference frameworks down to the hardware ISA from a software-stack perspective. The bottommost layer in that article was “Hardware ISA — the only thing a GPU can execute.”

This article drills one level deeper: what GPU hardware itself looks like. Understanding hardware architecture is a prerequisite for writing high-performance GPU code — you need to know what physical structures instructions execute on before you can make correct optimization decisions.

Section 1: GPU vs CPU — Two Design Philosophies

CPUs and GPUs face two fundamentally different computing demands, and thus took entirely different design paths.

CPU design goal: Minimize the latency of a single task. To achieve this, CPUs invest a large transistor budget in:

  • Large multi-level caches (L1/L2/L3, sometimes 50%+ of chip area) — keeping data as close to compute units as possible
  • Complex control logic — branch prediction, out-of-order execution, register renaming, making each individual thread run as fast as possible
  • A small number of high-performance cores (4-8) — each core is a “heavy infantry”

GPU design goal: Maximize total throughput. A completely different strategy:

  • Massive numbers of simple ALUs — thousands of small compute units, each weak individually, but with extremely high aggregate throughput
  • Minimized control logic — no complex branch prediction or out-of-order execution, control logic takes a minimal share
  • Small caches, using thread switching to hide latency — instead of relying on cache hits to speed things up when encountering memory waits, the GPU immediately switches to another group of threads to continue execution
[Figure: CPU vs GPU transistor budgets. CPU (latency optimized): large L1/L2/L3 cache (~50% of the budget), heavy control logic (branch prediction, out-of-order execution), 4-8 strong ALU cores, memory controller + IO. GPU (throughput optimized): thousands of small ALUs across SMs/CUs (~80% of the budget), a small L2 cache, minimal control logic, memory controller. Caption: CPU — few strong cores, low latency; GPU — thousands of weak cores, high total throughput.]

Latency Hiding: Two Radically Different Strategies

When a CPU encounters a cache miss (data not in cache), the thread stalls — doing nothing, waiting idly for data to return from memory. This waiting time is completely wasted. The CPU’s countermeasure is: use larger caches to reduce the probability of misses.

The GPU’s countermeasure is entirely different: don’t reduce the waiting, but do other work during the wait. A GPU has thousands of threads active simultaneously (grouped into warps). When one group of threads is waiting for memory, the hardware warp scheduler immediately switches to another group of threads that are ready. As long as there are enough concurrently active threads, the waiting time can be completely filled.

[Figure: latency-hiding timelines. A single CPU thread alternates Exec / cache Miss / Wait, wasting cycles while data returns (~33% utilization). A GPU interleaves warps W0/W1/W2: when Warp 0 waits, the scheduler switches to Warp 1 or 2, leaving no idle cycles (~100% utilization).]

This is why GPUs need massive numbers of threads — not because there are that many independent tasks to do, but because enough threads are needed to mask memory latency. This concept is called latency hiding, and it is the most fundamental idea for understanding GPU architecture.
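The "enough threads to mask memory latency" argument can be put into a back-of-envelope formula. The sketch below is an illustrative model, not a hardware specification; the function name and all cycle counts are assumptions chosen for the example.

```python
# Illustrative latency-hiding model: how many warps must be resident so
# that while one warp waits on memory, the others supply enough compute
# work to fill the wait. All cycle counts here are assumptions.
import math

def warps_to_hide_latency(mem_latency_cycles: int, compute_cycles_per_warp: int) -> int:
    """One warp computes for `compute_cycles_per_warp` cycles, then stalls
    for `mem_latency_cycles`. Enough sibling warps must cover the stall."""
    return 1 + math.ceil(mem_latency_cycles / compute_cycles_per_warp)

# e.g. ~400-cycle HBM latency, 8 cycles of compute per warp between loads
print(warps_to_hide_latency(400, 8))  # 51 resident warps needed in this model
```

The takeaway matches the text: the longer the memory latency relative to the compute between loads, the more concurrent warps the hardware needs to stay busy.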


Section 2: NVIDIA GPU Global Structure

Using the NVIDIA H100 as an example, here is the hierarchical structure from chip level to the smallest compute unit:

GPU ChipGPC (Graphics Processing Cluster) → TPC (Texture Processing Cluster) → SM (Streaming Multiprocessor)

  • GPC is the largest logical grouping. The H100 has 8 GPCs.
  • Each GPC contains several TPCs (Texture Processing Clusters). TPC is a legacy name (from graphics rendering); in compute scenarios, it’s mainly a container for SMs.
  • Each TPC contains 2 SMs — the SM is the GPU’s core compute unit and the level we most need to understand.
  • All SMs share a large L2 Cache (~50MB on H100).
  • Beyond L2 is HBM (High Bandwidth Memory) — the GPU’s main memory.

H100 SXM full chip: 8 GPC x 9 TPC x 2 SM = 144 SM (132 actually enabled), 50 MB L2, 80 GB HBM3.

[Interactive diagram: H100 SXM — 8 GPCs of TPC × SM blocks, 132 SMs total, a shared 50 MB L2 Cache, and 80 GB of HBM3 at 3.35 TB/s.]

For comparison: the RTX 4090 has 128 SMs, 96 MB of L2, and 24 GB of GDDR6X. The SM-centric architecture is the same (Ada for the 4090, Hopper for the H100), but scale and memory technology differ.
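The hierarchy arithmetic quoted above is easy to sanity-check. This snippet simply multiplies out the H100 figures stated in this section (and the 128 FP32 cores per SM given in Section 3):

```python
# Sanity-check the H100 SXM hierarchy arithmetic from the text.
gpcs, tpcs_per_gpc, sms_per_tpc = 8, 9, 2
full_die_sms = gpcs * tpcs_per_gpc * sms_per_tpc
print(full_die_sms)               # 144 SMs on the full die

enabled_sms = 132                 # SMs actually enabled on shipping parts
fp32_per_sm = 128                 # FP32 CUDA cores per Hopper SM (Section 3)
print(enabled_sms * fp32_per_sm)  # 16896 FP32 cores across the GPU
```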


Section 3: Inside the SM

The SM (Streaming Multiprocessor) is the unit within the GPU that actually executes computation. Understanding the SM’s internal structure means understanding the GPU’s execution model.

Four Processing Blocks

Each SM is divided into 4 Processing Blocks (also called Sub-partitions), each containing:

  • 1 Warp Scheduler + 1 Dispatch Unit — selects and issues instructions
  • FP32 CUDA Cores — perform floating-point multiply-add (32 per block on Hopper, 128 per SM total)
  • INT32 Cores — integer operations
  • Tensor Cores — matrix multiply-accumulate acceleration units (detailed in the next article)
  • SFU (Special Function Unit) — computes transcendental functions like sin/cos/exp
  • Load/Store Units — load/store data from/to memory
[Diagram: Streaming Multiprocessor (SM), Hopper architecture — 4 Processing Blocks (Sub-partitions), each with 1 Warp Scheduler, 1 Dispatch Unit, 32 FP32 CUDA Cores, 16 INT32 Cores, 16 FP64 Cores, 1 Tensor Core, 4 SFUs (sin/cos/exp), and 8 Load/Store Units. SM-wide: a 256 KB Register File (max 255 32-bit registers per thread) and 256 KB of Shared Memory / L1 Cache with a configurable split (up to 228 KB as shared memory). Legend: control units, compute units, special units, memory units.]

Processing Block Local Resources

Each Processing Block also has its own register file:

  • Register File — 256 KB per SM, physically distributed across the four processing blocks (NVIDIA’s architecture whitepapers show each sub-partition with its own independent register file). This is the fastest storage on the GPU; each thread can use up to 255 32-bit registers. A warp’s registers live in the register file of its assigned processing block, and warps do not migrate between processing blocks once assigned. The register file is this large because the GPU must simultaneously hold the register state of thousands of threads: this is the key to zero-overhead warp switching

SM-Level Shared Resources

The 4 Processing Blocks share:

  • Shared Memory / L1 Cache — 256 KB (Hopper) of on-chip SRAM, of which up to 228 KB can be configured as shared memory with the remainder used as L1 cache. Physically belongs to the entire SM, but logically partitioned by thread block — all threads within the same thread block (across all warps, across all processing blocks) can access the same shared memory region, while different thread blocks cannot see each other’s shared memory. The allocation ratio between shared memory and L1 cache is configurable
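Putting the SM-level numbers together gives a quick, illustrative occupancy calculation. It assumes only the figures quoted in this article (256 KB register file, 2048-thread residency limit, 32-bit registers); real hardware additionally allocates registers in coarser granules, so treat this as a sketch:

```python
# Back-of-envelope occupancy: the 256 KB register file must hold the
# registers of every resident thread, so per-thread register usage caps
# how many threads an SM can keep live at once.
REG_FILE_BYTES = 256 * 1024    # per SM (Hopper)
MAX_THREADS_PER_SM = 2048
BYTES_PER_REG = 4              # 32-bit registers

def max_resident_threads(regs_per_thread: int) -> int:
    by_registers = REG_FILE_BYTES // (regs_per_thread * BYTES_PER_REG)
    return min(by_registers, MAX_THREADS_PER_SM)

print(max_resident_threads(32))  # 2048: full occupancy
print(max_resident_threads(64))  # 1024: register-limited occupancy
```

This is why heavy register use in a kernel directly reduces how many warps are available for latency hiding.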

Generational Evolution

| Resource | Ampere (A100), 2020 | Hopper (H100), 2022 | Blackwell (B200), 2024 |
|---|---|---|---|
| SMs per GPU | 108 | 132 | 192 |
| FP32 Cores / SM | 64 | 128 | 128 |
| INT32 Cores / SM | 64 | 64 | 64 |
| Tensor Cores / SM | 4 (3rd gen) | 4 (4th gen) | 4 (5th gen) |
| Tensor Core precisions | FP16/BF16/TF32/INT8/INT4 | FP16/BF16/TF32/FP8/INT8 | FP16/BF16/TF32/FP8/FP4/INT8 |
| Register File / SM | 256 KB | 256 KB | 256 KB |
| Max Shared Memory / SM | 164 KB | 228 KB | 228 KB |
| L1 + Shared Memory (combined) / SM | 192 KB | 256 KB | 256 KB |
| Max Warps / SM | 64 | 64 | 64 |
| Max Threads / SM | 2048 | 2048 | 2048 |



Section 4: Warp — The GPU’s Minimum Execution Unit

What Is a Warp

A Warp = a group of 32 threads, the minimum unit of GPU scheduling and execution. The 32 threads in a warp execute the same instruction in lockstep at the hardware level (SIMT — Single Instruction, Multiple Threads).

Think of it like a platoon of soldiers marching in step — 32 people step left simultaneously, step right simultaneously. They execute the same “instruction” (cadence) but each on different “data” (ground position).

Each Processing Block’s Warp Scheduler selects one ready warp (a warp not waiting on memory) each cycle and issues one instruction for it. The 4 Processing Blocks can each select a warp simultaneously, so an SM can issue up to 4 instructions per cycle.

Warp Divergence

All 32 threads must execute the same instruction — so what happens if the code has an if/else?

This is Warp Divergence: when threads within a warp take different branch paths, the hardware cannot execute both paths simultaneously. The solution is serialization — first execute the if-path (threads not taking the if-path are masked out), then execute the else-path (threads not taking the else-path are masked out). The time for both paths is added together.
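The serialization rule above can be captured in a toy timing model. This is an illustrative sketch (function names and cycle counts are assumptions, not hardware behavior measured on any GPU):

```python
# Toy model of warp divergence: if any lane takes a different path, the
# warp pays for both paths serially; if all 32 lanes agree, it pays only
# for the taken path.
def warp_branch_cycles(lane_takes_if, if_cycles: int, else_cycles: int) -> int:
    takes = set(lane_takes_if)        # one bool per lane
    if takes == {True}:               # uniform: whole warp takes the if-path
        return if_cycles
    if takes == {False}:              # uniform: whole warp takes the else-path
        return else_cycles
    return if_cycles + else_cycles    # diverged: the two paths are serialized

uniform  = [True] * 32
diverged = [i % 2 == 0 for i in range(32)]   # even lanes take if, odd take else
print(warp_branch_cycles(uniform, 10, 10))   # 10 cycles
print(warp_branch_cycles(diverged, 10, 10))  # 20 cycles: a 2x slowdown
```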

Normal Execution: 32 Threads in Lockstep

A Warp contains 32 threads executing the same instruction (SIMT). All threads synchronously execute z[i] = x[i] + y[i]

[Diagram: Warp 0 — threads 0-31 all active, each executing z[i] = x[i] + y[i]; hardware efficiency 100%. Legend: active vs. masked (waiting).]

This is why GPU code should minimize branching — not that you can’t write if/else, but that all 32 threads in the same warp should ideally take the same path.

Warp Scheduler and Latency Hiding

The warp scheduler is the SM’s “scheduling brain.” Its job is simple: pick a ready warp and issue its next instruction.

When a warp executes a memory load instruction, it needs to wait hundreds of cycles for the data. The scheduler doesn’t wait — it immediately switches to another ready warp to continue execution. This switch has zero overhead (zero-overhead context switch), because all warps’ register states permanently reside in the SM’s register file, requiring no save/restore.

Cycle 0-1: Warp 0 Executes

The Warp Scheduler selects Warp 0 to execute compute instructions. After 2 cycles, Warp 0 hits a global memory access and must wait hundreds of cycles.

[Diagram: interleaved timeline for Warps 0-3 — each warp alternates 2 RUN cycles with several MEM-wait cycles, staggered so that some warp is always running.]
Key: While Warp 0 waits for memory, it doesn't block the entire SM — scheduler immediately switches to another ready warp
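The timeline above can be reproduced with a minimal round-robin scheduler simulation. This is purely illustrative (the 2-cycle run burst, 4-cycle memory wait, and single-issue scheduler are assumptions for the example, not Hopper specifics):

```python
# Minimal warp-scheduler simulation: each warp RUNs for 2 cycles, then
# waits 4 cycles on memory. Each cycle the scheduler issues for the
# first ready warp, so one warp's stall is overlapped by the others.
def simulate(num_warps: int, cycles: int, run_len: int = 2, mem_len: int = 4) -> float:
    remaining_run = [run_len] * num_warps  # compute cycles left in current burst
    ready_at = [0] * num_warps             # cycle at which each warp's data arrives
    issued = 0
    for cycle in range(cycles):
        for w in range(num_warps):         # pick the first ready warp
            if ready_at[w] <= cycle:
                issued += 1
                remaining_run[w] -= 1
                if remaining_run[w] == 0:  # burst done: issue the load, then stall
                    remaining_run[w] = run_len
                    ready_at[w] = cycle + 1 + mem_len
                break
    return issued / cycles                 # fraction of cycles with an instruction issued

print(f"1 warp : {simulate(1, 60):.2f} utilization")   # ~0.33, stalls dominate
print(f"3 warps: {simulate(3, 60):.2f} utilization")   # 1.00, stalls fully hidden
```

With a 2:4 compute-to-wait ratio, exactly three warps fill every cycle, which is the same counting argument as the latency-hiding formula in Section 1.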

Hardware Concepts vs Software Abstractions: How They Map

GPU programming has many concepts with “block” in the name, which can be confusing. The key is distinguishing hardware entities from software abstractions:

| Concept | Type | Description |
|---|---|---|
| SM | Hardware | The GPU’s basic compute unit |
| Processing Block | Hardware | Sub-partition inside an SM (each has its own warp scheduler, CUDA cores, Tensor Core) |
| Grid | Software | All threads in a single kernel launch |
| Thread Block | Software | A programmer-defined group of threads (e.g., 256), sharing shared memory, synchronizable via __syncthreads() |
| Warp | Bridge | Every 32 consecutive threads within a thread block, automatically grouped and scheduled in lockstep by hardware |

Mapping:

  • Thread Block → SM: A thread block is assigned as a whole to one SM, sharing that SM’s shared memory. One SM can run multiple thread blocks concurrently (depending on whether register and shared memory resources are sufficient — this is occupancy)
  • Warp → Processing Block: Threads within a thread block are automatically divided into warps of 32, with each warp assigned to one of the SM’s processing block’s warp schedulers. Once assigned, a warp stays on that processing block and does not migrate — because the warp’s register state is stored in that processing block’s local register file, and migrating would mean moving up to 32 × 255 × 4 = ~32 KB of register data, which is prohibitively expensive. Switching between warps within the same processing block is zero-overhead — the scheduler simply switches to another warp’s register region with no save/restore needed. This contrasts with CPUs, where threads can migrate between cores (OS load balancing); GPU warps rely on static distribution and massive parallelism for load balancing
  • Registers: Private to each thread. Threads within the same warp cannot directly access each other’s registers (but can exchange values through warp shuffle instructions like __shfl_sync)
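The shuffle idea in the last bullet can be emulated in plain Python: lanes cannot read each other's registers, but a shuffle moves register values between lanes, enabling a tree-style sum reduction in log2(32) = 5 steps. This mirrors how CUDA's __shfl_down_sync is typically used; the emulation itself is a sketch, and like the real instruction, out-of-range lanes simply keep their own value.

```python
# Pure-Python emulation of a warp shuffle-down reduction across 32 lanes.
WARP_SIZE = 32

def shfl_down(values, delta):
    """Each lane i receives lane i+delta's value (its own if out of range),
    mimicking the behavior of a shuffle-down instruction."""
    return [values[i + delta] if i + delta < WARP_SIZE else values[i]
            for i in range(WARP_SIZE)]

def warp_reduce_sum(values):
    assert len(values) == WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:                     # 5 steps: offsets 16, 8, 4, 2, 1
        shifted = shfl_down(values, offset)
        values = [a + b for a, b in zip(values, shifted)]
        offset //= 2
    return values[0]                      # lane 0 ends up with the warp-wide sum

print(warp_reduce_sum(list(range(32))))   # 496 = 0 + 1 + ... + 31
```

Only lane 0's result is meaningful at the end; the upper lanes accumulate garbage, exactly as in the real shuffle-reduction idiom.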

Note: “Processing Block” and “Thread Block” are completely different — the former is a hardware partition inside an SM, the latter is a programmer-defined software grouping. The names are similar but there is no one-to-one correspondence.

Why do we need both warp and thread block as two layers of abstraction?

  • Warp (32 threads) solves execution efficiency: Under the SIMT model, 32 threads in a warp share a single instruction fetch/decode unit, with hardware dispatching instructions uniformly, while each thread maintains its own independent registers and PC. This way, only 1 set of control logic drives 32 ALUs, dramatically saving transistor area. If a 256-thread thread block didn’t split into warps, it would need 256 independent fetch/decode units — which is impractical
  • Thread Block (e.g., 256 threads) solves cooperation needs: Many algorithms need more than 32 threads to share data (e.g., GEMM tiling requires many threads to cooperatively load data into shared memory). Thread blocks provide shared memory and __syncthreads() synchronization, enabling multiple warps to cooperate

In short: thread blocks manage cooperation, warps manage execution.

How Intel GPUs Differ

The above describes NVIDIA’s SIMT model. Intel Xe2 GPUs take a fundamentally different approach — compiler-driven SIMD. Here’s how the concepts map:

| NVIDIA (SIMT) | Intel Xe2 (SIMD) | Key Difference |
|---|---|---|
| Thread Block | Work-group | Similar: shared local memory, synchronizable |
| Warp (32 threads) | Sub-group (8/16/32 lanes) | NVIDIA groups at runtime in hardware; Intel vectorizes at compile time |
| SM | Xe-core | Similar: contains multiple execution units and shared memory |
| Processing Block | EU (Execution Unit) | NVIDIA has a warp scheduler; Intel EUs directly execute SIMD instructions |
| Shared Memory | SLM (Shared Local Memory) | Similar: shared within a work-group |
| Branch handling | Branch handling | NVIDIA: runtime hardware masking; Intel: compile-time predication (the compiler converts if/else to predicated instructions — all lanes execute, a mask controls writeback) |

The core difference: NVIDIA programmers write scalar code, and hardware automatically groups 32 threads into a warp for lockstep execution, dynamically masking on branches. Intel programmers also write scalar code, but the compiler is responsible for vectorizing it into SIMD instructions, with branches converted to predicated execution at compile time. NVIDIA offers more flexibility (hardware handles divergence); Intel’s approach relies more on the compiler but keeps hardware simpler.
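The two branch strategies can be contrasted with a toy model. Both produce identical per-lane results; they differ in when the selection happens (two serialized masked passes vs. computing both results and predicating the writeback). The functions and the x > 0 branch are invented for illustration:

```python
# Toy comparison of the two branch-handling strategies described above.
def simt_masked(xs):
    # NVIDIA-style: run the if-path with one active mask, then the
    # else-path with the inverted mask (two serialized passes).
    out = [None] * len(xs)
    for i, x in enumerate(xs):    # pass 1: lanes where x > 0 are active
        if x > 0:
            out[i] = x * 2
    for i, x in enumerate(xs):    # pass 2: the remaining lanes are active
        if x <= 0:
            out[i] = -x
    return out

def simd_predicated(xs):
    # Intel-style: every lane computes both results; a predicate mask
    # selects which result is written back.
    both = [(x * 2, -x) for x in xs]
    pred = [x > 0 for x in xs]
    return [t if p else f for (t, f), p in zip(both, pred)]

lanes = [3, -1, 0, 5]
print(simt_masked(lanes))        # [6, 1, 0, 10]
print(simd_predicated(lanes))    # [6, 1, 0, 10]: same result, different mechanism
```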

For the detailed Intel Xe2 execution model, sub-group operations, and SYCL programming mappings, see Xe2 Execution Model and Programming Abstractions.


Section 5: Memory Hierarchy

The GPU’s memory hierarchy is the single biggest factor affecting performance. From fastest to slowest:

| Level | Capacity (H100) | Bandwidth | Latency | Scope |
|---|---|---|---|---|
| Register File | 256 KB / SM | Extremely high (on-chip) | ~0 cycles | Thread-private |
| Shared Memory / L1 | 256 KB / SM (up to 228 KB configurable as shared memory) | Extremely high (on-chip) | ~20-30 cycles | Shared within a thread block |
| L2 Cache | 50 MB | ~12 TB/s (theoretical) | ~200 cycles | Shared by all SMs |
| HBM3 | 80 GB | 3.35 TB/s | ~400-600 cycles | Global |
[Diagram: GPU memory pyramid (H100 reference) — Register File at the top, then Shared Memory / L1, then the globally shared L2, then HBM3 at the base; each level down is larger but slower. Caption: the optimization key is keeping data near the top of the pyramid (registers / shared memory).]

Key numbers: latency from registers to HBM differs by several hundred times, and aggregate on-chip bandwidth exceeds HBM bandwidth by roughly an order of magnitude. This means:

  • Keeping data in registers and shared memory as long as possible is the core of optimization
  • Reading data from HBM is expensive — once loaded, it should be reused as many times as possible
  • This is the fundamental motivation behind tiling strategies, which we’ll cover in detail in the GEMM optimization article
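The "load once, reuse many times" motivation behind tiling reduces to counting HBM reads. The sketch below uses a square GEMM C = A @ B with illustrative sizes (the function names and the simplified counting model are assumptions; a full treatment comes in the GEMM article):

```python
# Rough HBM-traffic arithmetic behind tiling, counting element reads for
# a square n x n GEMM under a simplified model.
def hbm_reads_naive(n: int) -> int:
    # No reuse: each of the n^2 outputs re-reads a length-n row of A and
    # a length-n column of B from HBM.
    return 2 * n ** 3

def hbm_reads_tiled(n: int, tile: int) -> int:
    # Each (tile x tile) output block streams n/tile pairs of tiles from
    # HBM into shared memory, where every element is reused ~tile times.
    blocks = (n // tile) ** 2
    return blocks * (n // tile) * 2 * tile * tile

n, tile = 4096, 128
print(hbm_reads_naive(n) // hbm_reads_tiled(n, tile))  # 128: reuse factor = tile size
```

In this model the HBM traffic shrinks by exactly the tile size, which is why larger tiles (bounded by shared memory capacity) are so valuable.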

Flash Attention’s core innovation is precisely based on this memory hierarchy — tiling the Attention computation into SRAM (shared memory) to avoid writing large intermediate matrices back to HBM. See Flash Attention Tiling Principles for details.


Summary

The core design philosophy of GPU architecture can be summarized in three points:

  1. Throughput-first — Thousands of simple cores, sacrificing single-thread latency for total throughput
  2. Latency hiding — Don’t reduce wait time; instead, fill the waiting with massive numbers of threads (warps). This requires the register file to permanently hold all warp states
  3. Memory hierarchy — Latency differs by hundreds of times from register to HBM; optimization = keep data in fast storage as much as possible

The next article will dive deep into the GPU’s most important specialized acceleration units — Tensor Core and XMX, understanding why they can make matrix multiplication an order of magnitude faster.