Matrix Acceleration Units — Tensor Core and XMX
Updated 2026-04-06
In the GPU Architecture article, we learned about the Processing Block structure inside an SM — each Block has a Warp Scheduler, FP32 CUDA Cores, INT32 Cores, and a Tensor Core.
In that article, Tensor Core was just a name. This article answers the question: what is actually inside it, and why can it speed up matrix multiplication by an order of magnitude. We will also dive into Intel’s counterpart — XMX (Xe Matrix eXtensions) — to understand how the two camps achieve the same goal with different hardware.
Section 1: Why Dedicated Matrix Units Are Needed
Traditional CUDA Cores perform matrix multiplication as element-wise scalar multiply-accumulate. Take a 4×4 matrix multiply as an example:
- C(4×4) = A(4×4) × B(4×4)
- Each output element requires 4 multiplications + 3 additions = 7 scalar operations
- 16 output elements × 7 = 112 scalar operations
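The arithmetic above can be checked with a quick tally (a sketch in Python; `naive_matmul_op_count` is an illustrative helper, not a library function):

```python
def naive_matmul_op_count(n):
    """Tally scalar operations for a naive n x n matrix multiply C = A x B."""
    muls = adds = 0
    for _ in range(n * n):   # one pass per output element
        muls += n            # n multiplications
        adds += n - 1        # n - 1 additions to sum the products
    return muls + adds

print(naive_matmul_op_count(4))  # 16 x (4 + 3) = 112
```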
A Tensor Core completes the entire matrix multiply-accumulate D = A·B + C with a single MMA (Matrix Multiply-Accumulate) instruction.
The throughput gap is an order of magnitude: on H100 SXM, FP32 CUDA Core delivers about 67 TFLOPS, while FP16 Tensor Core delivers about 990 TFLOPS — roughly a 15x gap.
In AI training and inference, 90%+ of compute is matrix multiplication (QKV projections, attention scores, FFN are all GEMM). This is why dedicated matrix acceleration units are so important — they directly determine the performance ceiling of AI workloads.
So how do Tensor Core and XMX manage to complete a matrix multiply in one shot? The answer is the systolic array.
Section 2: Systolic Array — Data Pulsing Through the Array
Basic Concept
A Systolic Array is a computational grid composed of many simple PEs (Processing Elements). The core idea:
- Data flows in from the edges, passing between PEs in fixed directions
- Each PE performs one Multiply-Accumulate (MAC), then passes data to its neighbor
- Data is reused across multiple PEs — a single input element passes through multiple PEs, participating in multiple computations
- No need to read from memory every time — extremely high data reuse is the key to efficiency
The name comes from “systole” (heart contraction), describing how data pulses rhythmically through the array like blood flowing.
4×4 Systolic Array Animation
The animation below shows how a 4×4 output-stationary systolic array works. Rows of matrix A flow in from the left, columns of matrix B flow in from the top, and each PE accumulates its responsible output element.
Key observations:
- Wavefront diagonal — active PEs sweep like a wave from the top-left to the bottom-right
- PE(i,j) processes the k-th input pair at cycle i+j+k — staggered inputs ensure data arrives simultaneously
- A 4×4 matrix multiply takes 10 cycles (not 1), but the hardware can pipeline multiple matrix groups
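The timing rule above can be checked with a small cycle model (a sketch assuming the idealized output-stationary schedule described here, where PE(i,j) consumes the k-th input pair at cycle i+j+k):

```python
def simulate_output_stationary(A, B):
    """Toy cycle model of a 4x4 output-stationary systolic array:
    PE(i, j) consumes the k-th input pair (A[i][k], B[k][j]) at cycle i + j + k."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    last_cycle = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]       # the MAC done by PE(i, j)
                last_cycle = max(last_cycle, i + j + k)
    return C, last_cycle + 1                       # cycles are 0-indexed

identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
C, cycles = simulate_output_stationary(identity, identity)
print(cycles)  # PE(3,3) consumes its last pair at cycle 3+3+3 = 9, so 10 cycles
```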
Why Not Just Use CUDA Cores in Parallel?
Seeing “10 cycles” you might ask: if we launch 16 CUDA Cores, each computing one element of C (4 MACs), wouldn’t that finish in just 4 cycles? Lower latency — so why do we need systolic arrays?
The advantage of systolic arrays is not latency, but memory access volume and area/energy efficiency:
Memory access comparison (4×4 matrix multiply):
| Approach | Computation | Memory Reads |
|---|---|---|
| 16 CUDA Cores in parallel | Each thread reads one row of A (4) + one column of B (4) | 16 × 8 = 128 reads |
| 4×4 Systolic Array | Each element of A flows in from the left, passing through 4 PEs for reuse; same for B | 16 + 16 = 32 reads |
A and B together have only 32 unique elements, but 16 CUDA Cores each read independently, creating 4× redundant access.
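The counts in the table follow from a quick tally (a sketch; `naive_reads` assumes each of the 16 threads independently reads a full row of A and a full column of B):

```python
def naive_reads(n):
    """n*n threads each read one row of A (n elements) plus one column of B (n)."""
    return n * n * (n + n)

def systolic_reads(n):
    """Each element of A and B enters the array exactly once from an edge."""
    return n * n + n * n

print(naive_reads(4), systolic_reads(4))  # 128 32
```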
You might think: use shared memory to cache tiles of A and B, so threads read from shared memory instead of repeatedly from global memory. This is indeed the standard GEMM optimization approach, but it introduces extra overhead:
```cuda
// Phase 1: all threads cooperatively load a tile into shared memory
shared_A[ty][tx] = A[row][tx];
shared_B[ty][tx] = B[ty][col];
__syncthreads(); // ← barrier synchronization

// Phase 2: read from shared memory for computation
sum += shared_A[ty][k] * shared_B[k][tx];
```
__syncthreads() is a barrier — all threads in the thread block must reach this line before any can proceed. Why is this needed? Because real GEMM thread blocks have far more than 16 threads — a typical configuration is 256 threads spread across 8 warps. The 32 threads within the same warp execute in lockstep, but different warps are independently scheduled by the warp scheduler, and their execution progress can differ entirely. Threads in Warp 0 may need to read data that Warp 3 wrote to shared memory, while Warp 3 may not have reached the write instruction yet. The barrier ensures all warps complete their writes before any begin reading.
Even with shared memory + barriers, all threads still perform numerous shared memory reads (just moved from global to shared memory), plus the synchronization overhead of barrier waits.
Systolic arrays need none of this — data passes between adjacent PEs on a fixed hardware-determined schedule. Reuse happens physically and naturally, with no synchronization instructions and no shared memory access.
For details on the relationship between thread blocks and warps, the scope of shared memory and registers, and how `__syncthreads()` works, see GPU Architecture — Hardware vs Software Abstractions and CUDA Programming Model.
Area and energy efficiency:
| | CUDA Core | Systolic Array PE |
|---|---|---|
| Components | Fetch, decode, register file, ALU, branch unit… | One MAC unit + registers |
| Capability | General-purpose: can execute any instruction | Specialized: multiply-accumulate only |
| Unit area | Large | Small (more units fit in same area) |
| Data movement energy | Shared memory access ~5 pJ | Adjacent PE register pass ~0.5 pJ |
For the same chip area, systolic arrays can pack more MAC units and feed them with less bandwidth and energy. This is why Tensor Core throughput reaches ~15× that of CUDA Cores — not because a single operation is faster, but because specialized hardware is more efficient in both area and bandwidth.
The 4×4 example makes the gap look small, but at real-world sizes (e.g., 4096×4096 GEMM): the redundant memory accesses in the naive parallel approach grow proportionally with matrix dimensions, and bandwidth bottlenecks become the deciding constraint.
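A back-of-envelope sketch shows how the redundancy scales under the same naive per-thread access model (illustrative helper, not from any library):

```python
def redundancy_factor(n):
    """Naive-parallel reads divided by unique input elements for an n x n GEMM."""
    naive = n * n * (2 * n)   # every thread re-reads a row of A and a column of B
    unique = 2 * n * n        # distinct elements in A and B combined
    return naive // unique    # simplifies to n: redundancy grows with the dimension

for size in (4, 256, 4096):
    print(size, redundancy_factor(size))
```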
Dataflow Variants
Systolic arrays have multiple dataflow modes, differing in which matrix is “stationary” inside PEs and which “flows”:
- Output-Stationary: Output matrix C stays fixed in PEs for accumulation, while A and B both flow through. The advantage is that partial sums don’t need to move
- Weight-Stationary: Weight matrix B is preloaded into PEs, input A flows through, and partial sums propagate downward. Suitable for inference (weights are fixed)
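The weight-stationary dataflow can be sketched the same way (a toy model assuming square matrices; `weight_stationary_matmul` is an illustrative name, not hardware code):

```python
def weight_stationary_matmul(A, B):
    """Toy model of a weight-stationary pass (square matrices for brevity):
    B[k][j] stays pinned in PE(k, j); rows of A stream through, and the
    partial sum for C[i][j] trickles down column j, adding A[i][k] * B[k][j]
    at each PE it passes."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):                  # stream one row of A at a time
        for j in range(n):
            partial = 0                 # partial sum entering the top of column j
            for k in range(n):          # PE(k, j) holds the stationary B[k][j]
                partial += A[i][k] * B[k][j]
            C[i][j] = partial           # finished sum exits the bottom of the column
    return C

print(weight_stationary_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```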
The specific implementations of Tensor Core and XMX are proprietary, but both are essentially variants of systolic arrays.
Section 3: NVIDIA Tensor Core
MMA Operation
The core operation of a Tensor Core is matrix multiply-accumulate: D = A × B + C
- The original conceptual size is 4×4, but from Volta to Blackwell, the actually supported block sizes have grown larger
- Size notation follows the BLAS GEMM convention `mMnNkK`, where m is the number of output rows, n is the number of output columns, and k is the inner dimension (columns of A / rows of B). For example, `m16n16k16` means A(16×16) × B(16×16) → D(16×16), and `m16n8k16` means A(16×16) × B(16×8) → D(16×8)
- Hopper (4th gen) wmma API commonly uses `m16n16k16`; at the PTX level there is a smaller `m16n8k16`
- These are the matrix block sizes a single Tensor Core instruction can process. Larger matrix multiplies require software tiling — splitting the large matrix into these small blocks and invoking them repeatedly
- Each SM has 4 Tensor Cores (one per Processing Block)
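The software tiling described above can be sketched as a toy nested-loop model (not real Tensor Core code; dimensions are assumed to divide evenly by the block size):

```python
def tiled_matmul(A, B, m=16, n=16, k=16):
    """Toy model of software tiling: the big product is built from repeated
    m x n x k block multiply-accumulates, each inner block standing in for one
    MMA instruction."""
    M, K, N = len(A), len(A[0]), len(B[0])
    D = [[0] * N for _ in range(M)]
    for bi in range(0, M, m):
        for bj in range(0, N, n):
            for bk in range(0, K, k):   # one block MMA: D_tile += A_tile x B_tile
                for i in range(m):
                    for j in range(n):
                        acc = 0
                        for kk in range(k):
                            acc += A[bi + i][bk + kk] * B[bk + kk][bj + j]
                        D[bi + i][bj + j] += acc
    return D

A = [[(i + j) % 5 for j in range(32)] for i in range(32)]
B = [[(i * j) % 3 for j in range(32)] for i in range(32)]
D = tiled_matmul(A, B)   # a 32x32 GEMM issued as eight 16x16x16 block MMAs
```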
Precision Support Evolution
From Volta (2017) to Blackwell (2024), the precision formats supported by Tensor Cores have continuously expanded:
The trend is crystal clear: lower precision, higher throughput. FP8 throughput is 2x that of FP16, and FP4 doubles it again. This is why FP8 training (e.g., DeepSeek V3) and FP4 quantized inference are becoming trends.
Note: The INT4 and INT1 support from the Turing/Ampere era was dropped starting with Hopper — real AI workloads prefer floating-point low-precision formats like FP8/FP4.
Warp-Level Operation
Tensor Core operations are not initiated by a single thread — they are warp-level cooperative operations. All 32 threads in a warp collectively hold fragments of the input matrices and issue the MMA instruction together.
Key takeaways:
- A Fragment is a distributed representation of a matrix block across the registers of 32 threads — each thread holds only a portion
- `wmma::load_matrix_sync` loads from memory into fragments, `wmma::mma_sync` executes the matrix multiply, `wmma::store_matrix_sync` writes back to memory
- The wmma API operates on m16n16k16 blocks; the underlying PTX instruction `mma.sync.aligned.m16n8k16` operates on smaller blocks, with the compiler handling the decomposition
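The fragment idea can be illustrated with a toy mapping (the real register layout is architecture-specific; this round-robin assignment is purely hypothetical and only shows the "each thread holds a portion" principle):

```python
def distribute_fragment(rows=16, cols=16, warp_size=32):
    """Hypothetical fragment layout: deal the tile's elements to the 32 threads
    round-robin. Real Tensor Core layouts differ and are architecture-specific."""
    per_thread = {t: [] for t in range(warp_size)}
    for idx in range(rows * cols):
        per_thread[idx % warp_size].append((idx // cols, idx % cols))
    return per_thread

frag = distribute_fragment()
print(len(frag[0]))  # 256 elements / 32 threads = 8 elements per thread
```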
Section 4: Intel XMX
Intel’s matrix acceleration unit is called XMX (Xe Matrix eXtensions), a core component of the Xe2 architecture.
Key Specs (Xe2 / Lunar Lake)
- Each Xe-Core contains 8 XMX units + 2 Vector Engines
- XMX internally is an 8×8 systolic array
- Supported precisions: FP16, BF16, TF32, INT8, INT4
- Programming interfaces: SYCL `joint_matrix` (high-level) / ESIMD `dpas` (low-level)
dp4a vs dpas — two easily confused instructions:
- dp4a (Dot Product of 4 elements and Accumulate): computes the dot product of 4 INT8 element pairs and accumulates into INT32 (`acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]`). Runs on the Vector Engine, not on XMX. Essentially a vector dot-product instruction, similar to NVIDIA's `__dp4a` intrinsic (available since Pascal). dp4a predates the Xe architecture
- dpas (Dot Product Accumulate Systolic): a matrix instruction that drives the XMX systolic array, completing a small matrix block multiply-accumulate in a single instruction, analogous to NVIDIA's `mma.sync` (Tensor Core instruction). Much higher throughput than dp4a because the entire 8×8 systolic array works simultaneously

In short: dp4a is a vector dot product (Vector Engine), dpas is a matrix multiply-accumulate (XMX systolic array).
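The contrast can be sketched numerically (toy Python models, not the actual instruction semantics; `dp4a` mirrors the accumulate formula quoted above, `dpas_like` models a small-block MMA):

```python
def dp4a(acc, a, b):
    """Model of dp4a: dot product of 4 element pairs accumulated into acc."""
    return acc + sum(x * y for x, y in zip(a[:4], b[:4]))

def dpas_like(C, A, B):
    """Model of a dpas-style block MMA: D = A x B + C in one 'instruction'."""
    inner, cols = len(B), len(B[0])
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)] for i in range(len(A))]

print(dp4a(10, [1, 2, 3, 4], [1, 1, 1, 1]))  # 10 + (1+2+3+4) = 20
```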
Architecture Comparison with NVIDIA
| NVIDIA | Intel | Description |
|---|---|---|
| SM | Xe-Core | Basic compute unit |
| CUDA Core | Vector Engine | Scalar/vector computation |
| Tensor Core | XMX | Matrix acceleration |
| Warp (32 threads) | Sub-group (8/16 wide) | SIMD execution unit |
| Shared Memory | SLM (Shared Local Memory) | On-chip shared storage within SM/Xe-Core |
| wmma / mma.sync | joint_matrix / dpas | Matrix operation API |
The core idea is exactly the same — both use systolic arrays for D = A × B + C. The main differences are in scale (data center GPU vs. client iGPU) and programming model (CUDA’s warp-level vs. SYCL’s sub-group).
Section 5: Tensor Core vs XMX Detailed Comparison
| Comparison Dimension | NVIDIA Tensor Core H100 Hopper | Intel XMX Xe2 Lunar Lake |
|---|---|---|
| Vendor / Architecture | NVIDIA (Hopper, 4th gen) | Intel (Xe2, Lunar Lake) |
| Internal Structure | Systolic Array variant | Systolic Array (8×8) |
| Core Operation | D = A × B + C (MMA) | D = A × B + C (DPAS) |
| Matrix Tile Size (FP16) | m16n8k16 | m8n8k16 |
| Count per SM/Xe-Core | 4 Tensor Core / SM | 8 XMX / Xe-Core |
| Peak Matrix Throughput | ~990 TFLOPS FP16 (H100 SXM) | ~48 TOPS INT8 (Lunar Lake iGPU) |
| FP8 Support | Hopper+ (4th gen) | Xe2+ (Lunar Lake) |
| FP4 Support | Blackwell+ (5th gen) | Not yet |
| TF32 Support | Ampere+ (3rd gen) | Xe2+ |
| Programming API (High) | CUDA wmma / mma.sync | SYCL joint_matrix |
| Programming API (Low) | PTX mma instruction | ESIMD dpas instruction |
| Warp/Sub-group Cooperation | 32-thread warp | 8/16-wide sub-group |
| Target Scenario | Datacenter AI training/inference | Client AI inference (iGPU) |
Both share core design: systolic array performs D=A×B+C. Main differences: scale (datacenter vs client), matrix tile size, and programming interface.
The similarities between the two matter more than the differences:
- Both are systolic array variants, both compute D = A·B + C
- Both can only perform matrix multiplies of specific sizes and precisions — other operations still go through traditional CUDA Cores / Vector Engines
- Both require software cooperation (tiling, data alignment) to achieve peak performance — this is the topic of the GEMM optimization article
The key differences are mainly in scale and target scenarios:
- H100 Tensor Core peaks at ~990 TFLOPS (FP16), targeting data center AI training
- Lunar Lake XMX peaks at ~48 TOPS (INT8), targeting client AI inference
- NVIDIA’s 32-wide warp vs. Intel’s 8/16-wide sub-group affects the programming approach
Section 6: Cross-Vendor Standards — Subgroup and Cooperative Matrix
The wmma (NVIDIA) and joint_matrix (Intel SYCL) APIs introduced earlier are vendor-specific. Cross-platform standards like Vulkan / SPIR-V / OpenCL define unified abstractions that allow the same code to run on different vendors’ hardware.
Subgroup: The Cross-Vendor Term for Warp / Wave
A subgroup is a group of threads that execute in lockstep (SIMD/SIMT) on the hardware. Different vendors use different names for this concept:
| Vendor | Term | Typical Width |
|---|---|---|
| NVIDIA | Warp | 32 threads |
| AMD | Wave / Wavefront | 32 or 64 threads |
| Intel | Sub-group | 8 / 16 / 32 threads |
| Vulkan / SPIR-V | Subgroup | Hardware-dependent |
Threads within a subgroup can directly exchange register data without going through shared memory, supporting operations like:
- Shuffle: Thread A directly reads thread B’s register value
- Broadcast: One thread’s value is broadcast to all threads in the subgroup
- Reduction: Direct sum / max within the subgroup
This is much faster than going through shared memory (store → barrier → load), and is a fundamental primitive for high-performance GPU kernels.
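A subgroup reduction can be modeled as a butterfly (XOR) shuffle, the pattern used by `__shfl_xor`-style reductions (a sketch assuming a power-of-two subgroup width):

```python
def subgroup_reduce_sum(values):
    """Model of a butterfly shuffle reduction across a subgroup: at each step,
    lane i adds in the value held by lane i ^ offset, with no shared memory."""
    lanes = list(values)
    width = len(lanes)                 # assume a power-of-two subgroup width
    offset = width // 2
    while offset > 0:
        lanes = [lanes[i] + lanes[i ^ offset] for i in range(width)]
        offset //= 2
    return lanes[0]                    # every lane ends up holding the full sum

print(subgroup_reduce_sum(range(32)))  # sum of 0..31 = 496
```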
Cooperative Matrix: Cross-Vendor Programming Abstraction for Tensor Core / XMX
Earlier we saw that Tensor Core MMA operations are warp-level cooperative — 32 threads jointly hold matrix fragments and issue matrix multiply instructions together. Cooperative matrix standardizes this pattern of “multiple threads within a subgroup cooperating to hold and compute on a matrix”:
- No single thread holds the complete matrix; matrix fragments are distributed across registers of threads within the subgroup
- Hardware executes matrix multiply-accumulate at subgroup granularity
- Maps to underlying hardware: NVIDIA Tensor Core, Intel XMX, AMD Matrix Core
API correspondence across layers:
| Abstraction Level | NVIDIA | Intel | Cross-Vendor Standard |
|---|---|---|---|
| High-level API | wmma::mma_sync | joint_matrix_mad | cooperativeMatrixMulAdd (Vulkan) |
| Load | wmma::load_matrix_sync | joint_matrix_load | cooperativeMatrixLoad |
| Store | wmma::store_matrix_sync | joint_matrix_store | cooperativeMatrixStore |
| Specification | CUDA wmma API | SYCL joint_matrix | VK_KHR_cooperative_matrix / SPV_KHR_cooperative_matrix |
Typical usage flow (using Vulkan cooperative matrix as example):
1. cooperativeMatrixLoad — Load matrix fragment from VRAM/shared memory into subgroup registers
2. cooperativeMatrixMulAdd — Hardware executes D = A × B + C (one instruction for small matrix MMA)
3. cooperativeMatrixStore — Write result back to VRAM/shared memory
Available matrix sizes depend on hardware and data types. Programs can query supported size combinations at runtime (e.g., NVIDIA FP16 commonly supports 16×16×16, Intel may support 8×16×32, etc.).
Why Cross-Vendor Standards Matter
Vendor-specific APIs (wmma, joint_matrix) are typically more mature and deeply optimized. But the value of Vulkan cooperative matrix lies in:
- Portability: The same SPIR-V shader can run on NVIDIA, AMD, and Intel GPUs
- Ecosystem unification: Inference frameworks (e.g., llama.cpp / GGML using the Vulkan backend) don’t need to maintain separate matrix multiply kernels for each vendor
- Broad coverage: Vulkan supports desktop, mobile, embedded, and other platforms
In practice, the overhead of cross-vendor standards is usually small — the underlying driver maps cooperative matrix operations to hardware-native matrix instructions (Tensor Core HMMA, Intel DPAS, etc.).
Section 7: Dual-Pipe — Utilizing Both Compute Units Simultaneously
Tensor Core and CUDA Core are different functional units within an SM, with independent execution pipelines. The traditional approach is sequential execution: matrix multiply (Tensor Core) completes before element-wise operations (CUDA Core) begin.
Dual-Pipe optimization breaks this limitation: the input is split into multiple micro-batches, so while the Tensor Core processes batch B’s GEMM, the CUDA Core simultaneously processes batch A’s activation/normalization. There are no data dependencies between different batches, so they can safely overlap.
Conditions for dual-pipe:
- No data dependencies between different micro-batches — Tensor Core and CUDA Core process data from different batches
- Careful scheduling of micro-batch order and data layout is required
- Can significantly improve SM utilization, especially in architectures with many element-wise operations like MoE
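The benefit of overlap can be estimated with a toy timeline model (illustrative cost units, not measured numbers; assumes the element-wise step is no longer than the GEMM step, so it hides completely behind the next batch's GEMM):

```python
def sequential_time(n_batches, gemm=3, elemwise=1):
    """Every micro-batch runs GEMM then element-wise work, back to back."""
    return n_batches * (gemm + elemwise)

def dual_pipe_time(n_batches, gemm=3, elemwise=1):
    """Dual-pipe overlap: while the Tensor Core runs batch i's GEMM, the CUDA
    Core runs batch i-1's element-wise work. GEMMs run back to back, and only
    the final batch's element-wise tail is left exposed."""
    return n_batches * gemm + elemwise

print(sequential_time(8), dual_pipe_time(8))  # 32 25
```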
Summary
Core design principles of matrix acceleration units:
- Dedicated hardware for matrix multiply — Systolic arrays achieve single-instruction completion of full matrix multiply-accumulate through extremely high data reuse
- Trading precision for throughput — From FP16 to FP8 to FP4, each precision reduction doubles throughput, and AI workloads can tolerate lower precision
- Cooperative execution — Matrix operations are warp/sub-group-level collaborations requiring software cooperation (fragment management, tiling strategies)
The next article will take the programmer’s perspective — the CUDA programming model, covering the thread/block/grid hierarchy, shared memory usage, memory coalescing, and other key concepts. This knowledge is a prerequisite for writing high-performance GPU code (including leveraging Tensor Cores).