
Matrix Acceleration Units — Tensor Core and XMX


Updated 2026-04-06

In the GPU Architecture article, we learned about the Processing Block structure inside an SM — each Block has a Warp Scheduler, FP32 CUDA Cores, INT32 Cores, and a Tensor Core.

In that article, Tensor Core was just a name. This article answers the question: what is actually inside it, and why can it speed up matrix multiplication by an order of magnitude. We will also dive into Intel’s counterpart — XMX (Xe Matrix eXtensions) — to understand how the two camps achieve the same goal with different hardware.

Section 1: Why Dedicated Matrix Units Are Needed

Traditional CUDA Cores perform matrix multiplication as element-wise scalar multiply-accumulate. Take a 4×4 matrix multiply as an example:

  • C(4×4) = A(4×4) × B(4×4)
  • Each output element requires 4 multiplications + 3 additions = 7 scalar operations
  • 16 output elements × 7 = 112 scalar operations

A Tensor Core completes the entire matrix multiply-accumulate D = A·B + C with a single MMA (Matrix Multiply-Accumulate) instruction.

(Diagram: Matrix multiplication C = A × B (4×4). Each output element C[i][j] needs 4 multiplies + 3 adds = 7 operations; 16 output elements × 7 = 112 scalar operations (64 multiplies + 48 adds). CUDA Core: scalar ops one by one → 112 ops. Tensor Core: one MMA instruction.)
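
As a concrete reference, the scalar version of this 4×4 multiply can be written as a plain triple loop (a minimal C-style sketch; a Tensor Core replaces the entire loop nest with one MMA instruction):

// Scalar view of C = A × B for 4×4 matrices: 64 multiplies + 48 adds = 112 ops.
void matmul_4x4(const float A[4][4], const float B[4][4], float C[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = A[i][0] * B[0][j];        // 1 multiply
            for (int k = 1; k < 4; ++k)
                acc += A[i][k] * B[k][j];         // 3 more multiplies + 3 adds
            C[i][j] = acc;                        // 7 scalar ops per output element
        }
}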

The throughput gap is an order of magnitude: on H100 SXM, FP32 CUDA Core delivers about 67 TFLOPS, while FP16 Tensor Core delivers about 990 TFLOPS — roughly a 15x gap.

In AI training and inference, 90%+ of compute is matrix multiplication (QKV projections, attention scores, FFN are all GEMM). This is why dedicated matrix acceleration units are so important — they directly determine the performance ceiling of AI workloads.

So how do Tensor Core and XMX manage to complete a matrix multiply in one shot? The answer is the systolic array.


Section 2: Systolic Array — Data Pulsing Through the Array

Basic Concept

A Systolic Array is a computational grid composed of many simple PEs (Processing Elements). The core idea:

  • Data flows in from the edges, passing between PEs in fixed directions
  • Each PE performs one Multiply-Accumulate (MAC), then passes data to its neighbor
  • Data is reused across multiple PEs — a single input element passes through multiple PEs, participating in multiple computations
  • No need to read from memory every time — extremely high data reuse is the key to efficiency

The name comes from “systole” (heart contraction), describing how data pulses rhythmically through the array like blood flowing.

4×4 Systolic Array Animation

The animation below shows how a 4×4 output-stationary systolic array works. Rows of matrix A flow in from the left, columns of matrix B flow in from the top, and each PE accumulates its responsible output element.

(Animation: output-stationary systolic array (4×4). Each PE computes one element of the output matrix; rows of A flow in from the left and columns of B from the top. Inputs are staggered by row/column index so that A[i][k] and B[k][j] arrive at PE(i,j) at the same time for the same k; PE(i,j) processes the k-th input pair at cycle t = i+j+k, for 10 cycles in total (0–9).)

Key observations:

  • Wavefront diagonal — active PEs sweep like a wave from the top-left to the bottom-right
  • PE(i,j) processes the k-th input pair at cycle i+j+k — staggered inputs ensure data arrives simultaneously
  • A 4×4 matrix multiply takes 10 cycles (not 1), but the hardware can pipeline multiple matrix groups — see the timing sketch after this list
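
The staggered timing is easy to verify in software. Below is a minimal host-side sketch (plain C-style code, not vendor hardware) that replays the output-stationary schedule cycle by cycle:

// Toy cycle-by-cycle replay of a 4×4 output-stationary systolic array.
// PE(i,j) consumes the pair (A[i][k], B[k][j]) at cycle t = i + j + k,
// so the full multiply takes (4-1) + (4-1) + 4 = 10 cycles (t = 0..9).
void systolic_4x4(const int A[4][4], const int B[4][4], int C[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            C[i][j] = 0;
    for (int t = 0; t <= 9; ++t)                  // global clock
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                int k = t - i - j;                // which input pair reaches PE(i,j) now
                if (k >= 0 && k < 4)              // PE is active only on these cycles
                    C[i][j] += A[i][k] * B[k][j]; // one MAC per active PE per cycle
            }
}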

Why Not Just Use CUDA Cores in Parallel?

Seeing “10 cycles” you might ask: if we launch 16 CUDA Cores, each computing one element of C (4 MACs), wouldn’t that finish in just 4 cycles? Lower latency — so why do we need systolic arrays?

The advantage of systolic arrays is not latency, but memory access volume and area/energy efficiency:

Memory access comparison (4×4 matrix multiply):

| Approach | Computation | Memory Reads |
|---|---|---|
| 16 CUDA Cores in parallel | Each thread reads one row of A (4) + one column of B (4) | 16 × 8 = 128 reads |
| 4×4 Systolic Array | Each element of A flows in from the left and is reused by 4 PEs; same for B | 16 + 16 = 32 reads |

A and B together have only 32 unique elements, but 16 CUDA Cores each read independently, creating 4× redundant access.

You might think: use shared memory to cache tiles of A and B, so threads read from shared memory instead of repeatedly from global memory. This is indeed the standard GEMM optimization approach, but it introduces extra overhead:

// Inside a tiled GEMM kernel (indices simplified to a single TILE×TILE tile)
// Phase 1: All threads in the block cooperatively load a tile of A and B into shared memory
shared_A[ty][tx] = A[row][tx];
shared_B[ty][tx] = B[ty][col];
__syncthreads();   // ← barrier: the whole tile must be written before anyone reads it

// Phase 2: Each thread accumulates its output element from shared memory
for (int k = 0; k < TILE; ++k)
    sum += shared_A[ty][k] * shared_B[k][tx];

__syncthreads() is a barrier — all threads in the thread block must reach this line before any can proceed. Why is this needed? Because real GEMM thread blocks have far more than 16 threads — a typical configuration is 256 threads spread across 8 warps. The 32 threads within the same warp execute in lockstep, but different warps are independently scheduled by the warp scheduler, and their execution progress can differ entirely. Threads in Warp 0 may need to read data that Warp 3 wrote to shared memory, while Warp 3 may not have reached the write instruction yet. The barrier ensures all warps complete their writes before any begin reading.

Even with shared memory + barriers, all threads still perform numerous shared memory reads (just moved from global to shared memory), plus the synchronization overhead of barrier waits.

Systolic arrays need none of this — data passes between adjacent PEs on a fixed hardware-determined schedule. Reuse happens physically and naturally, with no synchronization instructions and no shared memory access.

For details on the relationship between thread blocks and warps, the scope of shared memory and registers, and how __syncthreads() works, see GPU Architecture — Hardware vs Software Abstractions and CUDA Programming Model.

Area and energy efficiency:

| | CUDA Core | Systolic Array PE |
|---|---|---|
| Components | Fetch, decode, register file, ALU, branch unit… | One MAC unit + registers |
| Capability | General-purpose: can execute any instruction | Specialized: multiply-accumulate only |
| Unit area | Large | Small (more units fit in the same area) |
| Data movement energy | Shared memory access ~5 pJ | Adjacent PE register pass ~0.5 pJ |

For the same chip area, systolic arrays can pack more MAC units and feed them with less bandwidth and energy. This is why Tensor Core throughput reaches ~15× that of CUDA Cores — not because a single operation is faster, but because specialized hardware is more efficient in both area and bandwidth.

The 4×4 example makes the gap look small, but at real-world sizes (e.g., 4096×4096 GEMM): the redundant memory accesses in the naive parallel approach grow proportionally with matrix dimensions, and bandwidth bottlenecks become the deciding constraint.

Dataflow Variants

Systolic arrays have multiple dataflow modes, differing in which matrix is “stationary” inside PEs and which “flows”:

(Diagram: Output-Stationary — the output matrix C stays in each PE and accumulates there while rows of A and columns of B flow through. Weight-Stationary — the weight matrix B is preloaded into the PEs, rows of A flow through, and partial sums propagate downward.)
  • Output-Stationary: Output matrix C stays fixed in PEs for accumulation, while A and B both flow through. The advantage is that partial sums don’t need to move
  • Weight-Stationary: Weight matrix B is preloaded into PEs, input A flows through, and partial sums propagate downward. Suitable for inference (weights are fixed)

The specific implementations of Tensor Core and XMX are proprietary, but both are essentially variants of systolic arrays.


Section 3: NVIDIA Tensor Core

MMA Operation

The core operation of a Tensor Core is matrix multiply-accumulate: D = A × B + C

  • The original conceptual size is 4×4, but from Volta to Blackwell the block sizes supported by a single instruction have grown larger
  • Size notation follows the BLAS GEMM convention: D(m×n) = A(m×k) × B(k×n) + C(m×n), where m is the number of output rows, n is the number of output columns, and k is the inner dimension (columns of A / rows of B). For example, m16n16k16 means A(16×16) × B(16×16) → D(16×16), and m16n8k16 means A(16×16) × B(16×8) → D(16×8)
  • Hopper (4th gen) wmma API commonly uses m16n16k16; at the PTX level there is a smaller m16n8k16
  • These are the matrix block sizes a single Tensor Core instruction can process. Larger matrix multiplies require software tiling — splitting the large matrix into these small blocks and invoking them repeatedly (see the loop sketch after this list)
  • Each SM has 4 Tensor Cores (one per Processing Block)
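
To make the tiling concrete, here is a scalar reference sketch of how a large GEMM decomposes into 16×16×16 blocks — the unit one Tensor Core instruction handles. The function name and the assumption that M, N, K are multiples of 16 are illustrative; a real kernel would replace the body of the bk loop with fragment loads and an MMA instruction:

// Scalar reference showing the block decomposition of D += A × B
// (row-major, D assumed zero-initialized, M/N/K assumed multiples of 16).
void tiled_gemm_reference(const float* A, const float* B, float* D,
                          int M, int N, int K) {
    for (int bi = 0; bi < M; bi += 16)            // 16-row block of D
        for (int bj = 0; bj < N; bj += 16)        // 16-column block of D
            for (int bk = 0; bk < K; bk += 16)    // 16-wide slice of the inner dimension
                // One m16n16k16 MMA instruction would cover this whole block update:
                // D[bi..bi+15][bj..bj+15] += A[bi..bi+15][bk..bk+15] × B[bk..bk+15][bj..bj+15]
                for (int i = 0; i < 16; ++i)
                    for (int j = 0; j < 16; ++j)
                        for (int k = 0; k < 16; ++k)
                            D[(bi + i) * N + (bj + j)] +=
                                A[(bi + i) * K + (bk + k)] * B[(bk + k) * N + (bj + j)];
}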

Precision Support Evolution

From Volta (2017) to Blackwell (2024), the precision formats supported by Tensor Cores have continuously expanded:

| Generation | Year | Newly added precisions |
|---|---|---|
| Volta (1st gen) | 2017 | FP16 |
| Turing (2nd gen) | 2018 | INT8, INT4, INT1 |
| Ampere (3rd gen) | 2020 | TF32, BF16, FP64 |
| Hopper (4th gen) | 2022 | FP8 |
| Blackwell (5th gen) | 2024 | FP4 |

Blackwell (5th gen) supports all of FP16, BF16, TF32, FP64, FP8, FP4, and INT8.

The trend is crystal clear: lower precision, higher throughput. FP8 throughput is 2x that of FP16, and FP4 doubles it again. This is why FP8 training (e.g., DeepSeek V3) and FP4 quantized inference are becoming trends.

Note: The INT4 and INT1 support from the Turing/Ampere era was removed after Hopper — real AI workloads prefer floating-point low-precision formats like FP8/FP4.

Warp-Level Operation

Tensor Core operations are not initiated by a single thread — they are warp-level cooperative operations. All 32 threads in a warp collectively hold fragments of the input matrices and issue the MMA instruction together.

(Diagram: Step 1 — each of the 32 threads in the warp holds matrix fragments in its registers: 8 FP16 elements of fragment A, 8 FP16 elements of fragment B, and 8 FP32 elements of the accumulator fragment C. The 32 threads together hold the complete 16×16 block, but each thread sees only its own portion; fragments are filled with wmma::load_matrix_sync(frag_a, A_ptr, lda) and wmma::load_matrix_sync(frag_b, B_ptr, ldb).)

Key takeaways:

  • A Fragment is a distributed representation of a matrix block across the registers of 32 threads — each thread holds only a portion
  • wmma::load_matrix_sync loads from memory into fragments, wmma::mma_sync executes the matrix multiply, and wmma::store_matrix_sync writes back to memory — a minimal kernel sketch follows this list
  • The wmma API operates on m16n16k16 blocks; the underlying PTX instruction mma.sync.aligned.m16n8k16 operates on smaller blocks, with the compiler handling the decomposition
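
Putting the three calls together, here is a minimal sketch of a warp-level wmma kernel that multiplies one 16×16 FP16 tile pair and accumulates into FP32 (launch with exactly one warp; the pointers and leading dimensions are illustrative):

#include <mma.h>
using namespace nvcuda;

// One warp (32 threads) computes C = A × B + 0 for a single 16×16 block.
// Requires a Volta-or-newer GPU (compile with -arch=sm_70 or later).
__global__ void wmma_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);           // all 32 threads cooperate
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core MMA: C = A·B + C
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}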

Section 4: Intel XMX

Intel’s matrix acceleration unit is called XMX (Xe Matrix eXtensions), a core component of the Xe2 architecture.

(Diagram: Intel Xe-Core internals — XMX matrix engine, Xe2 architecture (Lunar Lake / Panther Lake). Each Xe-Core contains 2 Vector Engines (FP32/FP16/INT, 8-wide SIMD), 8 XMX matrix engines, thread control / SIMD scheduling, 64 KB of SLM (Shared Local Memory), and an L1 / instruction cache. Each XMX unit is an 8×8 systolic array supporting FP16/BF16/TF32/INT8/INT4 and computing D(M×N) = A(M×K) × B(K×N) + C(M×N), with M/N/K depending on precision; programmed via SYCL joint_matrix or ESIMD.)

Key Specs (Xe2 / Lunar Lake)

  • Each Xe-Core contains 8 XMX units + 2 Vector Engines
  • XMX internally is an 8×8 systolic array
  • Supported precisions: FP16, BF16, TF32, INT8, INT4
  • Programming interfaces: SYCL joint_matrix (high-level) / ESIMD dpas (low-level)

dp4a vs dpas — two easily confused instructions:

  • dp4a (Dot Product of 4 elements and Accumulate): computes the dot product of 4 INT8 element pairs and accumulates into INT32 (acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]). Runs on the Vector Engine, not on XMX. Essentially a short-vector dot-product instruction, similar to NVIDIA’s __dp4a intrinsic (available since Pascal, sm_61). dp4a predates the Xe architecture
  • dpas (Dot Product Accumulate Systolic): a matrix instruction that drives the XMX systolic array, completing a small matrix block multiply-accumulate in a single instruction, analogous to NVIDIA’s mma.sync (Tensor Core instruction). Much higher throughput than dp4a because the entire 8×8 systolic array works simultaneously

In short: dp4a is a 4-element dot product executed on the Vector Engine; dpas is a matrix multiply-accumulate executed on the XMX systolic array.
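
For reference, here is what a dp4a-style instruction computes, shown with NVIDIA’s __dp4a intrinsic next to the equivalent scalar loop (a minimal CUDA sketch; Intel’s dp4a on the Vector Engine computes the same four-pair dot product):

// Reference loop: four INT8 pairs multiplied and accumulated into one INT32.
__device__ int dot4_reference(const signed char a[4], const signed char b[4], int acc) {
    for (int i = 0; i < 4; ++i)
        acc += int(a[i]) * int(b[i]);       // acc += a[0]*b[0] + ... + a[3]*b[3]
    return acc;
}

// Same operation as a single hardware instruction (sm_61 and later):
// each 32-bit word packs four signed 8-bit values.
__device__ int dot4_hw(int packed_a, int packed_b, int acc) {
    return __dp4a(packed_a, packed_b, acc);
}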

Architecture Comparison with NVIDIA

| NVIDIA | Intel | Description |
|---|---|---|
| SM | Xe-Core | Basic compute unit |
| CUDA Core | Vector Engine | Scalar/vector computation |
| Tensor Core | XMX | Matrix acceleration |
| Warp (32 threads) | Sub-group (8/16 wide) | SIMD execution unit |
| Shared Memory | SLM (Shared Local Memory) | On-chip shared storage within an SM / Xe-Core |
| wmma / mma.sync | joint_matrix / dpas | Matrix operation API |

The core idea is exactly the same — both use systolic arrays for D = A × B + C. The main differences are in scale (data center GPU vs. client iGPU) and programming model (CUDA’s warp-level vs. SYCL’s sub-group).


Section 5: Tensor Core vs XMX Detailed Comparison

| Comparison Dimension | NVIDIA Tensor Core (H100, Hopper) | Intel XMX (Xe2, Lunar Lake) |
|---|---|---|
| Vendor / Architecture | NVIDIA (Hopper, 4th gen) | Intel (Xe2, Lunar Lake) |
| Internal structure | Systolic array variant | Systolic array (8×8) |
| Core operation | D = A × B + C (MMA) | D = A × B + C (DPAS) |
| Matrix tile size (FP16) | m16n8k16 | m8n8k16 |
| Units per SM / Xe-Core | 4 Tensor Cores per SM | 8 XMX units per Xe-Core |
| FP16 / BF16 throughput | ~990 TFLOPS (H100 SXM) | ~48 TOPS (Lunar Lake iGPU) |
| FP8 support | Hopper+ (4th gen) | Xe2+ (Lunar Lake) |
| FP4 support | Blackwell+ (5th gen) | Not yet |
| TF32 support | Ampere+ (3rd gen) | Xe2+ |
| Programming API (high-level) | CUDA wmma / mma.sync | SYCL joint_matrix |
| Programming API (low-level) | PTX mma instruction | ESIMD dpas instruction |
| Warp / sub-group cooperation | 32-thread warp | 8/16-wide sub-group |
| Target scenario | Datacenter AI training/inference | Client AI inference (iGPU) |

Both share the same core design: a systolic array computing D = A × B + C. The main differences are scale (datacenter vs. client), matrix tile size, and programming interface.

The similarities between the two matter more than the differences:

  • Both are systolic array variants, both compute D = A·B + C
  • Both can only perform matrix multiplies of specific sizes and precisions — other operations still go through traditional CUDA Cores / Vector Engines
  • Both require software cooperation (tiling, data alignment) to achieve peak performance — this is the topic of the GEMM optimization article

The key differences are mainly in scale and target scenarios:

  • H100 Tensor Core peaks at ~990 TFLOPS (FP16), targeting data center AI training
  • Lunar Lake XMX peaks at ~48 TOPS (INT8), targeting client AI inference
  • NVIDIA’s 32-wide warp vs. Intel’s 8/16-wide sub-group affects the programming approach

Section 6: Cross-Vendor Standards — Subgroup and Cooperative Matrix

The wmma (NVIDIA) and joint_matrix (Intel SYCL) APIs introduced earlier are vendor-specific. Cross-platform standards like Vulkan / SPIR-V / OpenCL define unified abstractions that allow the same code to run on different vendors’ hardware.

Subgroup: The Cross-Vendor Term for Warp / Wave

A subgroup is a group of threads that execute in lockstep (SIMD/SIMT) on the hardware. Different vendors use different names for this concept:

| Vendor | Term | Typical Width |
|---|---|---|
| NVIDIA | Warp | 32 threads |
| AMD | Wave / Wavefront | 32 or 64 threads |
| Intel | Sub-group | 8 / 16 / 32 threads |
| Vulkan / SPIR-V | Subgroup | Hardware-dependent |

Threads within a subgroup can directly exchange register data without going through shared memory, supporting operations like:

  • Shuffle: Thread A directly reads thread B’s register value
  • Broadcast: One thread’s value is broadcast to all threads in the subgroup
  • Reduction: Direct sum / max within the subgroup

This is much faster than going through shared memory (store → barrier → load), and is a fundamental primitive for high-performance GPU kernels.
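
As an example of such a primitive, here is a warp-level sum reduction using CUDA’s shuffle intrinsic — a minimal sketch; SYCL and Vulkan expose the same idea as sub-group/subgroup reductions:

// Sum across the 32 lanes of a warp using register-to-register shuffles:
// no shared memory, no __syncthreads().
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);  // read a neighbor lane's register
    return val;   // lane 0 ends up holding the sum of all 32 lanes
}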

Cooperative Matrix: Cross-Vendor Programming Abstraction for Tensor Core / XMX

Earlier we saw that Tensor Core MMA operations are warp-level cooperative — 32 threads jointly hold matrix fragments and issue matrix multiply instructions together. Cooperative matrix standardizes this pattern of “multiple threads within a subgroup cooperating to hold and compute on a matrix”:

  • No single thread holds the complete matrix; matrix fragments are distributed across registers of threads within the subgroup
  • Hardware executes the matrix multiply-accumulate D = A × B + C at subgroup granularity
  • Maps to underlying hardware: NVIDIA Tensor Core, Intel XMX, AMD Matrix Core

API correspondence across layers:

| Abstraction Level | NVIDIA | Intel | Cross-Vendor Standard |
|---|---|---|---|
| High-level API | wmma::mma_sync | joint_matrix_mad | cooperativeMatrixMulAdd (Vulkan) |
| Load | wmma::load_matrix_sync | joint_matrix_load | cooperativeMatrixLoad |
| Store | wmma::store_matrix_sync | joint_matrix_store | cooperativeMatrixStore |
| Specification | CUDA wmma API | SYCL joint_matrix | VK_KHR_cooperative_matrix / SPV_KHR_cooperative_matrix |

Typical usage flow (using Vulkan cooperative matrix as example):

1. cooperativeMatrixLoad   — Load matrix fragment from VRAM/shared memory into subgroup registers
2. cooperativeMatrixMulAdd — Hardware executes D = A × B + C (one instruction for small matrix MMA)
3. cooperativeMatrixStore  — Write result back to VRAM/shared memory

Available matrix sizes depend on hardware and data types. Programs can query supported size combinations at runtime (e.g., NVIDIA FP16 commonly supports 16×16×16, Intel may support 8×16×32, etc.).

Why Cross-Vendor Standards Matter

Vendor-specific APIs (wmma, joint_matrix) are typically more mature and deeply optimized. But the value of Vulkan cooperative matrix lies in:

  • Portability: The same SPIR-V shader can run on NVIDIA, AMD, and Intel GPUs
  • Ecosystem unification: Inference frameworks (e.g., llama.cpp / GGML using the Vulkan backend) don’t need to maintain separate matrix multiply kernels for each vendor
  • Broad coverage: Vulkan supports desktop, mobile, embedded, and other platforms

In practice, the overhead of cross-vendor standards is usually small — the underlying driver maps cooperative matrix operations to hardware-native matrix instructions (Tensor Core HMMA, Intel DPAS, etc.).


Section 7: Dual-Pipe — Utilizing Both Compute Units Simultaneously

Tensor Core and CUDA Core are different functional units within an SM, with independent execution pipelines. The traditional approach is sequential execution: matrix multiply (Tensor Core) completes before element-wise operations (CUDA Core) begin.

Dual-Pipe optimization breaks this limitation: the input is split into multiple micro-batches, so while the Tensor Core processes batch B’s GEMM, the CUDA Core simultaneously processes batch A’s activation/normalization. There are no data dependencies between different batches, so they can safely overlap.

Serial Execution (baseline)
(Diagram: traditional serial execution — GEMM (Tensor Core) → element-wise (CUDA Core) → GEMM → … The CUDA Core sits idle during each GEMM, and the Tensor Core sits idle during each Act/Norm phase, so total time is the serial sum of GEMM + Act/Norm per layer and SM utilization stays low.)

Conditions for dual-pipe:

  • No data dependencies between different micro-batches — Tensor Core and CUDA Core process data from different batches
  • Careful scheduling of micro-batch order and data layout is required
  • Can significantly improve SM utilization, especially in architectures with many element-wise operations like MoE — a streams-based sketch of the overlap idea follows this list
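
One way to express this overlap at the software level is with two CUDA streams, so that the GEMM for micro-batch i+1 can run while the element-wise pass for micro-batch i is still executing. This is only an illustrative sketch: gemm_kernel, act_kernel, NUM_MB, the launch configurations, and the per-micro-batch buffers A, W, Y, Z are assumed to be defined elsewhere, and real dual-pipe schemes do finer-grained scheduling inside a single kernel:

// Two streams: one feeds the Tensor Core pipeline (GEMM), the other the
// CUDA Core pipeline (activation). Events enforce the per-micro-batch dependency.
cudaStream_t s_mma, s_ew;
cudaStreamCreate(&s_mma);
cudaStreamCreate(&s_ew);

cudaEvent_t gemm_done[NUM_MB];
for (int i = 0; i < NUM_MB; ++i) cudaEventCreate(&gemm_done[i]);

for (int i = 0; i < NUM_MB; ++i) {
    // Tensor Core pipeline: GEMM for micro-batch i
    gemm_kernel<<<grid_gemm, block_gemm, 0, s_mma>>>(A[i], W, Y[i]);
    cudaEventRecord(gemm_done[i], s_mma);

    // CUDA Core pipeline: activation for micro-batch i, only after its own GEMM.
    // While it runs, the GEMM for micro-batch i+1 can already start on s_mma.
    cudaStreamWaitEvent(s_ew, gemm_done[i], 0);
    act_kernel<<<grid_act, block_act, 0, s_ew>>>(Y[i], Z[i]);
}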

Summary

Core design principles of matrix acceleration units:

  1. Dedicated hardware for matrix multiply — Systolic arrays achieve single-instruction completion of full matrix multiply-accumulate through extremely high data reuse
  2. Trading precision for throughput — From FP16 to FP8 to FP4, each precision reduction doubles throughput, and AI workloads can tolerate lower precision
  3. Cooperative execution — Matrix operations are warp/sub-group-level collaborations requiring software cooperation (fragment management, tiling strategies)

The next article will take the programmer’s perspective — the CUDA programming model, covering the thread/block/grid hierarchy, shared memory usage, memory coalescing, and other key concepts. This knowledge is a prerequisite for writing high-performance GPU code (including leveraging Tensor Cores).