
Matrix Acceleration Units — Tensor Core and XMX


Updated 2026-04-06

In the GPU Architecture article, we learned about the Processing Block structure inside an SM — each Block has a Warp Scheduler, FP32 CUDA Cores, INT32 Cores, and a Tensor Core.

In that article, Tensor Core was just a name. This article answers the question: what is actually inside it, and why can it speed up matrix multiplication by an order of magnitude. We will also dive into Intel’s counterpart — XMX (Xe Matrix eXtensions) — to understand how the two camps achieve the same goal with different hardware.

Section 1: Why Dedicated Matrix Units Are Needed

Traditional CUDA Cores perform matrix multiplication as element-wise scalar multiply-accumulate. Take a 4×4 matrix multiply as an example:

  • C(4×4) = A(4×4) × B(4×4)
  • Each output element requires 4 multiplications + 3 additions = 7 scalar operations
  • 16 output elements × 7 = 112 scalar operations

A Tensor Core completes the entire matrix multiply-accumulate D = A·B + C with a single MMA (Matrix Multiply-Accumulate) instruction.

(Diagram: Matrix multiplication C = A × B (4×4). Each output element C[i][j] needs 4 multiplies + 3 adds = 7 operations; 16 output elements × 7 = 112 scalar operations (64 multiplies + 48 adds). CUDA Core: scalar ops one by one → 112 ops. Tensor Core: one MMA instruction.)
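
As a concrete reference, the scalar version of this 4×4 multiply can be written as a plain triple loop (a minimal C-style sketch; a Tensor Core replaces the entire loop nest with one MMA instruction):

// Scalar view of C = A × B for 4×4 matrices: 64 multiplies + 48 adds = 112 ops.
void matmul_4x4(const float A[4][4], const float B[4][4], float C[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = A[i][0] * B[0][j];        // 1 multiply
            for (int k = 1; k < 4; ++k)
                acc += A[i][k] * B[k][j];         // 3 more multiplies + 3 adds
            C[i][j] = acc;                        // 7 scalar ops per output element
        }
}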

The throughput gap is an order of magnitude: on H100 SXM, FP32 CUDA Core delivers about 67 TFLOPS, while FP16 Tensor Core delivers about 990 TFLOPS — roughly a 15x gap.

In AI training and inference, 90%+ of compute is matrix multiplication (QKV projections, attention scores, FFN are all GEMM). This is why dedicated matrix acceleration units are so important — they directly determine the performance ceiling of AI workloads.

So how do Tensor Core and XMX manage to complete a matrix multiply in one shot? The answer is the systolic array.


Section 2: Systolic Array — Data Pulsing Through the Array

Basic Concept

A Systolic Array is a computational grid composed of many simple PEs (Processing Elements). The core idea:

  • Data flows in from the edges, passing between PEs in fixed directions
  • Each PE performs one Multiply-Accumulate (MAC), then passes data to its neighbor
  • Data is reused across multiple PEs — a single input element passes through multiple PEs, participating in multiple computations
  • No need to read from memory every time — extremely high data reuse is the key to efficiency

The name comes from “systole” (heart contraction), describing how data pulses rhythmically through the array like blood flowing.

4×4 Systolic Array Animation

The animation below shows how a 4×4 output-stationary systolic array works. Rows of matrix A flow in from the left, columns of matrix B flow in from the top, and each PE accumulates its responsible output element.

(Animation: output-stationary systolic array (4×4). Each PE computes one element of the output matrix; rows of A flow in from the left and columns of B from the top. Inputs are staggered by row/column index so that A[i][k] and B[k][j] arrive at PE(i,j) at the same time for the same k; PE(i,j) processes the k-th input pair at cycle t = i+j+k, for 10 cycles in total (0–9).)

Key observations:

  • Wavefront diagonal — active PEs sweep like a wave from the top-left to the bottom-right
  • PE(i,j) processes the k-th input pair at cycle i+j+k — staggered inputs ensure data arrives simultaneously
  • A 4×4 matrix multiply takes 10 cycles (not 1), but the hardware can pipeline multiple matrix groups — see the timing sketch after this list
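
The staggered timing is easy to verify in software. Below is a minimal host-side sketch (plain C-style code, not vendor hardware) that replays the output-stationary schedule cycle by cycle:

// Toy cycle-by-cycle replay of a 4×4 output-stationary systolic array.
// PE(i,j) consumes the pair (A[i][k], B[k][j]) at cycle t = i + j + k,
// so the full multiply takes (4-1) + (4-1) + 4 = 10 cycles (t = 0..9).
void systolic_4x4(const int A[4][4], const int B[4][4], int C[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            C[i][j] = 0;
    for (int t = 0; t <= 9; ++t)                  // global clock
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                int k = t - i - j;                // which input pair reaches PE(i,j) now
                if (k >= 0 && k < 4)              // PE is active only on these cycles
                    C[i][j] += A[i][k] * B[k][j]; // one MAC per active PE per cycle
            }
}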

Why Not Just Use CUDA Cores in Parallel?

Seeing “10 cycles” you might ask: if we launch 16 CUDA Cores, each computing one element of C (4 MACs), wouldn’t that finish in just 4 cycles? Lower latency — so why do we need systolic arrays?

The advantage of systolic arrays is not latency, but memory access volume and area/energy efficiency:

Memory access comparison (4×4 matrix multiply):

| Approach | Computation | Memory Reads |
|---|---|---|
| 16 CUDA Cores in parallel | Each thread reads one row of A (4) + one column of B (4) | 16 × 8 = 128 reads |
| 4×4 Systolic Array | Each element of A flows in from the left and is reused by 4 PEs; same for B | 16 + 16 = 32 reads |

A and B together have only 32 unique elements, but 16 CUDA Cores each read independently, creating 4× redundant access.

You might think: use shared memory to cache tiles of A and B, so threads read from shared memory instead of repeatedly from global memory. This is indeed the standard GEMM optimization approach, but it introduces extra overhead:

// Inside a tiled GEMM kernel (indices simplified to a single TILE×TILE tile)
// Phase 1: All threads in the block cooperatively load a tile of A and B into shared memory
shared_A[ty][tx] = A[row][tx];
shared_B[ty][tx] = B[ty][col];
__syncthreads();   // ← barrier: the whole tile must be written before anyone reads it

// Phase 2: Each thread accumulates its output element from shared memory
for (int k = 0; k < TILE; ++k)
    sum += shared_A[ty][k] * shared_B[k][tx];

__syncthreads() is a barrier — all threads in the thread block must reach this line before any can proceed. Why is this needed? Because real GEMM thread blocks have far more than 16 threads — a typical configuration is 256 threads spread across 8 warps. The 32 threads within the same warp execute in lockstep, but different warps are independently scheduled by the warp scheduler, and their execution progress can differ entirely. Threads in Warp 0 may need to read data that Warp 3 wrote to shared memory, while Warp 3 may not have reached the write instruction yet. The barrier ensures all warps complete their writes before any begin reading.

Even with shared memory + barriers, all threads still perform numerous shared memory reads (just moved from global to shared memory), plus the synchronization overhead of barrier waits.

Systolic arrays need none of this — data passes between adjacent PEs on a fixed hardware-determined schedule. Reuse happens physically and naturally, with no synchronization instructions and no shared memory access.

For details on the relationship between thread blocks and warps, the scope of shared memory and registers, and how __syncthreads() works, see GPU Architecture — Hardware vs Software Abstractions and CUDA Programming Model.

Area and energy efficiency:

| | CUDA Core | Systolic Array PE |
|---|---|---|
| Components | Fetch, decode, register file, ALU, branch unit… | One MAC unit + registers |
| Capability | General-purpose: can execute any instruction | Specialized: multiply-accumulate only |
| Unit area | Large | Small (more units fit in the same area) |
| Data movement energy | Shared memory access ~5 pJ | Adjacent PE register pass ~0.5 pJ |

For the same chip area, systolic arrays can pack more MAC units and feed them with less bandwidth and energy. This is why Tensor Core throughput reaches ~15× that of CUDA Cores — not because a single operation is faster, but because specialized hardware is more efficient in both area and bandwidth.

The 4×4 example makes the gap look small, but at real-world sizes (e.g., 4096×4096 GEMM): the redundant memory accesses in the naive parallel approach grow proportionally with matrix dimensions, and bandwidth bottlenecks become the deciding constraint.

Dataflow Variants

Systolic arrays have multiple dataflow modes, differing in which matrix is “stationary” inside PEs and which “flows”:

(Diagram: Output-Stationary — the output matrix C stays in each PE and accumulates there while rows of A and columns of B flow through. Weight-Stationary — the weight matrix B is preloaded into the PEs, rows of A flow through, and partial sums propagate downward.)
  • Output-Stationary: Output matrix C stays fixed in PEs for accumulation, while A and B both flow through. The advantage is that partial sums don’t need to move
  • Weight-Stationary: Weight matrix B is preloaded into PEs, input A flows through, and partial sums propagate downward. Suitable for inference (weights are fixed)

The specific implementations of Tensor Core and XMX are proprietary, but both are essentially variants of systolic arrays.


Section 3: NVIDIA Tensor Core

MMA Operation

The core operation of a Tensor Core is matrix multiply-accumulate: D = A × B + C

  • The original conceptual size is 4×4, but from Volta to Blackwell the block sizes supported by a single instruction have grown larger
  • Size notation follows the BLAS GEMM convention: D(m×n) = A(m×k) × B(k×n) + C(m×n), where m is the number of output rows, n is the number of output columns, and k is the inner dimension (columns of A / rows of B). For example, m16n16k16 means A(16×16) × B(16×16) → D(16×16), and m16n8k16 means A(16×16) × B(16×8) → D(16×8)
  • Hopper (4th gen) wmma API commonly uses m16n16k16; at the PTX level there is a smaller m16n8k16
  • These are the matrix block sizes a single Tensor Core instruction can process. Larger matrix multiplies require software tiling — splitting the large matrix into these small blocks and invoking them repeatedly (see the loop sketch after this list)
  • Each SM has 4 Tensor Cores (one per Processing Block)
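
To make the tiling concrete, here is a scalar reference sketch of how a large GEMM decomposes into 16×16×16 blocks — the unit one Tensor Core instruction handles. The function name and the assumption that M, N, K are multiples of 16 are illustrative; a real kernel would replace the body of the bk loop with fragment loads and an MMA instruction:

// Scalar reference showing the block decomposition of D += A × B
// (row-major, D assumed zero-initialized, M/N/K assumed multiples of 16).
void tiled_gemm_reference(const float* A, const float* B, float* D,
                          int M, int N, int K) {
    for (int bi = 0; bi < M; bi += 16)            // 16-row block of D
        for (int bj = 0; bj < N; bj += 16)        // 16-column block of D
            for (int bk = 0; bk < K; bk += 16)    // 16-wide slice of the inner dimension
                // One m16n16k16 MMA instruction would cover this whole block update:
                // D[bi..bi+15][bj..bj+15] += A[bi..bi+15][bk..bk+15] × B[bk..bk+15][bj..bj+15]
                for (int i = 0; i < 16; ++i)
                    for (int j = 0; j < 16; ++j)
                        for (int k = 0; k < 16; ++k)
                            D[(bi + i) * N + (bj + j)] +=
                                A[(bi + i) * K + (bk + k)] * B[(bk + k) * N + (bj + j)];
}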

Precision Support Evolution

From Volta (2017) to Blackwell (2024), the precision formats supported by Tensor Cores have continuously expanded:

| Generation | Year | Newly added precisions |
|---|---|---|
| Volta (1st gen) | 2017 | FP16 |
| Turing (2nd gen) | 2018 | INT8, INT4, INT1 |
| Ampere (3rd gen) | 2020 | TF32, BF16, FP64 |
| Hopper (4th gen) | 2022 | FP8 |
| Blackwell (5th gen) | 2024 | FP4 |

Blackwell (5th gen) supports all of FP16, BF16, TF32, FP64, FP8, FP4, and INT8.

The trend is crystal clear: lower precision, higher throughput. FP8 throughput is 2x that of FP16, and FP4 doubles it again. This is why FP8 training (e.g., DeepSeek V3) and FP4 quantized inference are becoming trends.

Note: The INT4 and INT1 support from the Turing/Ampere era was removed after Hopper — real AI workloads prefer floating-point low-precision formats like FP8/FP4.

Warp-Level Operation

Tensor Core operations are not initiated by a single thread — they are warp-level cooperative operations. All 32 threads in a warp collectively hold fragments of the input matrices and issue the MMA instruction together.

(Diagram: Step 1 — each of the 32 threads in the warp holds matrix fragments in its registers: 8 FP16 elements of fragment A, 8 FP16 elements of fragment B, and 8 FP32 elements of the accumulator fragment C. The 32 threads together hold the complete 16×16 block, but each thread sees only its own portion; fragments are filled with wmma::load_matrix_sync(frag_a, A_ptr, lda) and wmma::load_matrix_sync(frag_b, B_ptr, ldb).)

Key takeaways:

  • A Fragment is a distributed representation of a matrix block across the registers of 32 threads — each thread holds only a portion
  • wmma::load_matrix_sync loads from memory into fragments, wmma::mma_sync executes the matrix multiply, and wmma::store_matrix_sync writes back to memory — a minimal kernel sketch follows this list
  • The wmma API operates on m16n16k16 blocks; the underlying PTX instruction mma.sync.aligned.m16n8k16 operates on smaller blocks, with the compiler handling the decomposition
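
Putting the three calls together, here is a minimal sketch of a warp-level wmma kernel that multiplies one 16×16 FP16 tile pair and accumulates into FP32 (launch with exactly one warp; the pointers and leading dimensions are illustrative):

#include <mma.h>
using namespace nvcuda;

// One warp (32 threads) computes C = A × B + 0 for a single 16×16 block.
// Requires a Volta-or-newer GPU (compile with -arch=sm_70 or later).
__global__ void wmma_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);           // all 32 threads cooperate
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core MMA: C = A·B + C
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}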

Section 4: Intel XMX

Intel’s matrix acceleration unit is called XMX (Xe Matrix eXtensions), a core component of the Xe2 architecture.

(Diagram: Intel Xe-Core internals — XMX matrix engine, Xe2 architecture (Lunar Lake / Panther Lake). Each Xe-Core contains 2 Vector Engines (FP32/FP16/INT, 8-wide SIMD), 8 XMX matrix engines, thread control / SIMD scheduling, 64 KB of SLM (Shared Local Memory), and an L1 / instruction cache. Each XMX unit is an 8×8 systolic array supporting FP16/BF16/TF32/INT8/INT4 and computing D(M×N) = A(M×K) × B(K×N) + C(M×N), with M/N/K depending on precision; programmed via SYCL joint_matrix or ESIMD.)

Key Specs (Xe2 / Lunar Lake)

  • Each Xe-Core contains 8 XMX units + 2 Vector Engines
  • XMX internally is an 8×8 systolic array
  • Supported precisions: FP16, BF16, TF32, INT8, INT4
  • Programming interfaces: SYCL joint_matrix (high-level) / ESIMD dpas (low-level)

dp4a vs dpas — two easily confused instructions:

  • dp4a (Dot Product of 4 elements and Accumulate): computes the dot product of 4 INT8 element pairs and accumulates into INT32 (acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]). Runs on the Vector Engine, not on XMX. Essentially a short-vector dot-product instruction, similar to NVIDIA’s __dp4a intrinsic (available since Pascal, sm_61). dp4a predates the Xe architecture
  • dpas (Dot Product Accumulate Systolic): a matrix instruction that drives the XMX systolic array, completing a small matrix block multiply-accumulate in a single instruction, analogous to NVIDIA’s mma.sync (Tensor Core instruction). Much higher throughput than dp4a because the entire 8×8 systolic array works simultaneously

In short: dp4a is a 4-element dot product executed on the Vector Engine; dpas is a matrix multiply-accumulate executed on the XMX systolic array.
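
For reference, here is what a dp4a-style instruction computes, shown with NVIDIA’s __dp4a intrinsic next to the equivalent scalar loop (a minimal CUDA sketch; Intel’s dp4a on the Vector Engine computes the same four-pair dot product):

// Reference loop: four INT8 pairs multiplied and accumulated into one INT32.
__device__ int dot4_reference(const signed char a[4], const signed char b[4], int acc) {
    for (int i = 0; i < 4; ++i)
        acc += int(a[i]) * int(b[i]);       // acc += a[0]*b[0] + ... + a[3]*b[3]
    return acc;
}

// Same operation as a single hardware instruction (sm_61 and later):
// each 32-bit word packs four signed 8-bit values.
__device__ int dot4_hw(int packed_a, int packed_b, int acc) {
    return __dp4a(packed_a, packed_b, acc);
}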

Architecture Comparison with NVIDIA

| NVIDIA | Intel | Description |
|---|---|---|
| SM | Xe-Core | Basic compute unit |
| CUDA Core | Vector Engine | Scalar/vector computation |
| Tensor Core | XMX | Matrix acceleration |
| Warp (32 threads) | Sub-group (8/16 wide) | SIMD execution unit |
| Shared Memory | SLM (Shared Local Memory) | On-chip shared storage within an SM / Xe-Core |
| wmma / mma.sync | joint_matrix / dpas | Matrix operation API |

The core idea is exactly the same — both use systolic arrays for D = A × B + C. The main differences are in scale (data center GPU vs. client iGPU) and programming model (CUDA’s warp-level vs. SYCL’s sub-group).


Section 5: Tensor Core vs XMX Detailed Comparison

| Comparison Dimension | NVIDIA Tensor Core (H100, Hopper) | Intel XMX (Xe2, Lunar Lake) |
|---|---|---|
| Vendor / Architecture | NVIDIA (Hopper, 4th gen) | Intel (Xe2, Lunar Lake) |
| Internal structure | Systolic array variant | Systolic array (8×8) |
| Core operation | D = A × B + C (MMA) | D = A × B + C (DPAS) |
| Matrix tile size (FP16) | m16n8k16 | m8n8k16 |
| Units per SM / Xe-Core | 4 Tensor Cores per SM | 8 XMX units per Xe-Core |
| FP16 / BF16 throughput | ~990 TFLOPS (H100 SXM) | ~48 TOPS (Lunar Lake iGPU) |
| FP8 support | Hopper+ (4th gen) | Xe2+ (Lunar Lake) |
| FP4 support | Blackwell+ (5th gen) | Not yet |
| TF32 support | Ampere+ (3rd gen) | Xe2+ |
| Programming API (high-level) | CUDA wmma / mma.sync | SYCL joint_matrix |
| Programming API (low-level) | PTX mma instruction | ESIMD dpas instruction |
| Warp / sub-group cooperation | 32-thread warp | 8/16-wide sub-group |
| Target scenario | Datacenter AI training/inference | Client AI inference (iGPU) |

Both share the same core design: a systolic array computing D = A × B + C. The main differences are scale (datacenter vs. client), matrix tile size, and programming interface.

The similarities between the two matter more than the differences:

  • Both are systolic array variants, both compute D = A·B + C
  • Both can only perform matrix multiplies of specific sizes and precisions — other operations still go through traditional CUDA Cores / Vector Engines
  • Both require software cooperation (tiling, data alignment) to achieve peak performance — this is the topic of the GEMM optimization article

The key differences are mainly in scale and target scenarios:

  • H100 Tensor Core peaks at ~990 TFLOPS (FP16), targeting data center AI training
  • Lunar Lake XMX peaks at ~48 TOPS (INT8), targeting client AI inference
  • NVIDIA’s 32-wide warp vs. Intel’s 8/16-wide sub-group affects the programming approach

Section 6: Cross-Vendor Standards — Subgroup and Cooperative Matrix

The wmma (NVIDIA) and joint_matrix (Intel SYCL) APIs introduced earlier are vendor-specific. Cross-platform standards like Vulkan / SPIR-V / OpenCL define unified abstractions that allow the same code to run on different vendors’ hardware.

Subgroup: The Cross-Vendor Term for Warp / Wave

A subgroup is a group of threads that execute in lockstep (SIMD/SIMT) on the hardware. Different vendors use different names for this concept:

| Vendor | Term | Typical Width |
|---|---|---|
| NVIDIA | Warp | 32 threads |
| AMD | Wave / Wavefront | 32 or 64 threads |
| Intel | Sub-group | 8 / 16 / 32 threads |
| Vulkan / SPIR-V | Subgroup | Hardware-dependent |

Threads within a subgroup can directly exchange register data without going through shared memory, supporting operations like:

  • Shuffle: Thread A directly reads thread B’s register value
  • Broadcast: One thread’s value is broadcast to all threads in the subgroup
  • Reduction: Direct sum / max within the subgroup

This is much faster than going through shared memory (store → barrier → load), and is a fundamental primitive for high-performance GPU kernels.
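
As an example of such a primitive, here is a warp-level sum reduction using CUDA’s shuffle intrinsic — a minimal sketch; SYCL and Vulkan expose the same idea as sub-group/subgroup reductions:

// Sum across the 32 lanes of a warp using register-to-register shuffles:
// no shared memory, no __syncthreads().
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);  // read a neighbor lane's register
    return val;   // lane 0 ends up holding the sum of all 32 lanes
}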

Cooperative Matrix: Cross-Vendor Programming Abstraction for Tensor Core / XMX

Earlier we saw that Tensor Core MMA operations are warp-level cooperative — 32 threads jointly hold matrix fragments and issue matrix multiply instructions together. Cooperative matrix standardizes this pattern of “multiple threads within a subgroup cooperating to hold and compute on a matrix”:

  • No single thread holds the complete matrix; matrix fragments are distributed across registers of threads within the subgroup
  • Hardware executes the matrix multiply-accumulate D = A × B + C at subgroup granularity
  • Maps to underlying hardware: NVIDIA Tensor Core, Intel XMX, AMD Matrix Core

API correspondence across layers:

| Abstraction Level | NVIDIA | Intel | Cross-Vendor Standard |
|---|---|---|---|
| High-level API | wmma::mma_sync | joint_matrix_mad | cooperativeMatrixMulAdd (Vulkan) |
| Load | wmma::load_matrix_sync | joint_matrix_load | cooperativeMatrixLoad |
| Store | wmma::store_matrix_sync | joint_matrix_store | cooperativeMatrixStore |
| Specification | CUDA wmma API | SYCL joint_matrix | VK_KHR_cooperative_matrix / SPV_KHR_cooperative_matrix |

Typical usage flow (using Vulkan cooperative matrix as example):

1. cooperativeMatrixLoad   — Load matrix fragment from VRAM/shared memory into subgroup registers
2. cooperativeMatrixMulAdd — Hardware executes D = A × B + C (one instruction for small matrix MMA)
3. cooperativeMatrixStore  — Write result back to VRAM/shared memory

Available matrix sizes depend on hardware and data types. Programs can query supported size combinations at runtime (e.g., NVIDIA FP16 commonly supports 16×16×16, Intel may support 8×16×32, etc.).

Why Cross-Vendor Standards Matter

Vendor-specific APIs (wmma, joint_matrix) are typically more mature and deeply optimized. But the value of Vulkan cooperative matrix lies in:

  • Portability: The same SPIR-V shader can run on NVIDIA, AMD, and Intel GPUs
  • Ecosystem unification: Inference frameworks (e.g., llama.cpp / GGML using the Vulkan backend) don’t need to maintain separate matrix multiply kernels for each vendor
  • Broad coverage: Vulkan supports desktop, mobile, embedded, and other platforms

In practice, the overhead of cross-vendor standards is usually small — the underlying driver maps cooperative matrix operations to hardware-native matrix instructions (Tensor Core HMMA, Intel DPAS, etc.).


Section 7: Dual-Pipe — Utilizing Both Compute Units Simultaneously

Tensor Core and CUDA Core are different functional units within an SM, with independent execution pipelines. The traditional approach is sequential execution: matrix multiply (Tensor Core) completes before element-wise operations (CUDA Core) begin.

Dual-Pipe optimization breaks this limitation: the input is split into multiple micro-batches, so while the Tensor Core processes batch B’s GEMM, the CUDA Core simultaneously processes batch A’s activation/normalization. There are no data dependencies between different batches, so they can safely overlap.

Serial Execution (baseline)
(Diagram: traditional serial execution — GEMM (Tensor Core) → element-wise (CUDA Core) → GEMM → … The CUDA Core sits idle during each GEMM, and the Tensor Core sits idle during each Act/Norm phase, so total time is the serial sum of GEMM + Act/Norm per layer and SM utilization stays low.)

Conditions for dual-pipe:

  • No data dependencies between different micro-batches — Tensor Core and CUDA Core process data from different batches
  • Careful scheduling of micro-batch order and data layout is required
  • Can significantly improve SM utilization, especially in architectures with many element-wise operations like MoE — a streams-based sketch of the overlap idea follows this list
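
One way to express this overlap at the software level is with two CUDA streams, so that the GEMM for micro-batch i+1 can run while the element-wise pass for micro-batch i is still executing. This is only an illustrative sketch: gemm_kernel, act_kernel, NUM_MB, the launch configurations, and the per-micro-batch buffers A, W, Y, Z are assumed to be defined elsewhere, and real dual-pipe schemes do finer-grained scheduling inside a single kernel:

// Two streams: one feeds the Tensor Core pipeline (GEMM), the other the
// CUDA Core pipeline (activation). Events enforce the per-micro-batch dependency.
cudaStream_t s_mma, s_ew;
cudaStreamCreate(&s_mma);
cudaStreamCreate(&s_ew);

cudaEvent_t gemm_done[NUM_MB];
for (int i = 0; i < NUM_MB; ++i) cudaEventCreate(&gemm_done[i]);

for (int i = 0; i < NUM_MB; ++i) {
    // Tensor Core pipeline: GEMM for micro-batch i
    gemm_kernel<<<grid_gemm, block_gemm, 0, s_mma>>>(A[i], W, Y[i]);
    cudaEventRecord(gemm_done[i], s_mma);

    // CUDA Core pipeline: activation for micro-batch i, only after its own GEMM.
    // While it runs, the GEMM for micro-batch i+1 can already start on s_mma.
    cudaStreamWaitEvent(s_ew, gemm_done[i], 0);
    act_kernel<<<grid_act, block_act, 0, s_ew>>>(Y[i], Z[i]);
}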

Summary

Core design principles of matrix acceleration units:

  1. Dedicated hardware for matrix multiply — Systolic arrays achieve single-instruction completion of full matrix multiply-accumulate through extremely high data reuse
  2. Trading precision for throughput — From FP16 to FP8 to FP4, each precision reduction doubles throughput, and AI workloads can tolerate lower precision
  3. Cooperative execution — Matrix operations are warp/sub-group-level collaborations requiring software cooperation (fragment management, tiling strategies)

The next article will take the programmer’s perspective — the CUDA programming model, covering the thread/block/grid hierarchy, shared memory usage, memory coalescing, and other key concepts. This knowledge is a prerequisite for writing high-performance GPU code (including leveraging Tensor Cores).