
Xe2 GPU Architecture

Updated 2026-04-06

Introduction

Intel Xe2 GPU architecture represents Intel’s latest breakthrough in high-performance graphics and compute, succeeding the Xe-HPG and Xe-HPC architectures. Starting with the Lunar Lake mobile platform in 2024, Xe2 first entered the market as an integrated GPU (iGPU), and was further enhanced in Panther Lake in 2025. Unlike discrete GPUs, the iGPU shares the same chip package with the CPU, bringing unique advantages and challenges.

For AI and machine learning developers, understanding the Xe2 architecture is crucial. Although the absolute compute power of Intel iGPUs cannot rival NVIDIA’s high-end GPUs, they offer three key advantages: ubiquity (nearly every laptop has one), zero additional cost (no need to purchase a discrete GPU), and unified memory access (CPU and GPU share physical memory with no explicit data copies required). This makes Intel iGPUs an ideal choice for local development, prototype validation, and edge inference scenarios.

This article will provide an in-depth analysis of the Xe2 architecture’s hierarchical structure, execution unit internals, memory subsystem, and conceptual mapping to the NVIDIA CUDA architecture. We will also compare the specifications of Lunar Lake and Panther Lake to help you understand the evolution of the Intel GPU ecosystem.

Evolution from Xe to Xe2

Intel’s Xe architecture family has gone through several generations. The earliest Xe-LP (Gen12) primarily targeted lightweight graphics workloads, followed by Xe-HPG (Alchemist, Arc A series) for high-performance gaming and content creation, and Xe-HPC (Ponte Vecchio) for data center-scale massively parallel computing. Xe2 unifies these evolutionary paths, significantly enhancing AI inference capabilities while improving energy efficiency.

Key improvements in Xe2 include: major upgrades to the XMX (Xe Matrix Extension) engine, supporting higher-throughput INT8, BF16, and FP16 matrix operations; larger Shared Local Memory (SLM), reducing global memory access latency; improved thread scheduler, supporting finer-grained concurrency control; and enhanced vector engine, providing higher SIMD throughput.

In Lunar Lake, Xe2 debuted as an iGPU with 8 Xe-cores (128 EUs), achieving a peak AI compute of 67 TOPS (INT8). Panther Lake further scales up, expected to include 10 Xe-cores (160 EUs) with peak compute exceeding 96 TOPS, along with significant improvements in memory bandwidth and cache capacity.
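
As a sanity check on these headline figures (an illustrative decomposition assuming the per-EU XMX throughput described below and a clock of roughly 2 GHz, not an official breakdown): 128 EUs x 128 INT8 MACs/cycle x 2 ops per multiply-accumulate x ~2.05 GHz ≈ 67 TOPS. The Panther Lake number follows the same arithmetic with 160 EUs and a somewhat higher effective clock.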

Xe2 Architecture Hierarchy

Xe2 Microarchitecture Hierarchy (Lunar Lake example)

The Xe2 architecture employs a clear four-level hierarchy, from top to bottom: GPU -> Slice -> Xe-core -> EU (Execution Unit). Understanding this hierarchy is fundamental to writing efficient GPU code, as it directly corresponds to the hardware’s parallel organization.

GPU is the top-level unit, corresponding to a complete graphics processor. In the Lunar Lake iGPU, the GPU contains 1 Slice. Higher-end discrete or data center GPUs may contain multiple Slices for greater parallelism.

Slice is the primary functional unit of the GPU, containing a group of Xe-cores, shared L2 cache, and memory controllers. The single Slice in Lunar Lake contains 8 Xe-cores sharing 4MB of L2 cache. The Slice is the basic unit of resource allocation; the operating system and drivers typically schedule tasks at Slice granularity.

Xe-core is the core compute unit. Each Xe-core contains 16 EUs, 64KB of Shared Local Memory (SLM), and L1 cache. All EUs within an Xe-core share the SLM, enabling threads within the same work-group to collaborate efficiently. In the SYCL or Level Zero programming model, a work-group typically maps to a single Xe-core.
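
These per-Xe-core limits are directly queryable from SYCL. The sketch below is a minimal device query; the exact values reported depend on the driver, and the compute-unit count may be expressed in EUs or Xe-cores depending on the implementation:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

// Minimal SYCL device query: surfaces the limits that the Xe-core hierarchy
// imposes on work-group sizing and SLM budgeting.
int main() {
    sycl::queue q{sycl::gpu_selector_v};
    auto dev = q.get_device();

    std::cout << "Device: "
              << dev.get_info<sycl::info::device::name>() << "\n"
              // SLM budget for one work-group (64KB per Xe-core on Lunar Lake)
              << "Local memory: "
              << dev.get_info<sycl::info::device::local_mem_size>() << " bytes\n"
              << "Max work-group size: "
              << dev.get_info<sycl::info::device::max_work_group_size>() << "\n"
              // Reported granularity varies by driver (EUs or Xe-cores)
              << "Compute units: "
              << dev.get_info<sycl::info::device::max_compute_units>() << "\n";
}
```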

EU (Execution Unit) is the smallest execution unit, containing a Vector Engine, XMX matrix engine, General Register File (GRF), and 8 hardware thread slots. Each EU can execute multiple threads concurrently, using rapid context switching to hide memory access latency.

This hierarchy is similar to but not identical to NVIDIA GPUs. NVIDIA GPUs consist of multiple SMs (Streaming Multiprocessors), each containing multiple CUDA Cores. Intel’s Xe-core roughly corresponds to NVIDIA’s SM, while the EU corresponds to the CUDA Core concept (though with richer functionality). Understanding these correspondences helps when porting CUDA code to Intel GPUs.

EU Internal Structure

The Execution Unit (EU) is the smallest unit that actually performs computation in the Xe2 architecture. Its internal structure determines the GPU’s instruction set and performance characteristics. Each EU contains four main components: Vector Engine, XMX Engine, GRF (General Register File), and Thread Slots.

Vector Engine is the traditional SIMD execution unit, supporting SIMD8-width vector operations. It can perform the same operation on 8 data elements in a single clock cycle, supporting FP32, FP16, INT32, and other data types. The Vector Engine is the core of general-purpose computation, handling scalar operations, vector addition, multiplication, logical operations, and more. In Xe2, the Vector Engine’s FP32 throughput is 8 ops/cycle, while FP16 doubles to 16 ops/cycle.

XMX (Xe Matrix Extension) Engine is the AI acceleration core of Xe2, specifically optimized for matrix multiplication. It supports systolic array-style dataflow, capable of completing multiple multiply-accumulate operations in a single cycle. Key XMX specifications include: INT8 at 128 ops/cycle, BF16 and FP16 at 64 ops/cycle. This makes XMX several times more efficient than the Vector Engine in inference and training scenarios. XMX corresponds to NVIDIA’s Tensor Core, but with a different programming interface — Intel uses DPAS (Dot Product Accumulate Systolic) instructions or high-level libraries (such as oneDNN) for invocation.
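
In practice most developers reach XMX through libraries rather than raw DPAS instructions. The sketch below, assuming oneDNN v3.x with a GPU engine (sizes and names like a_md are illustrative; data initialization and error handling are omitted), shows a BF16 matrix multiply that oneDNN can lower to DPAS/XMX on Xe2 hardware:

```cpp
#include <oneapi/dnnl/dnnl.hpp>

// BF16 matmul dispatched through oneDNN; on Xe2 the library selects an
// XMX-backed (DPAS) kernel when the hardware supports it.
int main() {
    using namespace dnnl;
    engine eng(engine::kind::gpu, 0);  // first GPU engine
    stream s(eng);

    const memory::dim M = 64, K = 64, N = 64;
    auto a_md = memory::desc({M, K}, memory::data_type::bf16, memory::format_tag::ab);
    auto b_md = memory::desc({K, N}, memory::data_type::bf16, memory::format_tag::ab);
    auto c_md = memory::desc({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // Library-managed device buffers (left uninitialized in this sketch).
    memory a_mem(a_md, eng), b_mem(b_md, eng), c_mem(c_md, eng);

    auto pd = matmul::primitive_desc(eng, a_md, b_md, c_md);
    matmul(pd).execute(s, {{DNNL_ARG_SRC, a_mem},
                           {DNNL_ARG_WEIGHTS, b_mem},
                           {DNNL_ARG_DST, c_mem}});
    s.wait();
}
```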

GRF (General Register File) is the per-thread register storage. In Xe2, each thread has 128 registers of 32 bytes each, totaling 4KB of private storage. The GRF is the fastest storage tier, with bandwidth on the order of terabytes per second. The compiler places frequently accessed variables in the GRF whenever possible to reduce dependence on slower memory tiers. The total GRF capacity per EU is 32KB (8 threads x 4KB).

Thread Slots implement hardware multithreading. Each EU supports 8 hardware threads residing simultaneously, and the GPU can rapidly switch between them (zero overhead), thereby hiding memory access and instruction pipeline latency. This design is similar to CPU Hyper-Threading but at a much larger scale with lower latency. In the programming model, each work-item (a single element within a sub-group) is ultimately scheduled to execute on a thread slot of some EU.

EU (Execution Unit) Internal Structure: Vector Engine (SIMD8 ALU), XMX Engine (matrix accelerator), GRF (General Register File), Thread Slots (8 concurrent threads)

Memory Hierarchy

GPU performance is often limited by memory bandwidth rather than compute throughput, so understanding the memory hierarchy is key to optimizing GPU code. Xe2 employs a multi-level memory hierarchy, from fastest to slowest: GRF -> SLM -> L1 Cache -> L2 Cache -> System Memory (DRAM).

GRF (General Register File) is the per-thread private register storage with the highest bandwidth (on the order of TB/s) but the smallest capacity (4KB/thread). GRF is used to store hot variables, loop iterators, and temporary computation results. Efficient GPU code maximizes GRF utilization to reduce dependence on external memory.

SLM (Shared Local Memory) is the shared scratchpad storage within an Xe-core, with a capacity of 64KB/Xe-core (Lunar Lake) and bandwidth of approximately ~2 TB/s. SLM corresponds to CUDA’s Shared Memory or OpenCL’s Local Memory. All threads within the same work-group can quickly exchange data through SLM, which is essential for algorithms requiring thread cooperation (such as matrix tiling and reduction operations). Panther Lake increases SLM capacity to 128KB/Xe-core, further enhancing data locality.
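
To make the SLM usage pattern concrete, here is a minimal SYCL work-group reduction staged through SLM via sycl::local_accessor. This is a sketch: sizes are illustrative and the work-group size is assumed to be a power of two:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

// Work-group tree reduction staged through SLM (sycl::local_accessor).
// Each work-group reduces WG elements into one partial sum in dst.
int main() {
    constexpr size_t N = 1024, WG = 128;  // WG must be a power of two here
    std::vector<float> in(N, 1.0f), partial(N / WG);

    sycl::queue q{sycl::gpu_selector_v};
    {
        sycl::buffer<float> in_buf(in.data(), sycl::range<1>{N});
        sycl::buffer<float> out_buf(partial.data(), sycl::range<1>{N / WG});

        q.submit([&](sycl::handler &h) {
            sycl::accessor src(in_buf, h, sycl::read_only);
            sycl::accessor dst(out_buf, h, sycl::write_only);
            // One tile per work-group, carved out of the Xe-core's 64KB SLM.
            sycl::local_accessor<float> tile(sycl::range<1>{WG}, h);

            h.parallel_for(sycl::nd_range<1>{N, WG}, [=](sycl::nd_item<1> it) {
                size_t lid = it.get_local_id(0);
                tile[lid] = src[it.get_global_id(0)];
                // Pairwise reduction entirely inside SLM; barrier before each read.
                for (size_t stride = WG / 2; stride > 0; stride /= 2) {
                    sycl::group_barrier(it.get_group());
                    if (lid < stride) tile[lid] += tile[lid + stride];
                }
                if (lid == 0) dst[it.get_group(0)] = tile[0];
            });
        });
    }  // buffer destructors copy the partial sums back to the host
}
```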

L1 Cache is the first-level cache within the Xe-core, with a capacity of 64KB/Xe-core and bandwidth of approximately ~1 TB/s. L1 Cache is transparent to the programmer and managed automatically by the hardware. It primarily caches frequently accessed global memory data. Unlike SLM, L1 Cache cannot be explicitly controlled, but hit rates can be improved through access patterns (such as coalesced and sequential access).
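
For illustration, the two kernels below contrast a coalesced access pattern with a strided one that wastes most of each cache line. This is a sketch assuming in and out are USM pointers (from sycl::malloc_device or sycl::malloc_shared) and that the caller guarantees n * stride elements exist in the source:

```cpp
#include <sycl/sycl.hpp>

// Coalesced: adjacent work-items touch adjacent addresses, so loads fill
// whole cache lines and L1 hit rates stay high.
void copy_coalesced(sycl::queue &q, const float *in, float *out, size_t n) {
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        out[i] = in[i];
    });
}

// Strided: neighboring work-items hit distant addresses, scattering reads
// across cache lines and discarding most of each line that is fetched.
void copy_strided(sycl::queue &q, const float *in, float *out,
                  size_t n, size_t stride) {
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        out[i] = in[i * stride];
    });
}
```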

L2 Cache is the Slice-level shared cache, shared by all Xe-cores. Lunar Lake’s L2 capacity is 4MB with bandwidth of approximately ~500 GB/s. L2 Cache reduces pressure on main memory access, especially when multiple Xe-cores access the same data. Panther Lake increases L2 capacity to 8MB, further improving cache hit rates.

System Memory (DRAM) is the lowest storage tier and the key difference between iGPUs and discrete GPUs. The Lunar Lake iGPU uses LPDDR5x system memory with bandwidth of approximately ~90 GB/s, shared with the CPU. This means GPU and CPU memory accesses compete for bandwidth, which is the primary bottleneck of iGPU performance. In contrast, discrete GPUs use dedicated HBM or GDDR6 memory with bandwidth exceeding 1 TB/s, not shared with the CPU.

The iGPU’s unified memory architecture has one significant advantage: Zero-Copy. CPU and GPU can directly access the same physical memory without the need for data transfers over the PCIe bus as required by discrete GPUs. This significantly reduces latency in scenarios involving small-scale data or frequent interactions. However, in large-scale parallel computing, the 90 GB/s bandwidth may become a bottleneck, which needs to be mitigated through effective use of SLM and L2 Cache.
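
A minimal SYCL USM sketch of this zero-copy flow follows, assuming a queue on the iGPU; on a discrete GPU the same code still works, but the runtime may migrate pages across the bus instead:

```cpp
#include <sycl/sycl.hpp>

// Zero-copy on an iGPU: malloc_shared memory is visible to both CPU and GPU.
// On Lunar Lake these are the same LPDDR5x pages, so no bus transfer occurs.
int main() {
    sycl::queue q{sycl::gpu_selector_v};
    constexpr size_t N = 1 << 20;

    float *data = sycl::malloc_shared<float>(N, q);
    for (size_t i = 0; i < N; ++i) data[i] = float(i);  // CPU writes in place

    q.parallel_for(sycl::range<1>{N},
                   [=](sycl::id<1> i) { data[i] *= 2.0f; }).wait();

    float check = data[0];  // CPU reads the GPU result, again with no copy
    (void)check;
    sycl::free(data, q);
}
```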

Memory Hierarchy (Lunar Lake iGPU)

  • GRF: 4KB/thread, ~1 TB/s
  • SLM: 64KB/Xe-core, ~2 TB/s
  • L1 Cache: 64KB/Xe-core, ~1 TB/s
  • L2 Cache: 4MB (shared), ~500 GB/s
  • System Memory: LPDDR5x (shared with CPU), ~90 GB/s

⚠️ The iGPU shares system memory bandwidth with the CPU, which is the key bottleneck.

Concept Mapping: Xe2 and CUDA

For developers familiar with NVIDIA CUDA, understanding the conceptual mapping between Intel Xe2 and CUDA can accelerate the learning curve. Although the programming models and hardware designs differ, the underlying principles of parallel computing are shared.

Xe2 vs CUDA Concept Mapping

The mapping below pairs Intel Xe2 concepts with their NVIDIA CUDA counterparts, helping CUDA developers onboard quickly.

  • Execution Unit (EU) ↔ CUDA Core
  • Xe-core ↔ Streaming Multiprocessor (SM)
  • Shared Local Memory (SLM) ↔ Shared Memory
  • Sub-group ↔ Warp
  • Work-group ↔ Thread Block
  • Work-item ↔ Thread
  • General Register File (GRF) ↔ Register File
  • XMX Engine ↔ Tensor Core
  • Level Zero API ↔ CUDA Runtime
  • SPIR-V ↔ PTX / SASS

Several key mappings deserve special attention:

EU vs. CUDA Core: Intel’s Execution Unit is functionally richer than NVIDIA’s CUDA Core. An EU not only contains a SIMD vector engine but also integrates an XMX matrix engine and 8 hardware thread slots. In comparison, a CUDA Core primarily handles scalar or vector operations, with matrix acceleration relying on separate Tensor Cores.

Xe-core vs. SM (Streaming Multiprocessor): Both Xe-core and SM are compute clusters containing multiple execution units and shared memory. An Xe-core contains 16 EUs and 64KB SLM, while an NVIDIA SM (such as in the Ada Lovelace architecture) contains 128 CUDA Cores and up to 228KB Shared Memory. Both share a similar design philosophy aimed at enabling efficient thread cooperation.

SLM vs. Shared Memory: This is the most direct mapping. Intel’s Shared Local Memory and CUDA’s Shared Memory are functionally equivalent, both used for fast data exchange between threads within a work-group/thread block. The programming patterns are also similar, both requiring explicit declaration and synchronization.

Sub-group vs. Warp: A Sub-group is Intel’s SIMD execution unit, typically consisting of 8, 16, or 32 work-items (the supported width can be queried from hardware). A Warp is NVIDIA’s fixed 32-thread SIMD unit. Both execute in lockstep, meaning all threads in the same sub-group/warp execute the same instruction. Understanding this is crucial for avoiding branch divergence.
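
The sketch below shows both sides of this: querying the supported sub-group widths from the device, and a lockstep sum across a sub-group, roughly the analogue of a CUDA warp-level reduction. The work-group and buffer sizes are illustrative:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

// Sub-groups are Xe2's warp analogue; unlike CUDA's fixed width of 32,
// the supported widths are a queryable hardware property.
int main() {
    sycl::queue q{sycl::gpu_selector_v};

    for (size_t s : q.get_device().get_info<sycl::info::device::sub_group_sizes>())
        std::cout << "Supported sub-group size: " << s << "\n";

    sycl::buffer<float> out_buf{sycl::range<1>{128}};
    q.submit([&](sycl::handler &h) {
        sycl::accessor out(out_buf, h, sycl::write_only);
        h.parallel_for(sycl::nd_range<1>{128, 32}, [=](sycl::nd_item<1> it) {
            auto sg = it.get_sub_group();
            // Lockstep sum across the sub-group, in the spirit of a
            // warp-level reduction in CUDA.
            out[it.get_global_id(0)] =
                sycl::reduce_over_group(sg, 1.0f, sycl::plus<float>());
        });
    }).wait();
}
```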

XMX vs. Tensor Core: Both are dedicated matrix acceleration hardware, but with different programming interfaces. NVIDIA uses WMMA (Warp Matrix Multiply-Accumulate) or Tensor Core intrinsics, while Intel uses DPAS (Dot Product Accumulate Systolic) instructions. High-level libraries (such as oneDNN and cuDNN) automatically invoke these hardware accelerators, but writing custom kernels requires learning the respective APIs.

Level Zero vs. CUDA Runtime: Level Zero is Intel’s low-level GPU programming interface, providing fine-grained hardware control, corresponding to the CUDA Driver API. Intel also provides the higher-level SYCL (similar to the CUDA Runtime API), suitable for rapid development. SPIR-V is the intermediate representation (IR) for Intel GPUs, corresponding to NVIDIA’s PTX.
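
For a feel of the Level Zero register, here is a minimal device-discovery sketch, roughly the analogue of cuInit/cuDeviceGet in the CUDA Driver API. Error checking is omitted for brevity:

```cpp
#include <level_zero/ze_api.h>
#include <cstdio>

// Minimal Level Zero device discovery.
int main() {
    zeInit(ZE_INIT_FLAG_GPU_ONLY);

    // Retrieve the first driver handle.
    uint32_t count = 1;
    ze_driver_handle_t driver = nullptr;
    zeDriverGet(&count, &driver);

    // Retrieve the first GPU device under that driver.
    count = 1;
    ze_device_handle_t device = nullptr;
    zeDeviceGet(driver, &count, &device);

    // Query and print the device name.
    ze_device_properties_t props{};
    props.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
    zeDeviceGetProperties(device, &props);
    std::printf("Device: %s\n", props.name);
}
```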

With these mappings understood, CUDA developers can get up to speed with Intel GPU programming more quickly. Many CUDA optimization techniques (such as coalesced memory access, reducing branch divergence, and leveraging shared memory) apply equally to Intel GPUs.

Lunar Lake vs Panther Lake

Intel’s Xe2 architecture is employed in both the Lunar Lake and Panther Lake generations, but with significant differences in specifications and performance. Comparing these two generations helps us understand the evolution direction and performance expectations of Intel iGPUs.

Generation Comparison: Lunar Lake vs Panther Lake

  • EU Count: 128 (Lunar Lake) vs. 160 (Panther Lake)
  • XMX TOPS (INT8): 67 TOPS vs. 96 TOPS
  • SLM per Xe-core: 64 KB vs. 128 KB
  • L2 Cache: 4 MB vs. 8 MB
  • Memory Bandwidth: ~90 GB/s vs. ~120 GB/s

EU Count: Lunar Lake has 128 EUs (8 Xe-cores x 16 EUs), while Panther Lake is expected to increase to 160 EUs (10 Xe-cores x 16 EUs), a 25% increase. More EUs mean higher parallelism, especially in large-scale batch inference scenarios where more requests can be processed simultaneously.

XMX TOPS (AI Compute): Lunar Lake’s peak XMX compute is 67 TOPS (INT8), while Panther Lake increases to 96 TOPS, a 43% improvement. This increase comes not only from the higher EU count but also from microarchitectural optimizations to the XMX engine itself. For matrix-intensive workloads such as Transformer inference and convolutional neural networks, this improvement directly translates to higher throughput.

SLM (Shared Local Memory): Lunar Lake has 64KB SLM per Xe-core, while Panther Lake doubles it to 128KB, a 100% increase. This is one of the most significant improvements. Larger SLM allows for bigger work-groups, greater data locality, and reduced dependence on global memory. For algorithms requiring thread cooperation (such as tiled matrix multiplication and fast Fourier transforms), this improvement delivers substantial performance gains.

L2 Cache: Lunar Lake’s L2 cache is 4MB, while Panther Lake increases to 8MB, a 100% improvement. A larger L2 cache can hold more of the working data set, reducing system memory accesses. In multi-Xe-core scenarios, L2 cache hit rate directly impacts overall performance.

Memory Bandwidth: Lunar Lake uses LPDDR5x-7500 with bandwidth of approximately 90 GB/s, while Panther Lake is expected to upgrade to LPDDR5x-8000 or higher, with bandwidth of approximately 120 GB/s, a 33% increase. While this bandwidth remains far below that of discrete GPU HBM (~1 TB/s), it is a significant improvement for an iGPU. Higher bandwidth supports larger models and greater batch sizes.

Overall Assessment: Panther Lake shows significant improvements across all key metrics, especially the doubling of SLM and L2 cache, indicating that Intel recognizes the importance of the memory hierarchy for iGPU performance. For AI inference, Panther Lake is better suited for running medium-scale models (such as BERT-Base and ResNet-50) and batch inference services. Lunar Lake is more appropriate for lightweight inference, prototype validation, and single-request inference scenarios.

Learning Resources

In-depth study of the Intel Xe2 architecture and iGPU programming calls for official documentation, technical white papers, and community resources. The following is a curated selection:

Programming Frameworks:

  • Intel oneAPI Toolkit: Includes the DPC++ (SYCL) compiler, oneDNN (deep learning library), Level Zero runtime, and the complete toolchain.
  • SYCL 2020 Specification: The Khronos standard for cross-platform parallel programming, Intel’s primary programming interface.
  • Level Zero Programming Guide: Low-level GPU programming interface for scenarios requiring fine-grained control.

Performance Analysis Tools:

  • Intel VTune Profiler: Supports CPU and GPU performance analysis, visualizing EU utilization, memory bandwidth, thread scheduling, and other metrics.
  • Intel Advisor: Code modernization and parallelization recommendation tool to help optimize GPU code.

Summary

The Intel Xe2 GPU architecture represents Intel’s strategic positioning in high-performance computing and AI. Through its four-level hierarchy (GPU -> Slice -> Xe-core -> EU), powerful XMX matrix engine, multi-level memory hierarchy, and unified memory architecture, the Xe2 iGPU demonstrates unique value in lightweight inference, local development, and edge computing scenarios.

While iGPU absolute compute power cannot match high-end discrete GPUs, its ubiquity (nearly every laptop has one), zero additional cost, and zero-copy memory access make it an ideal platform for individual developers and enterprises for rapid prototyping. The specification comparison between Lunar Lake and Panther Lake shows that Intel is continuously improving iGPU AI compute and memory subsystems, with the potential to support larger models and more complex inference workloads in the future.

For CUDA developers, understanding the Xe2-to-CUDA concept mapping enables rapid knowledge transfer. For Intel GPU newcomers, mastering the SYCL and Level Zero programming models, becoming familiar with the oneAPI toolchain, and leveraging VTune and Advisor for performance analysis are key to efficient GPU application development.

In the following articles, we will explore the oneAPI programming model, SYCL language features, and how to efficiently run LLM inference and training workloads on Intel iGPUs.