NPU Architecture and GPU+NPU Co-Inference
Updated 2026-04-15
Introduction
In Intel Core Ultra processors, alongside the Xe2 integrated graphics (iGPU), there is a dedicated AI inference accelerator called the NPU (Neural Processing Unit), commercially branded as AI Boost. The NPU and iGPU have different design objectives: the NPU is optimized for low-power, fixed-topology neural network inference scenarios, while the iGPU offers greater computational flexibility and broader model support.
OpenVINO provides device plugins such as AUTO, MULTI, and HETERO that allow developers to leverage the strengths of both NPU and GPU within a single inference session. AUTO mode automatically selects the optimal device, MULTI mode distributes inference requests in parallel across multiple devices, and HETERO mode splits the computation graph across different devices based on operator support. These mechanisms enable developers to flexibly balance power consumption, latency, and throughput.
This article provides a detailed introduction to the Intel NPU’s hardware architecture, compares NPU and iGPU use cases, provides an in-depth analysis of OpenVINO’s multi-device inference strategies, and uses interactive visualizations to demonstrate the performance and power trade-offs of GPU+NPU co-inference.
Intel NPU Architecture Overview
The Intel NPU is based on the NCE (Neural Compute Engine) architecture, consisting of multiple dedicated inference compute units. Unlike the general-purpose GPU Execution Units, the NCE is hardware-optimized for common neural network operators such as convolution, matrix multiplication, and activation functions, eliminating the overhead of general-purpose instruction decoding and scheduling.
NCE Cluster Architecture
The NCE is the core compute cluster of the NPU, containing two types of execution units — DPU and SHAVE — along with a DMA engine for data transfer. This architecture has been consistent from 37xx (Meteor Lake) through 40xx (Lunar Lake) to 50xx (Panther Lake); SHAVE was never replaced, but rather works alongside DPU:
NPU Tile (NCE Cluster)
├── DPU — Fixed-function hardware (convolution, matrix multiplication)
│ ├── IDU (Input Data Unit) — Input data reading
│ ├── MPE (Matrix Processing Element) — Matrix compute core
│ ├── PPE (Post-Processing Engine) — Post-processing (scaling, bias addition)
│ └── ODU (Output Data Unit) — Output data writing
├── SHAVE_NN — Programmable vector processor (NN operators: softmax, RoPE, attention kernels)
├── SHAVE_ACT — Programmable vector processor (activation functions: ReLU, GELU, SiLU)
└── DMA — Data transfer engine (DDR ↔ CMX)
DPU is the specialist — it executes fixed operations (convolution, matrix multiplication) at extreme speed but is not programmable. SHAVE is the generalist — a programmable vector processor handling flexible computations that DPU cannot perform (e.g., softmax, LayerNorm, RoPE positional encoding). The compiler decides each operator’s execution target layer by layer: DPU first → DMA → SHAVE.
Verification source: The Config_ExecutorKind enum in the npu_compiler source defines five executor types: DMA_NN, NCE, DPU, SHAVE_NN, and SHAVE_ACT. The files NPU40XX/shave_kernel_info.cpp and NPUReg40XX/ops/act_shave_rt.cpp exist in the 40xx code path, confirming that SHAVE remains active in the 40xx architecture.
DPU Internal Units
The DPU contains a streamlined pipeline with four units, each handling a specific stage:
- IDU (Input Data Unit): Reads input tensors from CMX, aligning data to the format required by DPU (e.g., NHWC blocked layout)
- MPE (Matrix Processing Element): Performs the actual MAC (Multiply-Accumulate) computation, supporting INT8/FP16 precision
- PPE (Post-Processing Engine): Performs scaling (multiply by constant) and bias addition immediately after matrix multiplication, without requiring a separate task
- ODU (Output Data Unit): Writes computed results back to CMX, supporting output format conversion
Verification source: The 40xx DPU four-unit structure can be found in vpu_nce_hw_40xx.h and the expand_dpu_config/ directory in npu_compiler.
CMX/DDR Two-Level Memory Hierarchy
The NPU has a two-level memory hierarchy — understanding this distinction is essential for subsequent articles:
- DDR (System Memory): Large capacity (GB-scale), slow access. Stores model weights, KV cache, and input/output tensors
- CMX (Connection MatriX): NPU on-chip high-speed SRAM. Each NPU tile has its own CMX with small capacity (KB~MB scale) but extremely fast access. Data must reside in CMX for DPU and SHAVE to compute on it
- DMA engines handle data transfer between DDR ↔ CMX, and the transfer schedule is determined at compile time — a key characteristic of the NPU’s static execution model
This two-level structure means the NPU cannot directly access large-capacity storage like a GPU can. Before each computation, DMA must first move data from DDR into CMX. The compiler plans all transfer timing and ordering at compile time.
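To make this static execution model concrete, here is a small conceptual sketch (plain Python, not the real NPU compiler) of how a tensor larger than CMX could be broken into tiles, with the full DMA-in / compute / DMA-out order fixed before execution starts; the CMX capacity and tile size are illustrative numbers only:

    # Conceptual sketch, not the real NPU compiler: the schedule is a fixed list
    # of tasks produced ahead of time, so nothing is decided at run time.
    CMX_BYTES = 2 * 1024 * 1024                    # assumed per-tile CMX capacity

    def build_schedule(tensor_bytes, tile_bytes):
        assert tile_bytes <= CMX_BYTES, "each tile must fit in CMX"
        tasks = []
        for offset in range(0, tensor_bytes, tile_bytes):
            tasks.append(("DMA_IN", offset))       # DDR -> CMX
            tasks.append(("COMPUTE", offset))      # DPU/SHAVE work on the tile in CMX
            tasks.append(("DMA_OUT", offset))      # CMX -> DDR
        return tasks                               # static task list, known at compile time

    sched = build_schedule(tensor_bytes=8 * 1024 * 1024, tile_bytes=1024 * 1024)
    print(len(sched), "tasks, first three:", sched[:3])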
40xx Management Core
Different NPU generations use different management cores responsible for receiving host commands, reading task lists, and scheduling DMA/DPU/SHAVE execution:
- 37xx (Meteor Lake): Uses a Leon/SPARC management core
- 40xx (Lunar Lake): Upgraded to a RISC-V management core
Source: vpu_jsm_job_cmd_api.h mentions "RISC-V facilitates cache-bypass, memory access".
The management core’s key responsibility is: at runtime, it reads task descriptors one by one, checks barrier synchronization conditions (producer/consumer counts), and dispatches tasks to the corresponding execution unit (DMA, DPU, or SHAVE) once conditions are met.
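A rough, purely illustrative sketch of this dispatch loop (ordinary Python, not firmware code; task and barrier names are made up) may help picture how barrier producer counts gate execution:

    # Conceptual sketch: a management core walks a task list and releases tasks
    # to DMA/DPU/SHAVE queues once the barriers they wait on have opened.
    from dataclasses import dataclass, field

    @dataclass
    class Barrier:
        producers_left: int                           # producer tasks that still need to signal

    @dataclass
    class Task:
        name: str
        unit: str                                     # "DMA", "DPU", or "SHAVE"
        waits_on: list = field(default_factory=list)  # barrier ids this task consumes
        signals: list = field(default_factory=list)   # barrier ids this task produces

    def run_task_list(tasks, barriers):
        pending = list(tasks)
        while pending:
            ready = [t for t in pending
                     if all(barriers[b].producers_left == 0 for b in t.waits_on)]
            for task in ready:
                print(f"dispatch {task.name} -> {task.unit}")  # hand off to hardware queue
                for b in task.signals:                         # completion signals the barrier
                    barriers[b].producers_left -= 1
                pending.remove(task)

    # DMA loads a tile, then DPU computes on it, then SHAVE applies the activation.
    barriers = {0: Barrier(1), 1: Barrier(1)}
    run_task_list(
        [Task("load_tile", "DMA", signals=[0]),
         Task("conv", "DPU", waits_on=[0], signals=[1]),
         Task("gelu", "SHAVE", waits_on=[1])],
        barriers,
    )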
NPU Hardware Overview
Putting all components together, the complete NPU hardware architecture is:
x86 CPU (host)
│
│ DRM ioctl / Level Zero API
│
NPU Chip (40xx Lunar Lake)
├── RISC-V Management Core — Receives host commands, reads task lists, schedules execution
├── DMA Engine — DDR ↔ CMX data transfer
├── NCE Cluster(s)
│ ├── DPU — Matrix multiplication, convolution (fixed-function)
│ └── SHAVE — softmax, RoPE, activation functions (programmable)
└── CMX — On-chip high-speed SRAM
In the Intel Core Ultra (Lunar Lake), the NPU’s peak compute capability is approximately 48 TOPS (INT8), but its core advantage lies in energy efficiency: for the same inference task, the NPU’s power consumption is typically only 1/3 to 1/5 that of the iGPU. This makes the NPU the preferred choice for battery-life-sensitive scenarios such as laptops and mobile devices.
NPU vs iGPU Use Cases
The NPU and iGPU have fundamental design differences, and therefore suit different inference scenarios. The radar chart below compares their performance across five key dimensions.
NPU Advantage Scenarios:
- Low-power sustained operation: For example, real-time speech recognition (Whisper), continuous background vision tasks (face detection, OCR), requiring the device to operate for extended periods on battery power.
- Fixed-topology small models: Models with stable structure, high operator support coverage, and smaller parameter counts (e.g., MobileNet, EfficientNet, small Transformers).
- Latency-sensitive single inference: The NPU’s on-chip SRAM and pipeline design result in extremely low single-inference startup overhead, suitable for interaction scenarios requiring fast response.
iGPU Advantage Scenarios:
- Large models, high throughput: LLMs with 7B+ parameters, or vision tasks requiring large-batch parallel inference — the iGPU has advantages in memory capacity and parallelism.
- Dynamic graphs, custom operators: Models containing operators not supported by the NPU (e.g., sparse attention, custom compute kernels), or inference with dynamically changing graph structure.
- Plugged-in scenarios, performance-first: On desktops or plugged-in laptops where power constraints are relaxed, the iGPU delivers higher absolute performance.
In practical deployment, developers often need to weigh the trade-offs between NPU and iGPU based on the specific model, hardware environment, and application scenario. OpenVINO’s multi-device inference mechanisms make this process more flexible.
OpenVINO Device Plugin System
OpenVINO’s Device Plugin is the abstraction layer between the inference runtime and specific hardware. Each Plugin is responsible for mapping the compiled computation graph (Compiled Model) to the target device’s (GPU, NPU, CPU) execution engine.
In addition to single-device plugins (such as GPU, NPU, CPU), OpenVINO provides three multi-device coordination modes:
- AUTO Mode: Automatic device selection. During the first inference, OpenVINO queries all available devices’ capabilities (Capability Query), runs lightweight benchmarks, and then selects the device with the lowest latency or highest throughput.
- MULTI Mode: Multi-device parallel inference. Distributes inference requests in round-robin fashion across multiple devices (e.g., GPU and NPU), with each device independently executing the complete computation graph, and results aggregated before returning. Suitable for throughput-sensitive batch processing scenarios.
- HETERO Mode: Heterogeneous subgraph partitioning. Splits the computation graph into multiple subgraphs based on operator support, executing them on different devices. For example, NPU-supported convolution and matrix multiplication run on the NPU, while unsupported custom operators run on the GPU, connected through cross-device memory copies.
The interactive flowchart below illustrates how these three modes work.
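In code, the three modes are selected through the device string passed to compile_model. A minimal Python sketch (the IR path is a placeholder; which devices are actually available depends on the system):

    import openvino as ov

    core = ov.Core()
    print(core.available_devices)             # e.g. ['CPU', 'GPU', 'NPU']
    model = core.read_model("model.xml")      # placeholder IR path

    # AUTO: OpenVINO picks one device from the candidate list
    auto_model = core.compile_model(model, "AUTO:NPU,GPU,CPU")

    # MULTI: inference requests are load-balanced across all listed devices
    multi_model = core.compile_model(model, "MULTI:GPU,NPU")

    # HETERO: the graph is split by operator support, in priority order
    hetero_model = core.compile_model(model, "HETERO:NPU,GPU")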
AUTO Plugin In Detail
The AUTO plugin is designed for “out-of-the-box” optimal performance. When developers specify device="AUTO", OpenVINO executes the following steps:
- Device Enumeration: Scans all available inference devices in the current system (GPU, NPU, CPU), querying their hardware specifications and driver versions.
- Capability Query: Sends Capability Queries to each device’s Plugin to obtain supported operator lists, precision support (FP32/FP16/INT8), memory limits, and other information.
- Benchmarking: If the configuration file has no cached performance data, AUTO runs one or more inferences on each device, measuring latency and throughput.
- Device Selection: Based on the ov::hint::PerformanceMode configuration (latency-first or throughput-first), selects the optimal device. For example, LATENCY mode selects the device with the lowest latency (typically NPU), while THROUGHPUT mode selects the device with the highest throughput (typically GPU).
- Fallback Mechanism: If the preferred device is unavailable (e.g., driver not installed, insufficient memory), AUTO automatically falls back to the next best device.
The advantage of AUTO mode is simplifying device selection complexity — developers can achieve reasonable performance without explicitly specifying a device. However, its limitation is that it can only select one device and cannot simultaneously leverage multiple devices’ compute power.
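As an illustration, the performance hint described above is passed as compile-time configuration; a minimal Python sketch (the IR path is a placeholder, and the property names follow the current OpenVINO Python API):

    import openvino as ov
    import openvino.properties.hint as hints

    core = ov.Core()
    model = core.read_model("model.xml")      # placeholder IR path

    # Latency-first: AUTO favors the device with the lowest single-request latency
    latency_model = core.compile_model(
        model, "AUTO", {hints.performance_mode: hints.PerformanceMode.LATENCY})

    # Throughput-first: AUTO favors the device with the highest sustained throughput
    throughput_model = core.compile_model(
        model, "AUTO", {hints.performance_mode: hints.PerformanceMode.THROUGHPUT})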
GPU+NPU Hybrid Inference Pipeline
In certain scenarios, a single device cannot meet all requirements: for example, the model contains operators not supported by the NPU, or the NPU’s memory is insufficient for the entire model. In these cases, the HETERO plugin can partition the computation graph into multiple subgraphs, executing them on different devices.
The core logic of HETERO is Subgraph Partitioning:
- Operator Support Query: Traverses each operator (Operation) in the computation graph, querying whether the NPU Plugin supports it. For example, Convolution, MatMul, and ReLU are typically supported, while NonMaxSuppression and TopK may not be.
- Subgraph Partitioning: Assigns contiguous regions of supported operators (Subgraphs) to the NPU, and unsupported regions to GPU or CPU. Partition points are the cross-device boundaries.
- Cross-Device Data Transfer: Inserts memory copy operators (MemCopy) at partition points to transfer intermediate activation tensors from one device’s memory to another. This step introduces additional latency and power overhead.
- Scheduled Execution: Each subgraph executes on its corresponding device, either serially or in parallel, with OpenVINO’s Scheduler managing dependencies and synchronization.
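OpenVINO exposes the operator-support query directly, which makes the partitioning visible before committing to HETERO. A small Python sketch (the IR path is a placeholder):

    import openvino as ov

    core = ov.Core()
    model = core.read_model("model.xml")          # placeholder IR path

    # query_model reports which operators the NPU plugin can take
    supported = core.query_model(model, "NPU")    # {op_friendly_name: "NPU", ...}
    npu_ops = [name for name, dev in supported.items() if dev == "NPU"]
    print(f"NPU can execute {len(npu_ops)} of {len(model.get_ops())} operators")

    # HETERO then assigns NPU-supported subgraphs to NPU and the rest to GPU
    compiled = core.compile_model(model, "HETERO:NPU,GPU")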
The interactive component below shows a simplified Transformer model being partitioned between NPU and GPU. You can drag the slider to adjust partition points and observe changes in NPU/GPU load, communication overhead, and latency.
Key Insights:
- Fewer partition points is better: Each additional partition point requires transferring intermediate results between devices, introducing latency and power overhead. Ideally, the computation graph should be divided into as few contiguous subgraphs as possible.
- Load balancing: If one device is overloaded (e.g., NPU receives 80% of the computation), it becomes the bottleneck; overall latency is optimal when load is balanced.
- Model design considerations: During model training, operator selection (e.g., using standard convolution instead of Deformable Convolution) can improve NPU operator support coverage and reduce the number of partition points.
Power and Performance Trade-offs
In real-world deployment, developers need to balance power consumption, latency, and throughput. The chart below shows power-performance curves for different device configurations (NPU-only, GPU-only, GPU+NPU hybrid) across different model scales.
Key Findings:
- NPU-only: Lowest power consumption (5-12W), but throughput is limited by NPU parallelism and memory capacity. Suitable for small models and battery-life-sensitive scenarios.
- GPU-only: Highest throughput (70-80 infer/s), but also highest power consumption (18-50W). The GPU’s performance advantage becomes more pronounced as model size increases.
- GPU+NPU Hybrid: Provides the best energy efficiency (throughput/watt) in most scenarios. Placing power-sensitive frontend operators (such as Token Embedding, shallow Attention) on the NPU and compute-intensive backend operators (such as deep FFN) on the GPU achieves a balance between power and performance.
Deployment Recommendations:
- Battery-life first (laptop on battery, mobile devices): Prefer NPU-only or AUTO mode (Latency optimization).
- Performance first (desktops, plugged-in laptops): Prefer GPU-only or MULTI mode (Throughput optimization); a MULTI sketch follows this list.
- Flexible deployment (cloud, edge hybrid): Use HETERO mode, dynamically adjusting partition points based on runtime power and performance monitoring.
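For the performance-first case above, a minimal sketch of MULTI combined with an asynchronous request pool (the IR path, job count, and request count are placeholders; assumes a single FP32 input with a static shape):

    import numpy as np
    import openvino as ov

    core = ov.Core()
    model = core.read_model("model.xml")                     # placeholder IR path
    compiled = core.compile_model(model, "MULTI:GPU,NPU")

    # A pool of in-flight requests keeps both devices busy at the same time
    queue = ov.AsyncInferQueue(compiled, jobs=8)
    dummy = np.zeros(list(compiled.input(0).shape), dtype=np.float32)  # assumes static shape
    for _ in range(64):
        queue.start_async({0: dummy})
    queue.wait_all()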
Summary
Intel Core Ultra’s NPU and iGPU provide developers with multiple choices for power and performance. The NPU’s low power consumption and low latency make it ideal for mobile and embedded scenarios, while the iGPU’s high throughput and flexibility suit large models and plugged-in scenarios.
OpenVINO’s AUTO, MULTI, and HETERO plugin mechanisms further enhance flexibility. AUTO mode allows developers to achieve reasonable performance without explicitly selecting a device, MULTI mode improves throughput through parallel distribution, and HETERO mode enables cross-device collaboration through subgraph partitioning. In practical deployment, developers should choose appropriate device configurations and inference strategies based on model characteristics, hardware environment, and application requirements.
As NPU hardware capabilities continue to improve (e.g., FP16 support, dynamic shapes, sparsification) and the OpenVINO software stack continues to be optimized, GPU+NPU co-inference will become an important trend in AI application deployment. Mastering the principles and best practices of multi-device inference is a key capability for building efficient, low-power AI systems.
Further Reading
- The MULTI device documentation in the official OpenVINO docs provides detailed guidance on multi-device parallel inference configuration and performance tuning strategies.
- Intel’s NPU device documentation lists NPU-supported operators, precision configuration, and compilation options.
- The HETERO heterogeneous execution documentation provides sample code and debugging methods for subgraph partitioning.
- Review Intel’s AI PC white papers and Core Ultra technical specifications for NPU deployment cases and performance data in real products.