
NPU Execution Model and the Boundaries of Its Programming Model

Updated 2026-04-15

Introduction

In the previous article, we saw how the NPU runs LLMs: a pre-allocated KV cache, two separate models (prefill + generate), and a three-layer software stack. One key question, however, remained unexplored: how exactly do these compiled blobs execute on the NPU hardware?

This article dives deep into the NPU’s execution model — how a 32-layer Transformer is unrolled into a flat task list, how three pipelines (DMA/DPU/SHAVE) cooperate through barriers to hide memory latency, and how the compiler selects the optimal implementation path for attention. Finally, we step back from the specifics to reflect on the ceilings of the NPU programming model: what ONNX can and cannot express, and the potential transformation that a CuTe-style DSL could bring.

Executing a 32-Layer Transformer: One Blob, No Loops

CMX Capacity Constraints

As mentioned in the NPU Architecture Overview, the NPU’s on-chip high-speed storage CMX has limited capacity (on the kilobyte-to-megabyte scale). A single layer of a 7B-parameter LLM can have weights on the order of tens of MB — far more than CMX can hold.

The solution is streaming: load only one layer’s weights from DDR into CMX at a time, compute, release the CMX space, then load the next layer. This stands in stark contrast to GPUs, which load all weights into VRAM at once and access them directly during computation.

Compile-Time Unrolling: No Loops, Only a Task List

npu_compiler fully unrolls the 32-layer Transformer into a flat task list — potentially containing thousands of DMA + DPU + SHAVE tasks. There are no explicit boundaries between layers and no host-side for loops.

The entire blob execution flow is:

  1. The host loads the blob into NPU-accessible memory
  2. The host submits a pointer (the MappedInference struct) to the NPU
  3. The NPU runs fully autonomously: the management core (RISC-V on 40xx) reads task descriptors one by one, checks barrier conditions, and dispatches to DMA/DPU/SHAVE
  4. After all tasks complete, the management core writes the fence value to notify the host

This contrasts sharply with the GPU’s kernel launch model: on a GPU, each kernel is a host-to-device dispatch, and the host always retains control; the NPU is a single submission followed by complete hands-off until all layers finish executing.

Analogy: A GPU is like a director calling “Action” shot by shot on set. An NPU is like having the entire film’s shooting plan laid out in advance — once you hit play, the director can walk away.

DMA/Compute Pipeline Overlap

The Core Optimization: Hiding Transfer Latency

If DMA, DPU, and SHAVE executed strictly in sequence (transfer data -> compute matrix ops -> compute activations -> transfer next layer…), a huge amount of time would be wasted waiting. The NPU’s key performance optimization is three parallel pipelines:

  • While DMA transfers layer N+1’s weights, DPU is computing layer N
  • While SHAVE executes layer N’s activation functions, DMA may already be transferring layer N+2’s data

This overlap is achieved through barrier synchronization: the management core checks each barrier’s producer/consumer counts and only dispatches subsequent tasks once their prerequisites are complete.
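
To make the mechanism concrete, here is a minimal Python sketch of that dispatch logic. The Task and Barrier classes and their fields are illustrative, not the driver’s actual descriptor format, and real hardware dispatches tasks to the three engines asynchronously, whereas this toy loop completes each task inline.

```python
from dataclasses import dataclass, field

@dataclass
class Barrier:
    producers: int        # tasks that must signal before the barrier opens
    consumers: int        # lets hardware know when the barrier can be reused (not modeled here)
    produced: int = 0

    def ready(self) -> bool:
        return self.produced >= self.producers

@dataclass
class Task:
    engine: str                                   # "DMA", "DPU" or "SHAVE"
    waits_on: list = field(default_factory=list)  # barrier indices that must be open
    signals: list = field(default_factory=list)   # barrier indices updated on completion

def run_blob(tasks: list[Task], barriers: list[Barrier]) -> None:
    """Conceptual management-core loop: walk the flat task list and
    dispatch a task only when all of its wait-barriers are satisfied."""
    pending = list(tasks)
    while pending:
        progressed = False
        for task in list(pending):
            if all(barriers[b].ready() for b in task.waits_on):
                for b in task.signals:               # "complete" the task immediately
                    barriers[b].produced += 1
                pending.remove(task)
                progressed = True
        if not progressed:
            raise RuntimeError("deadlock: no barrier can be satisfied")

# Layer N in two tasks: DMA loads weights (signals barrier 0), DPU computes (waits on 0)
barriers = [Barrier(producers=1, consumers=1)]
run_blob([Task("DMA", signals=[0]), Task("DPU", waits_on=[0])], barriers)
```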

[Figure: NPU pipeline overlap — DMA data transfer, DPU matrix compute, and SHAVE activation functions on a shared timeline; dashed lines mark synchronization barriers (data dependencies). In this example, serial execution takes 24 time units versus 15 when pipelined, a 1.6x speedup.]

FeasibleMemoryScheduler: Four Core Decisions at Compile Time

This pipeline parallelism is not determined dynamically at runtime — it is fully planned at compile time by the FeasibleMemoryScheduler:

1. CMX Capacity Management

A linear scan algorithm tracks CMX usage at each time step, ensuring it never exceeds physical capacity. Each DMA-in operation increases usage; each completed computation releases the corresponding space.
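
A toy version of that bookkeeping, assuming a purely illustrative CMX budget and a list of (alloc_step, free_step, size) intervals rather than the compiler’s real data structures:

```python
CMX_BYTES = 2 * 1024 * 1024          # illustrative capacity, not a real figure

def peak_cmx_usage(buffers):
    """buffers: list of (alloc_step, free_step, size_bytes). Each buffer is live
    from the step its DMA-in completes until the last compute step that reads it."""
    last_step = max(free for _, free, _ in buffers)
    peak = 0
    for step in range(last_step + 1):            # linear scan over time steps
        live = sum(size for alloc, free, size in buffers if alloc <= step < free)
        if live > CMX_BYTES:
            raise MemoryError(f"CMX over budget at step {step}: {live} bytes")
        peak = max(peak, live)
    return peak
```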

2. Prefetch Depth

The prefetchingLevelLimit parameter controls how many steps ahead DMA begins transferring data. Greater prefetch depth means better latency hiding but also higher CMX pressure — a classic trade-off.

3. Dynamic Spilling

When CMX is full but new data needs to be loaded, the compiler selects data that is not immediately needed and temporarily writes it back to DDR (spill out), freeing CMX space. When needed later, it is loaded back (spill in).

4. Ping-Pong Buffering

Two CMX buffers alternate: one is read by DPU/SHAVE for computation while DMA writes new data into the other. Computation and transfer fully overlap with no waiting required.
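
A minimal sketch of the ping-pong idea; dma_load and compute are stand-ins for asynchronous engine operations, and in hardware the prefetch and the compute on the other buffer run concurrently, while this Python version is necessarily sequential.

```python
def stream_layers(layers, dma_load, compute):
    """Double-buffered layer streaming: while one CMX buffer is being computed
    on, the DMA engine fills the other."""
    buffers = [None, None]                 # two CMX-resident weight buffers
    buffers[0] = dma_load(layers[0])       # prime the first buffer
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            buffers[(i + 1) % 2] = dma_load(layers[i + 1])   # prefetch next layer
        compute(layer, buffers[i % 2])     # in hardware this overlaps with the prefetch above
```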

Comparison with GPU

There is an interesting symmetry here (related to the CUDA Programming Model):

|  | NPU | GPU |
| --- | --- | --- |
| Goal | Hide DDR->CMX transfer latency | Hide Global Memory->SM latency |
| Mechanism | DMA prefetch (planned at compile time) | Streams + async memcpy (scheduled at runtime) |
| Decision timing | Fully determined at compile time | Decided dynamically at runtime by the warp scheduler |
| Analogy | Playing a pre-recorded vinyl record | A jazz band improvising live |

The concept is the same — hiding memory latency. But the mechanisms are opposite — GPUs use dynamic runtime scheduling, while NPUs have everything planned at compile time (zero runtime decisions).

Three Attention Paths on the NPU

Attention is the core computation of the Transformer and the part that the NPU compiler must optimize most carefully. Depending on the inference phase and hardware capabilities, the compiler chooses among three implementation paths:

Path 1: Decompose SDPA

The fallback path when the target hardware does not support a dedicated attention operator. It decomposes Scaled Dot-Product Attention into a sequence of independent operations:

  1. Q × K^T -> DPU (NCE MatMul)
  2. × scale -> DPU (PPE, fused in MatMul post-processing)
  3. + attention_mask -> DPU/SHAVE (PPE can only add per-channel constants; a 2D mask requires SHAVE)
  4. Softmax -> SHAVE (no corresponding DPU fixed-function hardware)
  5. × V -> DPU (NCE MatMul)

Intermediate results must be transferred between CMX/DDR (unless Vertical Fusion is active — see the next section).

Path 2: Flash SDPA

Suitable for long sequences in the prefill phase. The entire attention computation runs as a single SHAVE kernel, with mask handling fused internally.

The core idea: tile the KV cache along the seq_len dimension, compute local attention independently for each tile, while maintaining three rolling states:

  • running_output: The accumulated weighted output
  • running_max: The current maximum (for numerically stable online softmax)
  • running_sum: The normalization denominator

The UnrollFlashSDPA pass in the compiler unrolls a single FlashSDPA op into a chain of tiles.

This uses the same online softmax algorithm as the Flash Attention paper, but with different tiling constraints: the GPU version is limited by shared memory capacity, while the NPU version is limited by CMX capacity.
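
A small numpy sketch of that rolling-state update (tile size and shapes are illustrative; the real kernel operates on CMX-resident tiles per attention head):

```python
import numpy as np

def flash_sdpa(q, k, v, tile=128):
    """Online-softmax attention over KV tiles for one query vector.
    q: (d,), k/v: (seq_len, d). The tile size is illustrative."""
    d = q.shape[-1]
    running_max = -np.inf
    running_sum = 0.0
    running_out = np.zeros(d)
    for start in range(0, k.shape[0], tile):
        k_t, v_t = k[start:start + tile], v[start:start + tile]
        scores = (k_t @ q) / np.sqrt(d)          # local Q*K^T for this tile
        new_max = max(running_max, scores.max())
        scale_old = np.exp(running_max - new_max)  # rescale accumulated state
        p = np.exp(scores - new_max)
        running_out = running_out * scale_old + p @ v_t
        running_sum = running_sum * scale_old + p.sum()
        running_max = new_max
    return running_out / running_sum
```

Chaining tiles this way reproduces the single-pass softmax(Q·K^T / sqrt(d))·V result exactly (up to floating-point rounding), which is why the compiler can unroll a FlashSDPA op into a tile chain without changing numerics.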

Path 3: Incremental SDPA

Suitable for the decode phase, where the query consists of only 1 token. Q × K^T degenerates from a matrix multiplication to a vector-matrix multiplication, implemented as a specially optimized SHAVE kernel that handles the mask internally.
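
A hedged numpy sketch of the decode-phase shape of the problem (the static max_len cache and in-kernel masking mirror the description above; the names are illustrative):

```python
import numpy as np

def incremental_sdpa(q, k_cache, v_cache, attention_mask):
    """Decode-phase attention: q is a single token (d,); k_cache/v_cache are the
    statically pre-allocated (max_len, d) buffers; attention_mask marks valid slots."""
    d = q.shape[-1]
    scores = (k_cache @ q) / np.sqrt(d)                       # vector-matrix product, shape (max_len,)
    scores = np.where(attention_mask.astype(bool), scores, -np.inf)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache                                  # weighted sum over valid cache entries
```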

The evolution of these three paths follows this trajectory: first the general sdpa, then the decode-specialized incremental_sdpa, then the prefill-specialized flash_sdpa — progressively specializing as optimization demands grow.

[Figure: Attention implementation path selection — how the compiler picks the optimal attention implementation for each scenario. In the prefill phase, a long sequence on hardware with a dedicated kernel uses Flash SDPA (the whole attention runs as one SHAVE kernel, with the KV cache tiled along seq_len and running_max/sum maintained for numerical stability). In the decode phase, hardware with a dedicated kernel uses Incremental SDPA (single-token query, vector-matrix multiply, mask handled in-kernel). Otherwise the compiler falls back to Decompose SDPA, splitting SDPA into basic operators with intermediate results transferred between CMX/DDR. The path is chosen at compile time from the model structure and the target hardware’s capabilities; DPU handles matrix compute, SHAVE handles vector/scalar processing.]

Tiling and Vertical Fusion

DPU Tiling

When an operation’s input/output tensors are too large to fit entirely in CMX, the compiler splits them into multiple tiles along the H (height) or C (channel) dimension, processing each tile independently:

  • Each tile corresponds to a DPUVariant (different workloads under the same DPUInvariant)
  • DpuTiler automatically determines the split strategy based on CMX capacity and hardware alignment requirements (a simplified sketch follows this list)
  • The split is transparent to the computation result — concatenating tile outputs produces exactly the same result as the unsplit version
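
A simplified sketch of the height-split decision under a CMX budget; the capacity figure, alignment requirement, and function name are illustrative, not DpuTiler’s actual algorithm:

```python
def split_over_height(height, bytes_per_row, cmx_budget, align=16):
    """Return (start, rows) tiles such that each tile fits the CMX budget
    and tile heights respect a hardware alignment requirement."""
    max_rows = max(align, (cmx_budget // bytes_per_row) // align * align)
    tiles = []
    start = 0
    while start < height:
        rows = min(max_rows, height - start)
        tiles.append((start, rows))
        start += rows
    return tiles

# e.g. a 512-row activation at 64 KB per row under a 2 MB budget -> 16 tiles of 32 rows;
# concatenating the tile outputs reproduces the unsplit result.
```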

SHAVE Tiling

TileActShaveKernelTask handles SHAVE task splitting. The key principle: prefer dimensions that do not produce strided memory access. Strided access means data is non-contiguous in memory, requiring additional DMA rearrangement operations at high cost.

Vertical Fusion

Vertical Fusion is one of the most impactful optimizations on the NPU (related to the operator fusion concept in the Graph Compilation Optimization learning path).

The PipeliningVFScheduling pass identifies consecutive operation sequences (e.g., MatMul -> RoPE -> SDPA) and fuses them into a “vertical fusion region.” The key benefit after fusion: intermediate results stay in CMX and do not need to be written back to DDR and read back.

This is particularly valuable for attention blocks: QKV projection outputs can be consumed directly in CMX by RoPE and SDPA, saving MB-scale DDR round trips.

Key difference from GPU operator fusion:

|  | NPU Vertical Fusion | GPU Operator Fusion |
| --- | --- | --- |
| What is saved | DDR round-trip transfers (MB-scale data) | Kernel launch overhead (microsecond-scale) |
| Constraints | CMX capacity | Shared memory + register pressure |
| Decision timing | Compile time | Compile time (XLA/TVM) or manual (CUDA) |

RoPE and Position IDs

Attention itself is unaware of token ordering; positional information is injected through position_ids and RoPE (Rotary Position Embedding) (for mathematical details, see the Positional Encoding article).

On the NPU, RoPE is implemented as a dedicated SHAVE kernel. The compiler’s fuse_rope pass recognizes Sin/Cos/Multiply patterns in the IR and fuses them into a single efficient RoPE operator.

A detail under static shapes: valid values in position_ids increment from 0, with padding positions filled with 0. The alignment depends on the phase:

Prefill (left-aligned — valid tokens first, padding after):

input_ids      = [t1, t2, t3, t4, 0, 0, 0, 0]
attention_mask = [1,  1,  1,  1,  0, 0, 0, 0]
position_ids   = [0,  1,  2,  3,  0, 0, 0, 0]

Generate (right-aligned — padding first, new token at the last position):

input_ids      = [0, 0, 0, 0, 0, 0, 0, token]
attention_mask = [1, 1, 1, 1, 1, 0, 0, 0]
position_ids   = [0, 0, 0, 0, 0, 0, 0, 4]

In the generate phase, the last value in position_ids is the current token’s actual (zero-based) position in the sequence — the number of previously processed tokens, i.e., one less than the number of 1s in the attention_mask.
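
A small sketch of how a host-side runtime could derive position_ids from the attention_mask for both alignments; the helper is hypothetical, not OpenVINO GenAI’s actual code:

```python
import numpy as np

def build_position_ids(attention_mask, phase):
    """attention_mask: (max_len,) of 0/1. Valid positions count up from 0;
    padding positions are filled with 0, matching the layouts above."""
    n_valid = int(attention_mask.sum())
    position_ids = np.zeros_like(attention_mask)
    if phase == "prefill":                        # left-aligned: valid tokens first
        position_ids[:n_valid] = np.arange(n_valid)
    else:                                         # generate: new token in the last slot
        position_ids[-1] = n_valid - 1
    return position_ids

# prefill:  mask [1,1,1,1,0,0,0,0] -> [0,1,2,3,0,0,0,0]
# generate: mask [1,1,1,1,1,0,0,0] -> [0,0,0,0,0,0,0,4]
```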

Practical Optimizations and Current Limitations

DynamicDataMask: The padding region contains zeros or garbage data, which is masked out for attention. But LayerNorm and reduction operations are not mask-aware — they include padding values in their computations, leading to incorrect results. The compiler uses DynamicDataMask to insert zeroing operations before these operators, ensuring padding does not corrupt the computation.
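
A toy numpy illustration of why the zeroing matters (not the compiler’s actual pass): garbage in the padding slots turns LayerNorm outputs into NaN, which can then leak through later matmuls even where the attention mask is zero, because 0 × NaN is still NaN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

hidden = np.random.randn(8, 16).astype(np.float32)   # seq_len=8, hidden=16
hidden[4:] = np.inf                                   # padding slots hold garbage data
mask = np.array([1, 1, 1, 1, 0, 0, 0, 0])

bad = layer_norm(hidden)                  # padding rows become NaN
print(np.isnan(bad).any())                # True

zeroed = hidden.copy()
zeroed[mask == 0] = 0.0                   # the inserted zeroing step
print(np.isnan(layer_norm(zeroed)).any()) # False
```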

LM Head Separation: The final layer’s vocabulary projection (hidden_size x vocab_size) is a very large matrix multiplication. When vocab_size is large (e.g., 32000+), the NPU may not be faster than the CPU. NPUW can carve this operation out and execute it on the CPU instead.
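
A hedged sketch of the carve-out idea — the function below is illustrative, not NPUW’s API: the NPU blob returns the last token’s hidden state, and the large vocabulary projection plus argmax run on the CPU.

```python
import numpy as np

def next_token_on_cpu(npu_hidden_state: np.ndarray, lm_head: np.ndarray) -> int:
    """npu_hidden_state: (hidden_size,) last-token hidden state from the NPU blob.
    lm_head: (hidden_size, vocab_size) weights kept in CPU memory, so the large
    vocabulary projection runs on the CPU instead of the NPU."""
    logits = npu_hidden_state @ lm_head
    return int(np.argmax(logits))          # greedy decoding, for illustration only
```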

Other Optimizations (Brief)

  • Prefix Caching: In multi-turn conversations, the system prompt’s KV cache can be cached and reused, skipping redundant prefill
  • Speculative Decoding support: trim_kvcache_for_speculative_decoding handles KV cache truncation for rejected tokens

Current Limitations

The following limitations all stem from the NPU’s static execution model:

| Limitation | Cause | GPU Comparison |
| --- | --- | --- |
| Fixed KV cache capacity | Blob shapes are immutable | PagedAttention pages on demand |
| batch_size = 1 | Each batch size requires a separate compilation | Continuous batching |
| KV cache transfer overhead (~512 MB) | Prefill and generate are separate blobs | Direct access within the same kernel |
| Slow cold start (multi-variant compilation) | Multiple generate blobs | Runtime JIT compilation |

Reflecting on the Programming Model: The Boundaries of ONNX and Lessons from CuTe

Four Cracks in “The Compiler Handles Everything”

Looking back at the preceding content, the NPU’s programming model is essentially “developers provide an ONNX model, and the compiler handles all optimization.” This model works well for standard scenarios, but in LLM inference it exposes four structural issues:

  1. Each attention variant requires hand-written SHAVE kernels: sdpa -> incremental_sdpa -> flash_sdpa; new variants (Sliding Window, Cross Attention, Linear Attention) must wait for the compiler team to implement them, with turnaround measured in months
  2. Side effects of static shapes require dedicated patches: DynamicDataMask is one example. A different KV cache management strategy might require new patches
  3. Generate variants are brute-force enumeration of runtime dynamism: Currently only KV cache capacity is enumerated. If batch size and beam width also need enumeration, the combinatorial explosion is severe
  4. KV cache transfer is the price of an architectural constraint: The prefill-to-generate copy exists because the two blobs do not share an address space. If operator authors could control memory layout, this copy could potentially be avoided

What ONNX Can and Cannot Express

ONNX (and OpenVINO IR) is fundamentally a computation graph description language:

Can express: What operations to perform (MatMul, Softmax, Add…) and the data dependencies between them

Cannot express:

  • Tiling strategies: The core of Flash Attention is not the Q x K x V operations themselves, but how to tile, and how tiles pass running_max and running_sum between each other
  • Memory layout preferences: [batch, heads, seq_len, dim] vs [batch, seq_len, heads, dim] for KV cache has a massive impact on DMA efficiency, but ONNX does not express this preference
  • Fusion decisions: Vertical Fusion is a huge optimization, but ONNX does not describe which intermediate results should stay on-chip

Lessons from CuTe: A Third Way

Currently, the NPU’s programming abstraction has only two extremes:

  • ONNX: Very high-level, does not express tiling — all optimization is left to the compiler
  • Raw SHAVE ASM: Very low-level, expresses everything — but the development barrier is extremely high

NVIDIA’s CuTe (the tile abstraction in CUTLASS 3.0) carved out a middle layer on the GPU: operator authors control tile sizes and loop structures, while the compiler handles DMA and hardware mapping.

Imagine if the NPU had a CuTe-style DSL:

  • Flash SDPA would not require the compiler team to hand-write SHAVE kernels; operator authors could directly write tiled attention
  • Tiling strategies could be tuned to model characteristics (long context -> large KV tiles, small models -> no tiling needed)
  • New algorithm validation cycles would shrink from months to days

| Dimension | ONNX | CuTe-style DSL |
| --- | --- | --- |
| What it expresses | Computation graph (what to do) | Tiled algorithm (how to tile, how to pass state) |
| Tiling | Not expressed; the engine handles it automatically | Operator author specifies tile shapes |
| DMA/memory transfers | Not expressed; the engine handles it automatically | Not expressed; the compiler handles it automatically |
| New operators | Wait for engine support (months) | Operator author writes directly (days) |
| Tuning space | Virtually zero | Tile sizes, loop ordering, partitioning strategies |
| Portability | Excellent (cross-hardware) | Good (across same-vendor hardware generations) |
| Development barrier | Very low (export a model) | Moderate (requires understanding tiling concepts) |

Key insight: Both CuTe and ONNX hide DMA, but CuTe exposes tiling. Tiling is the interface between algorithm and hardware — looking up, it depends on algorithmic knowledge (only the algorithm author knows how to pass state between tiles); looking down, it depends on hardware knowledge (only the compiler knows CMX capacity and DMA bandwidth).

NPU vs GPU Execution Model Comparison

NPU (Intel Meteor Lake)
  • All planning is completed at compile time
  • The blob contains every task descriptor
  • The management core reads and executes them in order
  • Zero runtime decisions — deterministic execution
  • Like playing a pre-recorded album

GPU (NVIDIA CUDA)
  • Dynamic scheduling at runtime
  • The host initiates each kernel launch
  • The warp scheduler assigns work to SMs
  • Dynamic resource allocation — adjusted on demand
  • Like a jazz band improvising live

Core concept mapping:

| NPU | GPU | Role |
| --- | --- | --- |
| CMX | Shared Memory | Fast on-chip scratchpad |
| DMA prefetch | async memcpy + streams | Hides memory latency |
| Barrier | Stream event / __syncthreads | Synchronization mechanism |
| Blob (ELF) | Kernel binary (cubin) | Compiled executable |
| Management core | Warp scheduler | Task dispatch unit |
| DPU | Tensor Cores | Fixed-function matrix compute |
| SHAVE | CUDA Cores | Programmable compute units |

A Balanced View

It is important to emphasize:

  • ONNX is the right choice for 99% of users. The optimization paths for standard operations (Conv, MatMul, attention) are already good enough
  • CuTe targets the 1% who are operator authors — but their productivity determines the pace of optimization inside inference engines
  • The two are complementary, not substitutes: ONNX users need not change anything, yet they benefit from faster SDPA implementations

The bottleneck in today’s NPU ecosystem is human capital: the number of people worldwide who can write SHAVE kernels plus npu_compiler passes may not exceed a few dozen. Model authors cannot experiment with how new attention patterns perform on the NPU — they are limited to the operators already supported and must wait for the compiler team to implement anything new. The NPU’s positioning (a low-power AI assistant in laptops) makes these limitations acceptable for now, but if it needs to support a broader model ecosystem, some form of operator programmability may be unavoidable.

Summary

This article traced a complete arc from hardware execution to the boundaries of the programming model:

  1. Execution model: A 32-layer Transformer is unrolled into a flat task list; the NPU runs fully autonomously after a single submission
  2. Pipeline overlap: Three pipelines (DMA/DPU/SHAVE) cooperate through barriers, with everything planned at compile time
  3. Attention paths: The compiler selects the optimal implementation from three paths based on inference phase and hardware capabilities
  4. Tiling and fusion: DPU/SHAVE tiling addresses CMX capacity limits; Vertical Fusion eliminates DDR round trips
  5. Programming model reflection: ONNX does not express tiling; CuTe exposes tiling — this is the intermediate abstraction layer the NPU may need in the future

The NPU’s positioning: Low-power, single-user LLM inference on laptops. batch=1 is the norm, fixed KV cache capacity covers most conversational scenarios, and the core advantage lies in low power consumption and freeing up the GPU.

The ceiling of hardware is not just compute and bandwidth — it is also how many people the programming model allows to write efficient code for it.

Further Reading