NPU Execution Model and the Boundaries of Its Programming Model
Updated 2026-04-15
Introduction
In the previous article, we saw how the NPU runs LLMs through a pre-allocated KV cache, two models (prefill + generate), and a three-layer software stack. But one key question remained unexplored: how exactly do these compiled blobs execute on the NPU hardware?
This article dives deep into the NPU’s execution model — how a 32-layer Transformer is unrolled into a flat task list, how three pipelines (DMA/DPU/SHAVE) cooperate through barriers to hide memory latency, and how the compiler selects the optimal implementation path for attention. Finally, we step back from the specifics to reflect on the ceilings of the NPU programming model: what ONNX can and cannot express, and the potential transformation that a CuTe-style DSL could bring.
Executing a 32-Layer Transformer: One Blob, No Loops
CMX Capacity Constraints
As mentioned in the NPU Architecture Overview, the NPU’s on-chip high-speed storage, CMX, has limited capacity (on the KB-to-MB scale). A single layer of a 7B-parameter LLM can have weights on the order of tens of MB — far more than CMX can hold.
The solution is streaming: load only one layer’s weights from DDR into CMX at a time, compute, release the CMX space, then load the next layer. This stands in stark contrast to GPUs, which load all weights into VRAM at once and access them directly during computation.
Compile-Time Unrolling: No Loops, Only a Task List
npu_compiler fully unrolls the 32-layer Transformer into a flat task list — potentially containing thousands of DMA + DPU + SHAVE tasks. There are no explicit boundaries between layers and no host-side for loops.
The entire blob execution flow is:
- The host loads the blob into NPU-accessible memory
- The host submits a pointer (the MappedInference struct) to the NPU
- The NPU runs fully autonomously: the management core (RISC-V on 40xx) reads task descriptors one by one, checks barrier conditions, and dispatches to DMA/DPU/SHAVE
- After all tasks complete, the management core writes the fence value to notify the host
This contrasts sharply with the GPU’s kernel launch model: on a GPU, each kernel is a host-to-device dispatch, and the host always retains control; the NPU is a single submission followed by complete hands-off until all layers finish executing.
Analogy: A GPU is like a director calling “Action” shot by shot on set. An NPU is like having the entire film’s shooting plan laid out in advance — once you hit play, the director can walk away.
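To make the barrier-driven dispatch concrete, here is a minimal Python sketch of such a loop. Everything in it (the Task tuple, the countdown semantics, the example tasks) is a toy model of the mechanism described above, not the real firmware or npu_compiler interface:

```python
# Hypothetical sketch of the management core's dispatch loop. A task waits on
# some barriers and decrements others when it completes; a barrier is
# "released" once all of its producers have finished.
from collections import namedtuple

Task = namedtuple("Task", ["engine", "waits", "updates"])  # engine: "DMA"/"DPU"/"SHAVE"

def run_blob(tasks, num_barriers):
    # Count outstanding producers per barrier (computed from the task list).
    producers = [0] * num_barriers
    for t in tasks:
        for b in t.updates:
            producers[b] += 1

    order, pending = [], list(tasks)
    while pending:
        for t in pending:
            # Dispatch only when every barrier the task waits on is released.
            if all(producers[b] == 0 for b in t.waits):
                order.append(t.engine)
                for b in t.updates:
                    producers[b] -= 1   # task completion updates the barrier
                pending.remove(t)
                break
        else:
            raise RuntimeError("deadlock: no dispatchable task")
    return order

# Two layers: DMA loads weights (releasing a barrier), DPU consumes them.
tasks = [
    Task("DMA", waits=[], updates=[0]),    # load layer 0 weights
    Task("DPU", waits=[0], updates=[1]),   # compute layer 0
    Task("DMA", waits=[], updates=[2]),    # prefetch layer 1 weights
    Task("DPU", waits=[1, 2], updates=[]), # compute layer 1
]
print(run_blob(tasks, num_barriers=3))  # ['DMA', 'DPU', 'DMA', 'DPU']
```

Note that the order falls out of the barrier dependencies alone: there is no layer loop anywhere, matching the flat task list the compiler emits.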
DMA/Compute Pipeline Overlap
The Core Optimization: Hiding Transfer Latency
If DMA, DPU, and SHAVE executed strictly in sequence (transfer data -> compute matrix ops -> compute activations -> transfer next layer…), a huge amount of time would be wasted waiting. The NPU’s key performance optimization is three parallel pipelines:
- While DMA transfers layer N+1’s weights, DPU is computing layer N
- While SHAVE executes layer N’s activation functions, DMA may already be transferring layer N+2’s data
This overlap is achieved through barrier synchronization: the management core checks each barrier’s producer/consumer counts and only dispatches subsequent tasks once their prerequisites are complete.
FeasibleMemoryScheduler: Four Core Decisions at Compile Time
This pipeline parallelism is not determined dynamically at runtime — it is fully planned at compile time by the FeasibleMemoryScheduler:
1. CMX Capacity Management
A linear scan algorithm tracks CMX usage at each time step, ensuring it never exceeds physical capacity. Each DMA-in operation increases usage; each completed computation releases the corresponding space.
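This can be sketched as a toy linear scan over per-step (alloc, free) events; the 2 MB capacity and the byte counts below are invented for illustration, not real scheduler data:

```python
# Toy linear scan of CMX pressure across a compile-time schedule. Each step
# may allocate (DMA-in) and free (computation finished) some bytes.
CMX_CAPACITY = 2 * 1024 * 1024  # assumed 2 MB

def peak_usage(events):
    """events: list of (alloc_bytes, free_bytes), one tuple per scheduled step."""
    usage, peak = 0, 0
    for alloc, free in events:
        usage += alloc - free
        assert usage <= CMX_CAPACITY, "schedule exceeds CMX -- must spill"
        peak = max(peak, usage)
    return peak

schedule = [
    (1_000_000, 0),   # DMA in layer N weights
    (500_000, 0),     # DMA in activations
    (0, 1_000_000),   # layer N done, release its weights
    (900_000, 0),     # prefetch layer N+1 weights
]
print(peak_usage(schedule))  # 1500000 bytes at the high-water mark
```

If the assert fires, a real scheduler would respond by spilling or reducing prefetch depth, which is exactly the interplay of decisions 2 and 3 below.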
2. Prefetch Depth
The prefetchingLevelLimit parameter controls how many steps ahead DMA begins transferring data. Greater prefetch depth means better latency hiding but also higher CMX pressure — a classic trade-off.
3. Dynamic Spilling
When CMX is full but new data needs to be loaded, the compiler selects data that is not immediately needed and temporarily writes it back to DDR (spill out), freeing CMX space. When needed later, it is loaded back (spill in).
4. Ping-Pong Buffering
Two CMX buffers alternate: one is read by DPU/SHAVE for computation while DMA writes new data into the other. Computation and transfer fully overlap with no waiting required.
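The alternation can be sketched in a few lines; the layer count and strings are placeholders, and the point is only the alternating index arithmetic:

```python
# Minimal ping-pong sketch: two CMX buffer slots alternate between "being
# filled by DMA" and "being read by compute".
def ping_pong(num_layers):
    buffers = [None, None]
    log = []
    buffers[0] = "weights[0]"   # prologue: the first transfer cannot overlap
    for layer in range(num_layers):
        cur = layer % 2         # buffer the compute engines read from
        nxt = (layer + 1) % 2   # buffer DMA fills in the background
        if layer + 1 < num_layers:
            buffers[nxt] = f"weights[{layer + 1}]"  # overlapped prefetch
        log.append(f"compute layer {layer} from buf{cur} ({buffers[cur]})")
    return log

for line in ping_pong(4):
    print(line)
```

Each compute step reads one buffer while the next layer's weights land in the other, so apart from the prologue, transfer time disappears behind compute time whenever the transfer is the shorter of the two.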
Comparison with GPU
There is an interesting symmetry here (related to the CUDA Programming Model):
| | NPU | GPU |
|---|---|---|
| Goal | Hide DDR->CMX transfer latency | Hide Global Memory->SM latency |
| Mechanism | DMA prefetch (planned at compile time) | Stream + async memcpy (scheduled at runtime) |
| Decision timing | Fully determined at compile time | Dynamically decided at runtime by the warp scheduler |
| Analogy | Playing a pre-recorded vinyl record | A jazz band improvising live |
The concept is the same — hiding memory latency. But the mechanisms are opposite — GPUs use dynamic runtime scheduling, while NPUs have everything planned at compile time (zero runtime decisions).
Three Attention Paths on the NPU
Attention is the core computation of the Transformer and the part that the NPU compiler must optimize most carefully. Depending on the inference phase and hardware capabilities, the compiler chooses among three implementation paths:
Path 1: Decompose SDPA
The fallback path when the target hardware does not support a dedicated attention operator. It decomposes Scaled Dot-Product Attention into a sequence of independent operations:
- Q x K^T -> DPU (NCE MatMul)
- Scale (x 1/sqrt(d)) -> DPU (PPE, fused in MatMul post-processing)
- Mask add -> DPU/SHAVE (PPE can only add per-channel constants; a 2D mask requires SHAVE)
- Softmax -> SHAVE (no corresponding DPU fixed-function hardware)
- x V -> DPU (NCE MatMul)
Intermediate results must be transferred between CMX/DDR (unless Vertical Fusion is active — see the next section).
Path 2: Flash SDPA
Suitable for long sequences in the prefill phase. The entire attention computation runs as a single SHAVE kernel, with mask handling fused internally.
The core idea: tile the KV cache along the seq_len dimension, compute local attention independently for each tile, while maintaining three rolling states:
- running_output: the accumulated weighted output
- running_max: the current maximum (for numerically stable online softmax)
- running_sum: the normalization denominator
The UnrollFlashSDPA pass in the compiler unrolls a single FlashSDPA op into a chain of tiles.
This uses the same online softmax algorithm as the Flash Attention paper, but with different tiling constraints: the GPU version is limited by shared memory capacity, while the NPU version is limited by CMX capacity.
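The rolling-state update can be sketched in NumPy. This follows the online-softmax recurrence described above (rescale the old accumulators whenever the running max changes); the shapes, tile size, and function names are illustrative, not the SHAVE kernel's actual interface:

```python
import numpy as np

def flash_sdpa(q, K, V, tile=4):
    """Single-head attention for one query, computed over KV tiles."""
    d = q.shape[0]
    m = -np.inf               # running_max
    s = 0.0                   # running_sum (softmax denominator)
    o = np.zeros(V.shape[1])  # running_output, kept unnormalized
    for start in range(0, K.shape[0], tile):
        k, v = K[start:start + tile], V[start:start + tile]
        scores = k @ q / np.sqrt(d)
        m_new = max(m, scores.max())
        c = np.exp(m - m_new)        # rescale factor for the old accumulators
        p = np.exp(scores - m_new)   # local shifted exponentials
        o = o * c + p @ v
        s = s * c + p.sum()
        m = m_new
    return o / s                     # normalize once at the end

def reference_sdpa(q, K, V):
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K, V = rng.normal(size=(12, 8)), rng.normal(size=(12, 8))
print(np.allclose(flash_sdpa(q, K, V, tile=4), reference_sdpa(q, K, V)))  # True
```

Comparing against a straightforward softmax confirms that the tiling does not change the result, only the memory footprint: each iteration touches one KV tile instead of the full sequence.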
Path 3: Incremental SDPA
Suitable for the decode phase, where the query consists of only 1 token. Q x K^T degenerates from a matrix multiplication to a vector-matrix multiplication, which can be implemented as a specially optimized SHAVE kernel with mask handling inside the kernel.
The evolution of these three paths follows this trajectory: first the general sdpa, then the decode-specialized incremental_sdpa, then the prefill-specialized flash_sdpa — progressively specializing as optimization demands grow.
Tiling and Vertical Fusion
DPU Tiling
When an operation’s input/output tensors are too large to fit entirely in CMX, the compiler splits them into multiple tiles along the H (height) or C (channel) dimension, processing each tile independently:
- Each tile corresponds to a DPUVariant (different workloads under the same DPUInvariant)
- DpuTiler automatically determines the split strategy based on CMX capacity and hardware alignment requirements
- The split is transparent to the computation result — concatenating tile outputs produces exactly the same result as the unsplit version
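The transparency claim is easy to check numerically. This NumPy sketch tiles a matmul along H with an arbitrary tile size (not real DpuTiler output) and compares against the unsplit result:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(64, 128))  # activation, H x C_in
w = rng.normal(size=(128, 32))  # weights,   C_in x C_out

full = x @ w
# Four H-tiles of 16 rows each; each tile is an independent workload.
tiled = np.concatenate([x[h:h + 16] @ w for h in range(0, 64, 16)])
print(np.allclose(full, tiled))  # True
```

Splitting along H works because each output row depends only on the matching input rows; splitting along C instead requires summing partial products, which is why the tiler treats the two dimensions differently.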
SHAVE Tiling
TileActShaveKernelTask handles SHAVE task splitting. The key principle: prefer dimensions that do not produce strided memory access. Strided access means data is non-contiguous in memory, requiring additional DMA rearrangement operations at high cost.
Vertical Fusion
Vertical Fusion is one of the most impactful optimizations on the NPU (related to the operator fusion concept in the Graph Compilation Optimization learning path).
The PipeliningVFScheduling pass identifies consecutive operation sequences (e.g., MatMul -> RoPE -> SDPA) and fuses them into a “vertical fusion region.” The key benefit after fusion: intermediate results stay in CMX and do not need to be written back to DDR and read back.
This is particularly valuable for attention blocks: QKV projection outputs can be consumed directly in CMX by RoPE and SDPA, saving MB-scale DDR round trips.
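A quick back-of-envelope illustrates the scale, assuming a 4096-hidden-dim model, a 1024-token prefill, and fp16 activations (all assumed numbers, not measurements):

```python
# Assumed shapes: 1024-token prefill, hidden size 4096, fp16 (2 bytes/element).
seq_len, hidden, bytes_per_elem = 1024, 4096, 2
intermediate = seq_len * hidden * bytes_per_elem  # one activation tensor: 8 MiB
round_trip = 2 * intermediate                     # DDR write-back + read-back
print(f"{round_trip / 2**20:.0f} MiB of DDR traffic avoided per fused edge")
```

Even one fused edge avoids a 16 MiB round trip under these assumptions; an attention block has several such edges, which is why Vertical Fusion pays off so heavily here.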
Key difference from GPU operator fusion:
| | NPU Vertical Fusion | GPU Operator Fusion |
|---|---|---|
| What is saved | DDR round-trip transfers (MB-scale data) | Kernel launch overhead (microsecond-scale) |
| Constraints | CMX capacity | Shared memory + register pressure |
| Decision timing | Compile time | Compile time (XLA/TVM) or manual (CUDA) |
RoPE and Position IDs
Attention itself is unaware of token ordering; positional information is injected through position_ids and RoPE (Rotary Position Embedding) (for mathematical details, see the Positional Encoding article).
On the NPU, RoPE is implemented as a dedicated SHAVE kernel. The compiler’s fuse_rope pass recognizes Sin/Cos/Multiply patterns in the IR and fuses them into a single efficient RoPE operator.
A detail under static shapes: valid values in position_ids increment from 0, with padding positions filled with 0. The alignment depends on the phase:
Prefill (left-aligned — valid tokens first, padding after):
input_ids = [t1, t2, t3, t4, 0, 0, 0, 0]
attention_mask = [1, 1, 1, 1, 0, 0, 0, 0]
position_ids = [0, 1, 2, 3, 0, 0, 0, 0]
Generate (right-aligned — padding first, new token at the last position):
input_ids = [0, 0, 0, 0, 0, 0, 0, token]
attention_mask = [1, 1, 1, 1, 1, 0, 0, 0]
position_ids = [0, 0, 0, 0, 0, 0, 0, 4]
In the generate phase, the last value in position_ids equals the current token’s actual position in the sequence, i.e., the number of 1s in the attention_mask minus one (positions are 0-based).
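Both layouts can be derived from the attention_mask alone. A host-side sketch (the helper names are mine, not an actual runtime API):

```python
def prefill_position_ids(mask):
    # Left-aligned: positions 0..n-1 for the n valid tokens, 0 for padding.
    n = sum(mask)
    return [i if i < n else 0 for i in range(len(mask))]

def generate_position_ids(mask):
    # Right-aligned: only the last slot holds a real token; its position is
    # the number of 1s in the mask minus one (positions are 0-based).
    ids = [0] * len(mask)
    ids[-1] = sum(mask) - 1
    return ids

print(prefill_position_ids([1, 1, 1, 1, 0, 0, 0, 0]))   # [0, 1, 2, 3, 0, 0, 0, 0]
print(generate_position_ids([1, 1, 1, 1, 1, 0, 0, 0]))  # [0, 0, 0, 0, 0, 0, 0, 4]
```

The two example calls reproduce the prefill and generate layouts shown above.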
Practical Optimizations and Current Limitations
Optimizations Directly Related to the Static Execution Model
DynamicDataMask: The padding region contains zeros or garbage data, which is masked out for attention. But LayerNorm and reduction operations are not mask-aware — they include padding values in their computations, leading to incorrect results. The compiler uses DynamicDataMask to insert zeroing operations before these operators, ensuring padding does not corrupt the computation.
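A small NumPy illustration of the failure mode, using NaN as a stand-in for garbage in the padding slots (the zeroing step mimics what the text attributes to DynamicDataMask; it is not the compiler's actual pass):

```python
import numpy as np

# Two valid tokens plus one padding slot holding garbage (NaN stands in for
# uninitialized memory).
hidden = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [np.nan, np.nan]])
mask = np.array([1, 1, 0])

naive = hidden.sum(axis=0)                        # padding garbage poisons the reduction
zeroed = np.where(mask[:, None] > 0, hidden, 0.0) # zero the padding region first
fixed = zeroed.sum(axis=0)                        # padding now contributes exact zeros
print(naive)  # [nan nan]
print(fixed)  # [4. 6.]
```

Note that np.where is used rather than multiplying by the mask: NaN times zero is still NaN, so genuine zeroing, not masking by multiplication, is required.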
LM Head Separation: The final layer’s vocabulary projection (hidden_size x vocab_size) is a very large matrix multiplication. When vocab_size is large (e.g., 32000+), the NPU may not be faster than the CPU. NPUW can carve this operation out and execute it on the CPU instead.
Other Optimizations (Brief)
- Prefix Caching: In multi-turn conversations, the system prompt’s KV cache can be cached and reused, skipping redundant prefill
- Speculative Decoding support: trim_kvcache_for_speculative_decoding handles KV cache truncation for rejected tokens
Current Limitations
The following limitations all stem from the NPU’s static execution model:
| Limitation | Cause | GPU Comparison |
|---|---|---|
| Fixed KV cache capacity | Blob shapes are immutable | PagedAttention pages on demand |
| batch_size = 1 | Each batch size requires separate compilation | Continuous batching |
| KV cache transfer overhead (~512MB) | Prefill and generate are separate blobs | Direct access within the same kernel |
| Slow cold start (multi-variant compilation) | Multiple generate blobs | Runtime JIT compilation |
Reflecting on the Programming Model: The Boundaries of ONNX and Lessons from CuTe
Four Cracks in “The Compiler Handles Everything”
Looking back at the preceding content, the NPU’s programming model is essentially “developers provide an ONNX model, and the compiler handles all optimization.” This model works well for standard scenarios, but in LLM inference it exposes four structural issues:
- Each attention variant requires hand-written SHAVE kernels: sdpa -> incremental_sdpa -> flash_sdpa; new variants (Sliding Window, Cross Attention, Linear Attention) must wait for the compiler team to implement them, with turnaround measured in months
- Side effects of static shapes require dedicated patches: DynamicDataMask is one example. A different KV cache management strategy might require new patches
- Generate variants are brute-force enumeration of runtime dynamism: Currently only KV cache capacity is enumerated. If batch size and beam width also need enumeration, the combinatorial explosion is severe
- KV cache transfer is the price of an architectural constraint: The prefill-to-generate copy exists because the two blobs do not share an address space. If operator authors could control memory layout, this copy could potentially be avoided
What ONNX Can and Cannot Express
ONNX (and OpenVINO IR) is fundamentally a computation graph description language:
Can express: What operations to perform (MatMul, Softmax, Add…) and the data dependencies between them
Cannot express:
- Tiling strategies: The core of Flash Attention is not the Q x K x V operations themselves, but how to tile, and how tiles pass running_max and running_sum between each other
- Memory layout preferences: [batch, heads, seq_len, dim] vs [batch, seq_len, heads, dim] for the KV cache has a massive impact on DMA efficiency, but ONNX does not express this preference
- Fusion decisions: Vertical Fusion is a huge optimization, but ONNX does not describe which intermediate results should stay on-chip
Lessons from CuTe: A Third Way
Currently, the NPU’s programming abstraction has only two extremes:
- ONNX: Very high-level, does not express tiling — all optimization is left to the compiler
- Raw SHAVE ASM: Very low-level, expresses everything — but the development barrier is extremely high
NVIDIA’s CuTe (the tile abstraction in CUTLASS 3.0) carved out a middle layer on the GPU: operator authors control tile sizes and loop structures, while the compiler handles DMA and hardware mapping.
Imagine if the NPU had a CuTe-style DSL:
- Flash SDPA would not require the compiler team to hand-write SHAVE kernels; operator authors could directly write tiled attention
- Tiling strategies could be tuned to model characteristics (long context -> large KV tiles, small models -> no tiling needed)
- New algorithm validation cycles would shrink from months to days
| Dimension | ONNX | CuTe-style DSL |
|---|---|---|
| What it expresses | Computation graph (what to do) | Tiled algorithm (how to tile, how to pass state) |
| Tiling | Not expressed, engine handles automatically | Operator author specifies tile shapes |
| DMA/memory transfers | Not expressed, engine handles automatically | Not expressed, compiler handles automatically |
| New operators | Wait for engine support (months) | Operator author writes directly (days) |
| Tuning space | Virtually zero | Tile sizes, loop ordering, partitioning strategies |
| Portability | Excellent (cross-hardware) | Good (across same-vendor hardware generations) |
| Development barrier | Very low (export model) | Moderate (requires understanding tiling concepts) |
Key insight: Both CuTe and ONNX hide DMA, but CuTe exposes tiling. Tiling is the interface between algorithm and hardware — looking up, it depends on algorithmic knowledge (only the algorithm author knows how to pass state between tiles); looking down, it depends on hardware knowledge (only the compiler knows CMX capacity and DMA bandwidth).
NPU vs GPU Execution Model Comparison
NPU:
- All planning completed at compile time
- The blob contains all task descriptors
- The management core reads and executes them sequentially
- Zero runtime decisions (deterministic execution)
- 🎵 Like playing a pre-recorded album
GPU:
- Dynamic scheduling at runtime
- The host initiates kernel launches
- The warp scheduler assigns work to SMs
- Dynamic resource allocation (adjusted on demand)
- 🎷 Like a jazz band improvising live
| NPU | GPU | Notes |
|---|---|---|
| CMX | Shared Memory | High-speed on-chip scratchpad |
| DMA prefetch | async memcpy + streams | Hides memory latency |
| Barrier | Stream event / __syncthreads | Synchronization mechanism |
| Blob (ELF) | Kernel binary (cubin) | Compiled executable |
| Management core | Warp scheduler | Task dispatch unit |
| DPU | Tensor Cores | Fixed-function matrix ops |
| SHAVE | CUDA Cores | Programmable compute units |
A Balanced View
It is important to emphasize:
- ONNX is the right choice for 99% of users. The optimization paths for standard operations (Conv, MatMul, attention) are already good enough
- CuTe targets the 1% who are operator authors — but their productivity determines the pace of optimization inside inference engines
- The two are complementary, not substitutes: ONNX users need not change anything, yet they benefit from faster SDPA implementations
The bottleneck in today’s NPU ecosystem is human capital: the number of people worldwide who can write SHAVE kernels + npu_compiler passes may not exceed a few dozen. Model authors cannot experiment with how new attention patterns perform on the NPU — they can only wait for the compiler team to implement the operators that are already supported. The NPU’s positioning (a low-power AI assistant in laptops) makes these limitations acceptable for now, but if it needs to support a broader model ecosystem, some form of operator programmability may be unavoidable.
Summary
This article traced a complete arc from hardware execution to the boundaries of the programming model:
- Execution model: A 32-layer Transformer is unrolled into a flat task list; the NPU runs fully autonomously after a single submission
- Pipeline overlap: Three pipelines (DMA/DPU/SHAVE) cooperate through barriers, with everything planned at compile time
- Attention paths: The compiler selects the optimal implementation from three paths based on inference phase and hardware capabilities
- Tiling and fusion: DPU/SHAVE tiling addresses CMX capacity limits; Vertical Fusion eliminates DDR round trips
- Programming model reflection: ONNX does not express tiling; CuTe exposes tiling — this is the intermediate abstraction layer the NPU may need in the future
The NPU’s positioning: Low-power, single-user LLM inference on laptops. batch=1 is the norm, fixed KV cache capacity covers most conversational scenarios, and the core advantage lies in low power consumption and freeing up the GPU.
The ceiling of hardware is not just compute and bandwidth — it is also how many people the programming model allows to write efficient code for it.
Further Reading
- The FeasibleMemoryScheduler in the npu_compiler source is the best entry point for understanding compile-time scheduling decisions.
- The Flash Attention paper describes the original design of the online softmax algorithm and tiling strategy.
- CUTLASS 3.0 / CuTe demonstrates how tile abstraction on the GPU decouples algorithm from hardware.
- The CUDA Programming Model and the Graph Compilation Optimization learning path provide a GPU-side comparative perspective.