NPU Execution Model and the Boundaries of Its Programming Model
Updated 2026-04-15
Introduction
In the previous article, we saw how the NPU runs LLMs through a pre-allocated KV cache, two models (prefill + generate), and a three-layer software stack. But one key question remained unexplored: how exactly do these compiled blobs execute on the NPU hardware?
This article dives deep into the NPU’s execution model — how a 32-layer Transformer is unrolled into a flat task list, how three pipelines (DMA/DPU/SHAVE) cooperate through barriers to hide memory latency, and how the compiler selects the optimal implementation path for attention. Finally, we step back from the specifics to reflect on the ceilings of the NPU programming model: what ONNX can and cannot express, and the potential transformation that a CuTe-style DSL could bring.
Executing a 32-Layer Transformer: One Blob, No Loops
CMX Capacity Constraints
As mentioned in the NPU Architecture Overview, the NPU’s on-chip high-speed storage, CMX, has limited capacity (on the KB-to-MB scale). A single layer of a 7B-parameter LLM can have weights on the order of tens of MB — far more than CMX can hold.
The solution is streaming: load only one layer’s weights from DDR into CMX at a time, compute, release the CMX space, then load the next layer. This stands in stark contrast to GPUs, which load all weights into VRAM at once and access them directly during computation.
Compile-Time Unrolling: No Loops, Only a Task List
npu_compiler fully unrolls the 32-layer Transformer into a flat task list — potentially containing thousands of DMA + DPU + SHAVE tasks. There are no explicit boundaries between layers and no host-side for loops.
The entire blob execution flow is:
- The host loads the blob into NPU-accessible memory
- The host submits a pointer (the MappedInference struct) to the NPU
- The NPU runs fully autonomously: the management core (RISC-V on 40xx) reads task descriptors one by one, checks barrier conditions, and dispatches to DMA/DPU/SHAVE
- After all tasks complete, the management core writes the fence value to notify the host
This contrasts sharply with the GPU’s kernel launch model: on a GPU, each kernel is a host-to-device dispatch, and the host always retains control; the NPU is a single submission followed by complete hands-off until all layers finish executing.
Analogy: A GPU is like a director calling “Action” shot by shot on set. An NPU is like having the entire film’s shooting plan laid out in advance — once you hit play, the director can walk away.
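To make the barrier-driven dispatch concrete, here is a minimal Python sketch of such a loop. Everything in it (the Task tuple, the countdown semantics, the example tasks) is a toy model of the mechanism described above, not the real firmware or npu_compiler interface:

```python
# Hypothetical sketch of the management core's dispatch loop. A task waits on
# some barriers and decrements others when it completes; a barrier is
# "released" once all of its producers have finished.
from collections import namedtuple

Task = namedtuple("Task", ["engine", "waits", "updates"])  # engine: "DMA"/"DPU"/"SHAVE"

def run_blob(tasks, num_barriers):
    # Count outstanding producers per barrier (computed from the task list).
    producers = [0] * num_barriers
    for t in tasks:
        for b in t.updates:
            producers[b] += 1

    order, pending = [], list(tasks)
    while pending:
        for t in pending:
            # Dispatch only when every barrier the task waits on is released.
            if all(producers[b] == 0 for b in t.waits):
                order.append(t.engine)
                for b in t.updates:
                    producers[b] -= 1   # task completion updates the barrier
                pending.remove(t)
                break
        else:
            raise RuntimeError("deadlock: no dispatchable task")
    return order

# Two layers: DMA loads weights (releasing a barrier), DPU consumes them.
tasks = [
    Task("DMA", waits=[], updates=[0]),    # load layer 0 weights
    Task("DPU", waits=[0], updates=[1]),   # compute layer 0
    Task("DMA", waits=[], updates=[2]),    # prefetch layer 1 weights
    Task("DPU", waits=[1, 2], updates=[]), # compute layer 1
]
print(run_blob(tasks, num_barriers=3))  # ['DMA', 'DPU', 'DMA', 'DPU']
```

Note that the order falls out of the barrier dependencies alone: there is no layer loop anywhere, matching the flat task list the compiler emits.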
DMA/Compute Pipeline Overlap
The Core Optimization: Hiding Transfer Latency
If DMA, DPU, and SHAVE executed strictly in sequence (transfer data -> compute matrix ops -> compute activations -> transfer next layer…), a huge amount of time would be wasted waiting. The NPU’s key performance optimization is three parallel pipelines:
- While DMA transfers layer N+1’s weights, DPU is computing layer N
- While SHAVE executes layer N’s activation functions, DMA may already be transferring layer N+2’s data
This overlap is achieved through barrier synchronization: the management core checks each barrier’s producer/consumer counts and only dispatches subsequent tasks once their prerequisites are complete.
FeasibleMemoryScheduler: Four Core Decisions at Compile Time
This pipeline parallelism is not determined dynamically at runtime — it is fully planned at compile time by the FeasibleMemoryScheduler:
1. CMX Capacity Management
A linear scan algorithm tracks CMX usage at each time step, ensuring it never exceeds physical capacity. Each DMA-in operation increases usage; each completed computation releases the corresponding space.
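This can be sketched as a toy linear scan over per-step (alloc, free) events; the 2 MB capacity and the byte counts below are invented for illustration, not real scheduler data:

```python
# Toy linear scan of CMX pressure across a compile-time schedule. Each step
# may allocate (DMA-in) and free (computation finished) some bytes.
CMX_CAPACITY = 2 * 1024 * 1024  # assumed 2 MB

def peak_usage(events):
    """events: list of (alloc_bytes, free_bytes), one tuple per scheduled step."""
    usage, peak = 0, 0
    for alloc, free in events:
        usage += alloc - free
        assert usage <= CMX_CAPACITY, "schedule exceeds CMX -- must spill"
        peak = max(peak, usage)
    return peak

schedule = [
    (1_000_000, 0),   # DMA in layer N weights
    (500_000, 0),     # DMA in activations
    (0, 1_000_000),   # layer N done, release its weights
    (900_000, 0),     # prefetch layer N+1 weights
]
print(peak_usage(schedule))  # 1500000 bytes at the high-water mark
```

If the assert fires, a real scheduler would respond by spilling or reducing prefetch depth, which is exactly the interplay of decisions 2 and 3 below.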
2. Prefetch Depth
The prefetchingLevelLimit parameter controls how many steps ahead DMA begins transferring data. Greater prefetch depth means better latency hiding but also higher CMX pressure — a classic trade-off.
3. Dynamic Spilling
When CMX is full but new data needs to be loaded, the compiler selects data that is not immediately needed and temporarily writes it back to DDR (spill out), freeing CMX space. When needed later, it is loaded back (spill in).
4. Ping-Pong Buffering
Two CMX buffers alternate: one is read by DPU/SHAVE for computation while DMA writes new data into the other. Computation and transfer fully overlap with no waiting required.
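The alternation can be sketched in a few lines; the layer count and strings are placeholders, and the point is only the alternating index arithmetic:

```python
# Minimal ping-pong sketch: two CMX buffer slots alternate between "being
# filled by DMA" and "being read by compute".
def ping_pong(num_layers):
    buffers = [None, None]
    log = []
    buffers[0] = "weights[0]"   # prologue: the first transfer cannot overlap
    for layer in range(num_layers):
        cur = layer % 2         # buffer the compute engines read from
        nxt = (layer + 1) % 2   # buffer DMA fills in the background
        if layer + 1 < num_layers:
            buffers[nxt] = f"weights[{layer + 1}]"  # overlapped prefetch
        log.append(f"compute layer {layer} from buf{cur} ({buffers[cur]})")
    return log

for line in ping_pong(4):
    print(line)
```

Each compute step reads one buffer while the next layer's weights land in the other, so apart from the prologue, transfer time disappears behind compute time whenever the transfer is the shorter of the two.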
Comparison with GPU
There is an interesting symmetry here (related to the CUDA Programming Model):
| | NPU | GPU |
|---|---|---|
| Goal | Hide DDR->CMX transfer latency | Hide Global Memory->SM latency |
| Mechanism | DMA prefetch (planned at compile time) | Stream + async memcpy (scheduled at runtime) |
| Decision timing | Fully determined at compile time | Dynamically decided at runtime by the warp scheduler |
| Analogy | Playing a pre-recorded vinyl record | A jazz band improvising live |
The concept is the same — hiding memory latency. But the mechanisms are opposite — GPUs use dynamic runtime scheduling, while NPUs have everything planned at compile time (zero runtime decisions).
Three Attention Paths on the NPU
Attention is the core computation of the Transformer and the part that the NPU compiler must optimize most carefully. Depending on the inference phase and hardware capabilities, the compiler chooses among three implementation paths:
Path 1: Decompose SDPA
The fallback path when the target hardware does not support a dedicated attention operator. It decomposes Scaled Dot-Product Attention into a sequence of independent operations:
- Q x K^T -> DPU (NCE MatMul)
- Scale (x 1/sqrt(d)) -> DPU (PPE, fused in MatMul post-processing)
- Mask add -> DPU/SHAVE (PPE can only add per-channel constants; a 2D mask requires SHAVE)
- Softmax -> SHAVE (no corresponding DPU fixed-function hardware)
- x V -> DPU (NCE MatMul)
Intermediate results must be transferred between CMX/DDR (unless Vertical Fusion is active — see the next section).
Path 2: Flash SDPA
Suitable for long sequences in the prefill phase. The entire attention computation runs as a single SHAVE kernel, with mask handling fused internally.
The core idea: tile the KV cache along the seq_len dimension, compute local attention independently for each tile, while maintaining three rolling states:
- running_output: the accumulated weighted output
- running_max: the current maximum (for numerically stable online softmax)
- running_sum: the normalization denominator
The UnrollFlashSDPA pass in the compiler unrolls a single FlashSDPA op into a chain of tiles.
This uses the same online softmax algorithm as the Flash Attention paper, but with different tiling constraints: the GPU version is limited by shared memory capacity, while the NPU version is limited by CMX capacity.
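The rolling-state update can be sketched in NumPy. This follows the online-softmax recurrence described above (rescale the old accumulators whenever the running max changes); the shapes, tile size, and function names are illustrative, not the SHAVE kernel's actual interface:

```python
import numpy as np

def flash_sdpa(q, K, V, tile=4):
    """Single-head attention for one query, computed over KV tiles."""
    d = q.shape[0]
    m = -np.inf               # running_max
    s = 0.0                   # running_sum (softmax denominator)
    o = np.zeros(V.shape[1])  # running_output, kept unnormalized
    for start in range(0, K.shape[0], tile):
        k, v = K[start:start + tile], V[start:start + tile]
        scores = k @ q / np.sqrt(d)
        m_new = max(m, scores.max())
        c = np.exp(m - m_new)        # rescale factor for the old accumulators
        p = np.exp(scores - m_new)   # local shifted exponentials
        o = o * c + p @ v
        s = s * c + p.sum()
        m = m_new
    return o / s                     # normalize once at the end

def reference_sdpa(q, K, V):
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K, V = rng.normal(size=(12, 8)), rng.normal(size=(12, 8))
print(np.allclose(flash_sdpa(q, K, V, tile=4), reference_sdpa(q, K, V)))  # True
```

Comparing against a straightforward softmax confirms that the tiling does not change the result, only the memory footprint: each iteration touches one KV tile instead of the full sequence.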
Path 3: Incremental SDPA
Suitable for the decode phase, where the query consists of only 1 token. Q x K^T degenerates from a matrix multiplication to a vector-matrix multiplication, which can be implemented as a specially optimized SHAVE kernel with mask handling inside the kernel.
The evolution of these three paths follows this trajectory: first the general sdpa, then the decode-specialized incremental_sdpa, then the prefill-specialized flash_sdpa — progressively specializing as optimization demands grow.
Tiling and Vertical Fusion
DPU Tiling
When an operation’s input/output tensors are too large to fit entirely in CMX, the compiler splits them into multiple tiles along the H (height) or C (channel) dimension, processing each tile independently:
- Each tile corresponds to a DPUVariant (different workloads under the same DPUInvariant)
- DpuTiler automatically determines the split strategy based on CMX capacity and hardware alignment requirements
- The split is transparent to the computation result — concatenating tile outputs produces exactly the same result as the unsplit version
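The transparency claim is easy to check numerically. This NumPy sketch tiles a matmul along H with an arbitrary tile size (not real DpuTiler output) and compares against the unsplit result:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(64, 128))  # activation, H x C_in
w = rng.normal(size=(128, 32))  # weights,   C_in x C_out

full = x @ w
# Four H-tiles of 16 rows each; each tile is an independent workload.
tiled = np.concatenate([x[h:h + 16] @ w for h in range(0, 64, 16)])
print(np.allclose(full, tiled))  # True
```

Splitting along H works because each output row depends only on the matching input rows; splitting along C instead requires summing partial products, which is why the tiler treats the two dimensions differently.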
SHAVE Tiling
TileActShaveKernelTask handles SHAVE task splitting. The key principle: prefer dimensions that do not produce strided memory access. Strided access means data is non-contiguous in memory, requiring additional DMA rearrangement operations at high cost.
Vertical Fusion
Vertical Fusion is one of the most impactful optimizations on the NPU (related to the operator fusion concept in the Graph Compilation Optimization learning path).
The PipeliningVFScheduling pass identifies consecutive operation sequences (e.g., MatMul -> RoPE -> SDPA) and fuses them into a “vertical fusion region.” The key benefit after fusion: intermediate results stay in CMX and do not need to be written back to DDR and read back.
This is particularly valuable for attention blocks: QKV projection outputs can be consumed directly in CMX by RoPE and SDPA, saving MB-scale DDR round trips.
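A quick back-of-envelope illustrates the scale, assuming a 4096-hidden-dim model, a 1024-token prefill, and fp16 activations (all assumed numbers, not measurements):

```python
# Assumed shapes: 1024-token prefill, hidden size 4096, fp16 (2 bytes/element).
seq_len, hidden, bytes_per_elem = 1024, 4096, 2
intermediate = seq_len * hidden * bytes_per_elem  # one activation tensor: 8 MiB
round_trip = 2 * intermediate                     # DDR write-back + read-back
print(f"{round_trip / 2**20:.0f} MiB of DDR traffic avoided per fused edge")
```

Even one fused edge avoids a 16 MiB round trip under these assumptions; an attention block has several such edges, which is why Vertical Fusion pays off so heavily here.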
Key difference from GPU operator fusion:
| | NPU Vertical Fusion | GPU Operator Fusion |
|---|---|---|
| What is saved | DDR round-trip transfers (MB-scale data) | Kernel launch overhead (microsecond-scale) |
| Constraints | CMX capacity | Shared memory + register pressure |
| Decision timing | Compile time | Compile time (XLA/TVM) or manual (CUDA) |
RoPE and Position IDs
Attention itself is unaware of token ordering; positional information is injected through position_ids and RoPE (Rotary Position Embedding) (for mathematical details, see the Positional Encoding article).
On the NPU, RoPE is implemented as a dedicated SHAVE kernel. The compiler’s fuse_rope pass recognizes Sin/Cos/Multiply patterns in the IR and fuses them into a single efficient RoPE operator.
A detail under static shapes: valid values in position_ids increment from 0, with padding positions filled with 0. The alignment depends on the phase:
Prefill (left-aligned — valid tokens first, padding after):
input_ids = [t1, t2, t3, t4, 0, 0, 0, 0]
attention_mask = [1, 1, 1, 1, 0, 0, 0, 0]
position_ids = [0, 1, 2, 3, 0, 0, 0, 0]
Generate (right-aligned — padding first, new token at the last position):
input_ids = [0, 0, 0, 0, 0, 0, 0, token]
attention_mask = [1, 1, 1, 1, 1, 0, 0, 0]
position_ids = [0, 0, 0, 0, 0, 0, 0, 4]
In the generate phase, the last value in position_ids equals the current token’s actual position in the sequence, i.e., the number of 1s in the attention_mask minus one (positions are 0-based).
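Both layouts can be derived from the attention_mask alone. A host-side sketch (the helper names are mine, not an actual runtime API):

```python
def prefill_position_ids(mask):
    # Left-aligned: positions 0..n-1 for the n valid tokens, 0 for padding.
    n = sum(mask)
    return [i if i < n else 0 for i in range(len(mask))]

def generate_position_ids(mask):
    # Right-aligned: only the last slot holds a real token; its position is
    # the number of 1s in the mask minus one (positions are 0-based).
    ids = [0] * len(mask)
    ids[-1] = sum(mask) - 1
    return ids

print(prefill_position_ids([1, 1, 1, 1, 0, 0, 0, 0]))   # [0, 1, 2, 3, 0, 0, 0, 0]
print(generate_position_ids([1, 1, 1, 1, 1, 0, 0, 0]))  # [0, 0, 0, 0, 0, 0, 0, 4]
```

The two example calls reproduce the prefill and generate layouts shown above.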
Practical Optimizations and Current Limitations
Optimizations Directly Related to the Static Execution Model
DynamicDataMask: The padding region contains zeros or garbage data, which is masked out for attention. But LayerNorm and reduction operations are not mask-aware — they include padding values in their computations, leading to incorrect results. The compiler uses DynamicDataMask to insert zeroing operations before these operators, ensuring padding does not corrupt the computation.
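A small NumPy illustration of the failure mode, using NaN as a stand-in for garbage in the padding slots (the zeroing step mimics what the text attributes to DynamicDataMask; it is not the compiler's actual pass):

```python
import numpy as np

# Two valid tokens plus one padding slot holding garbage (NaN stands in for
# uninitialized memory).
hidden = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [np.nan, np.nan]])
mask = np.array([1, 1, 0])

naive = hidden.sum(axis=0)                        # padding garbage poisons the reduction
zeroed = np.where(mask[:, None] > 0, hidden, 0.0) # zero the padding region first
fixed = zeroed.sum(axis=0)                        # padding now contributes exact zeros
print(naive)  # [nan nan]
print(fixed)  # [4. 6.]
```

Note that np.where is used rather than multiplying by the mask: NaN times zero is still NaN, so genuine zeroing, not masking by multiplication, is required.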
LM Head Separation: The final layer’s vocabulary projection (hidden_size x vocab_size) is a very large matrix multiplication. When vocab_size is large (e.g., 32000+), the NPU may not be faster than the CPU. NPUW can carve this operation out and execute it on the CPU instead.
Other Optimizations (Brief)
- Prefix Caching: In multi-turn conversations, the system prompt’s KV cache can be cached and reused, skipping redundant prefill
- Speculative Decoding support: trim_kvcache_for_speculative_decoding handles KV cache truncation for rejected tokens
Current Limitations
The following limitations all stem from the NPU’s static execution model:
| Limitation | Cause | GPU Comparison |
|---|---|---|
| Fixed KV cache capacity | Blob shapes are immutable | PagedAttention pages on demand |
| batch_size = 1 | Each batch size requires separate compilation | Continuous batching |
| KV cache transfer overhead (~512MB) | Prefill and generate are separate blobs | Direct access within the same kernel |
| Slow cold start (multi-variant compilation) | Multiple generate blobs | Runtime JIT compilation |
Reflecting on the Programming Model: The Boundaries of ONNX and Lessons from CuTe
Four Cracks in “The Compiler Handles Everything”
Looking back at the preceding content, the NPU’s programming model is essentially “developers provide an ONNX model, and the compiler handles all optimization.” This model works well for standard scenarios, but in LLM inference it exposes four structural issues:
- Each attention variant requires hand-written SHAVE kernels: sdpa -> incremental_sdpa -> flash_sdpa; new variants (Sliding Window, Cross Attention, Linear Attention) must wait for the compiler team to implement them, with turnaround measured in months
- Side effects of static shapes require dedicated patches: DynamicDataMask is one example. A different KV cache management strategy might require new patches
- Generate variants are brute-force enumeration of runtime dynamism: Currently only KV cache capacity is enumerated. If batch size and beam width also need enumeration, the combinatorial explosion is severe
- KV cache transfer is the price of an architectural constraint: The prefill-to-generate copy exists because the two blobs do not share an address space. If operator authors could control memory layout, this copy could potentially be avoided
What ONNX Can and Cannot Express
ONNX (and OpenVINO IR) is fundamentally a computation graph description language:
Can express: What operations to perform (MatMul, Softmax, Add…) and the data dependencies between them
Cannot express:
- Tiling strategies: The core of Flash Attention is not the Q x K x V operations themselves, but how to tile, and how tiles pass running_max and running_sum between each other
- Memory layout preferences: [batch, heads, seq_len, dim] vs [batch, seq_len, heads, dim] for the KV cache has a massive impact on DMA efficiency, but ONNX does not express this preference
- Fusion decisions: Vertical Fusion is a huge optimization, but ONNX does not describe which intermediate results should stay on-chip
Lessons from CuTe: A Third Way
Currently, the NPU’s programming abstraction has only two extremes:
- ONNX: Very high-level, does not express tiling — all optimization is left to the compiler
- Raw SHAVE ASM: Very low-level, expresses everything — but the development barrier is extremely high
NVIDIA’s CuTe (the tile abstraction in CUTLASS 3.0) carved out a middle layer on the GPU: operator authors control tile sizes and loop structures, while the compiler handles DMA and hardware mapping.
Imagine if the NPU had a CuTe-style DSL:
- Flash SDPA would not require the compiler team to hand-write SHAVE kernels; operator authors could directly write tiled attention
- Tiling strategies could be tuned to model characteristics (long context -> large KV tiles, small models -> no tiling needed)
- New algorithm validation cycles would shrink from months to days
| Dimension | ONNX | CuTe-style DSL |
|---|---|---|
| What it expresses | Computation graph (what to do) | Tiled algorithm (how to tile, how to pass state) |
| Tiling | Not expressed, engine handles automatically | Operator author specifies tile shapes |
| DMA/memory transfers | Not expressed, engine handles automatically | Not expressed, compiler handles automatically |
| New operators | Wait for engine support (months) | Operator author writes directly (days) |
| Tuning space | Virtually zero | Tile sizes, loop ordering, partitioning strategies |
| Portability | Excellent (cross-hardware) | Good (across same-vendor hardware generations) |
| Development barrier | Very low (export model) | Moderate (requires understanding tiling concepts) |
Key insight: Both CuTe and ONNX hide DMA, but CuTe exposes tiling. Tiling is the interface between algorithm and hardware — looking up, it depends on algorithmic knowledge (only the algorithm author knows how to pass state between tiles); looking down, it depends on hardware knowledge (only the compiler knows CMX capacity and DMA bandwidth).
NPU vs GPU Execution Model Comparison
NPU:
- All planning completed at compile time
- The blob contains all task descriptors
- The management core reads and executes them sequentially
- Zero runtime decisions (deterministic execution)
- 🎵 Like playing a pre-recorded album
GPU:
- Dynamic scheduling at runtime
- The host initiates kernel launches
- The warp scheduler assigns work to SMs
- Dynamic resource allocation (adjusted on demand)
- 🎷 Like a jazz band improvising live
| NPU | GPU | Notes |
|---|---|---|
| CMX | Shared Memory | High-speed on-chip scratchpad |
| DMA prefetch | async memcpy + streams | Hides memory latency |
| Barrier | Stream event / __syncthreads | Synchronization mechanism |
| Blob (ELF) | Kernel binary (cubin) | Compiled executable |
| Management core | Warp scheduler | Task dispatch unit |
| DPU | Tensor Cores | Fixed-function matrix ops |
| SHAVE | CUDA Cores | Programmable compute units |
A Balanced View
It is important to emphasize:
- ONNX is the right choice for 99% of users. The optimization paths for standard operations (Conv, MatMul, attention) are already good enough
- CuTe targets the 1% who are operator authors — but their productivity determines the pace of optimization inside inference engines
- The two are complementary, not substitutes: ONNX users need not change anything, yet they benefit from faster SDPA implementations
The bottleneck in today’s NPU ecosystem is human capital: the number of people worldwide who can write SHAVE kernels + npu_compiler passes may not exceed a few dozen. Model authors cannot experiment with how new attention patterns perform on the NPU — they can only wait for the compiler team to implement the operators that are already supported. The NPU’s positioning (a low-power AI assistant in laptops) makes these limitations acceptable for now, but if it needs to support a broader model ecosystem, some form of operator programmability may be unavoidable.
Summary
This article traced a complete arc from hardware execution to the boundaries of the programming model:
- Execution model: A 32-layer Transformer is unrolled into a flat task list; the NPU runs fully autonomously after a single submission
- Pipeline overlap: Three pipelines (DMA/DPU/SHAVE) cooperate through barriers, with everything planned at compile time
- Attention paths: The compiler selects the optimal implementation from three paths based on inference phase and hardware capabilities
- Tiling and fusion: DPU/SHAVE tiling addresses CMX capacity limits; Vertical Fusion eliminates DDR round trips
- Programming model reflection: ONNX does not express tiling; CuTe exposes tiling — this is the intermediate abstraction layer the NPU may need in the future
The NPU’s positioning: Low-power, single-user LLM inference on laptops. batch=1 is the norm, fixed KV cache capacity covers most conversational scenarios, and the core advantage lies in low power consumption and freeing up the GPU.
The ceiling of hardware is not just compute and bandwidth — it is also how many people the programming model allows to write efficient code for it.
Further Reading
- The FeasibleMemoryScheduler in the npu_compiler source is the best entry point for understanding compile-time scheduling decisions.
- The Flash Attention paper describes the original design of the online softmax algorithm and tiling strategy.
- CUTLASS 3.0 / CuTe demonstrates how tile abstraction on the GPU decouples algorithm from hardware.
- The CUDA Programming Model and the Graph Compilation Optimization learning path provide a GPU-side comparative perspective.