LLM Inference on NPU: KV Cache and the Software Stack
Updated 2026-04-15
Introduction
In the previous article, we explored the NPU’s NCE architecture (DPU + SHAVE), the CMX/DDR two-level memory hierarchy, and the role of the management core. This article dives into the central challenge of NPU inference: LLM KV cache naturally grows dynamically, but the NPU can only execute computation graphs with static shapes — how is this contradiction resolved?
Starting from the KV cache contradiction, we will dissect layer by layer how the three-level software stack (openvino.genai, NPUW, npu_compiler) works together, understand the design of the prefill/generate dual-model approach, and trace the complete inference call path from host to NPU. Finally, we tie everything together with an end-to-end example.
KV Cache Recap and the NPU Contradiction
In Transformer autoregressive generation, each new token requires the Key and Value vectors from all previous tokens to compute attention. To avoid redundant computation, these vectors are cached — this is the KV cache (for detailed principles, see the Prefill vs Decode article).
On GPUs, KV cache management is relatively straightforward — memory can be dynamically allocated at runtime, and vLLM’s PagedAttention can even page on demand. But the NPU’s execution model is fundamentally different from the GPU’s:
The NPU’s blob execution model requires everything to be determined at compile time:
- The exact shape of every tensor (specific values for each dimension)
- The memory address of each data block in CMX/DDR
- The timing and ordering of all DMA transfers
- DPU task descriptors, SHAVE kernel machine code, barrier configurations
The compiled artifact is called a blob — its format is standard ELF (Executable and Linkable Format), containing all the information above. At runtime, the NPU management core executes task descriptors from the blob sequentially, making no decisions whatsoever.
This means the seq_len dimension of the KV cache tensor [batch, heads, seq_len, head_dim] must be a compile-time constant. But by nature, the KV cache grows incrementally during generation — at the 1st token seq_len=1, at the 100th token seq_len=100.
The core contradiction: KV cache is inherently dynamic, but the NPU can only execute computation graphs with static shapes.
The only thing that can vary at runtime is the address of input/output tensors (via the ELF relocation mechanism), but shapes can never change.
The Solution: Pre-allocation + Attention Mask
The solution is surprisingly intuitive: since shapes cannot change, pre-allocate a fixed-size buffer and turn “dynamic growth” into “moving the write position within a fixed space.”
NPUW (NPU Wrapper) controls the buffer size with two parameters:
- MAX_PROMPT_LEN: maximum prompt length (default 1024)
- MIN_RESPONSE_LEN: reserved generation space (default 128)
- Total capacity = 1024 + 128 = 1152
The KV cache tensor is always allocated as [batch, heads, 1152, head_dim] — regardless of how many positions are actually in use.
The Role of Attention Mask
Unused positions in the buffer contain zeros or garbage data, and we must ensure this padding does not affect the attention computation. This is achieved through the attention mask — a 0/1 vector of the same length as seq_len:
- 1 indicates a valid data position
- 0 indicates a padding position
During the softmax step of the attention computation, positions where mask=0 are set to negative infinity (-inf). After softmax, the corresponding weights become zero, completely ignoring the padding content.
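To make the mechanics concrete, here is a minimal NumPy sketch of masked softmax (illustrative only; the actual NPU kernels are compiled DPU/SHAVE code, not Python):

```python
import numpy as np

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """scores: attention scores for one query; mask: 0/1 vector of the same length."""
    scores = np.where(mask == 1, scores, -np.inf)  # mask=0 positions become -inf
    scores = scores - scores.max()                 # shift for numerical stability
    weights = np.exp(scores)                       # exp(-inf) == 0
    return weights / weights.sum()

scores = np.array([2.0, 1.0, 0.5, 3.0, 9.9, 9.9])  # last two are garbage padding
mask   = np.array([1,   1,   1,   1,   0,   0])
print(masked_softmax(scores, mask))                # padding gets exactly 0 weight
```

Note that the garbage values in the padding positions can be arbitrarily large; the -inf substitution guarantees they contribute nothing.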
Concrete Example
Suppose the prompt is “What is NPU” (4 tokens) and the buffer capacity is 1152:
When generating the 1st token:
- KV cache: [K1, K2, K3, K4, 0, 0, ..., 0] (4 valid + 1148 padding)
- Mask: [1, 1, 1, 1, 0, 0, ..., 0]
When generating the 100th token:
- KV cache: [K1, K2, ..., K103, 0, ..., 0] (103 valid + 1049 padding)
- Mask: [1, 1, ..., 1, 0, ..., 0] (103 ones)
Key insight: The physical size never changes (always 1152). What changes is only the valid boundary and the number of 1s in the mask. The NPU executes the exact same blob every time — only the input data (input_ids, mask, position_ids, KV cache contents) differs.
The Three-Layer Software Stack
NPU-based LLM inference involves three software layers, each with clearly defined responsibilities:
openvino.genai (Top Layer): Application Framework
The user-facing high-level interface, centered on StatefulLLMPipeline:
- Tokenization: Converts user input text into token ID sequences
- Sampling strategies: Greedy search, Top-K, Top-P, and other decoding strategies
- Chat history management: Context concatenation and truncation for multi-turn conversations
- Flow control: When to prefill, when to decode, when to truncate and re-prefill due to excessive history length
The genai layer does not care whether the underlying device is a GPU or NPU — it simply calls OpenVINO’s inference interface.
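From the application's point of view, the whole stack hides behind a few calls. A minimal usage sketch with the openvino_genai Python bindings follows; the model path is hypothetical, and the MAX_PROMPT_LEN / MIN_RESPONSE_LEN properties follow OpenVINO's NPU documentation and may vary between releases:

```python
import openvino_genai as ov_genai

# Hypothetical path to a model exported to OpenVINO IR; the device string
# "NPU" routes the pipeline through the NPUW layer described below.
pipe = ov_genai.LLMPipeline(
    "TinyLlama-1.1B-Chat-ov",
    "NPU",
    MAX_PROMPT_LEN=1024,    # prefill capacity
    MIN_RESPONSE_LEN=128,   # reserved generation space (total buffer = 1152)
)
print(pipe.generate("What is NPU", max_new_tokens=64))
```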
NPUW (Middle Layer): NPU Wrapper, the Core Scheduler
NPUW is the “brain” of the entire NPU LLM inference pipeline, responsible for translating dynamic LLM inference requirements into static execution that the NPU can understand:
- Model splitting: Splits a single dynamic-shape LLM model into two static-shape sub-models: prefill and generate
- Compilation management: Compiles each sub-model into an NPU blob (via npu_compiler)
- KV cache management: Buffer allocation, zeroing (for new conversations), and KV cache transfer between prefill and generate
- Chunked prefill: Handling long prompts through segmented processing
- Task submission: Submits inference tasks to the NPU via the Level Zero API
npu_compiler (Bottom Layer): The Compiler
The compiler is responsible for compiling OpenVINO IR (Intermediate Representation) into NPU blobs:
- Completely unaware of what KV cache is — it only sees tensors marked as “stateful” (ReadValue/Assign operation pairs)
- Converts stateful operations into ordinary input/output parameters of the blob
- Plans all DMA transfer timing, DPU/SHAVE task scheduling, and barrier synchronization
- Generates the blob in ELF format
In one sentence: genai decides when to infer, NPUW decides how to infer, and the compiler decides what the hardware executes.
NPUW’s Core Design: Two Models, One KV Cache
Why Two Models?
LLM inference has two phases — prefill and decode (see Prefill vs Decode for details) — and the input_ids lengths differ drastically between them:
- Prefill: Processes the entire prompt at once; the input_ids seq_len can range from hundreds to thousands
- Decode (Generate): Processes only 1 new token at a time; the input_ids seq_len is fixed at 1
Since all tensor shapes in an NPU blob must be determined at compile time, a single blob cannot accommodate both seq_len values. NPUW’s solution is to compile two separate blobs:
| | Prefill Model | Generate Model |
|---|---|---|
| input_ids seq_len | 1024 | 1 |
| KV cache output | [batch, heads, 1024, head_dim] | [batch, heads, 1152, head_dim] |
| KV cache input | None (first generation) | [batch, heads, 1152, head_dim] |
| When executed | Upon receiving the prompt | During token-by-token generation |
Inference Flow
The entire inference process is a relay between the prefill and generate blobs:
- User inputs a prompt (e.g., “What is NPU”, 4 tokens)
- Call the prefill blob: Input input_ids = [t1, t2, t3, t4, 0, ..., 0] (padded to 1024); output the first token + KV cache (present tensors)
- copy_kvcache(): Copy the prefill's output KV cache into the generate model's input location. This is a parallel copy of 64 tensors (32 layers x K + V), performing slice alignment: prefill.present[0:N] -> generate.past[0:N]
- Loop calling the generate blob: Each iteration inputs 1 token and outputs the next token + updated KV cache
- Termination: Upon encountering the EOS token or reaching the maximum length
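The relay condenses into a runnable toy sketch. The two "blobs" here are plain Python stand-ins, and copy_kvcache / update_kvcache_for are reduced to list assignments; only the control flow mirrors NPUW:

```python
MAX_PROMPT_LEN, MIN_RESPONSE_LEN = 1024, 128
CAPACITY = MAX_PROMPT_LEN + MIN_RESPONSE_LEN      # 1152 positions, never resized

def toy_prefill(prompt_ids):
    present = [("kv", i) for i in range(len(prompt_ids))]
    return prompt_ids[-1] + 1, present            # (first sampled token, present KV)

def toy_generate(token, past, pos):
    return token + 1, ("kv", pos + 1)             # (next token, one new KV row)

def relay(prompt_ids, eos_id=9):
    n = len(prompt_ids)
    past = [None] * CAPACITY                      # pre-allocated fixed-size buffer
    token, present = toy_prefill(prompt_ids)
    past[:n] = present                            # copy_kvcache(): present[0:N] -> past[0:N]
    out = [token]
    while token != eos_id and n + len(out) < CAPACITY:
        pos = n + len(out) - 1                    # position of the latest token
        token, row = toy_generate(token, past, pos)
        past[pos + 1] = row                       # update_kvcache_for(): one row only
        out.append(token)
    return out

print(relay([1, 2, 3, 4]))                        # [5, 6, 7, 8, 9]
```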
KV Cache Update
During the generate loop, after each inference, update_kvcache_for() only needs to copy the single newly added row into the past KV cache, and the num_stored_tokens counter increments by 1. This is much lighter than copy_kvcache() (which copies the entire prefill output).
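A back-of-envelope comparison shows why this matters, using the same illustrative geometry as the Limitations section below (32 layers, 32 heads, head_dim 128, FP16):

```python
layers, kv, heads, head_dim, fp16 = 32, 2, 32, 128, 2   # kv = K and V tensors

full_copy = layers * kv * heads * 1024 * head_dim * fp16  # copy_kvcache: 1024 rows
per_token = layers * kv * heads * 1    * head_dim * fp16  # update_kvcache_for: 1 row

print(full_copy // 2**20, "MiB vs", per_token // 2**10, "KiB")  # 512 MiB vs 512 KiB
```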
Handling Reality: Generate Variants and Chunked Prefill
Generate Variants
The design above has an efficiency issue: the KV cache capacity is 1152, but if the prompt is only 20 tokens, each decode step still traverses the full 1152-length KV cache — most of which is invalid padding.
NPUW’s solution is to compile multiple generate variants, each with a different KV cache capacity:
| Variant | KV Cache Capacity |
|---|---|
| Variant 1 | 256 |
| Variant 2 | 512 |
| Variant 3 | 1024 |
| Variant 4 | 1152 (maximum) |
At runtime, select_generate_request() chooses the smallest sufficient variant:
- 20 prompt tokens + 128 reserved = 148 -> select 256
- 400 prompt tokens + 128 reserved = 528 -> select 1024 (528 exceeds the 512 variant)
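A sketch of this selection logic, assuming the four variant capacities from the table above (the function name follows the NPUW helper mentioned earlier, but the body is a simplification):

```python
VARIANTS = [256, 512, 1024, 1152]   # compiled generate blobs, ascending capacity
MIN_RESPONSE_LEN = 128

def select_generate_request(prompt_len: int) -> int:
    need = prompt_len + MIN_RESPONSE_LEN
    for capacity in VARIANTS:        # smallest variant that fits prompt + reserve
        if capacity >= need:
            return capacity
    raise ValueError("prompt exceeds maximum KV cache capacity")

print(select_generate_request(20))   # needs 148 -> 256
print(select_generate_request(400))  # needs 528 -> 1024
```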
Memory optimization: All variants share the same memory buffer. The largest variant (1152) allocates the entire block, and smaller variants are prefix slices of that block — no additional allocation needed.
Trade-off: Compile time increases (each variant requires compiling a separate blob). NPUW mitigates this through the EXPORT_BLOB mechanism, which caches compiled blobs to disk. On subsequent launches, blobs are loaded directly, skipping compilation.
(Interactive demo: NPUW generate variant selection. All variants share one contiguous memory buffer; the largest variant (1152) allocates the whole block, smaller variants are prefix slices of it, and the runtime picks the smallest variant that fits prompt + 128 reserved positions.)
Chunked Prefill
When a prompt exceeds MAX_PROMPT_LEN (1024), a single prefill blob cannot fit it, and chunked processing is required:
- Suppose the prompt has 2048 tokens
- First prefill round: Process token[0:1024], write KV cache output to past
- Second prefill round: Process token[1024:2048], read past (from the first round) + write new present
- After each round, present is appended to past, accumulating KV state
- After all rounds complete: Execute copy_kvcache() to transfer the complete KV cache to the generate blob
This chunking mechanism ensures that prompts of any length can be processed (within the total capacity limit).
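A toy sketch of the chunking loop follows. run_prefill_blob stands in for a real prefill blob invocation; the toy version ignores past, whereas the real blob reads it so later chunks can attend to earlier ones:

```python
MAX_PROMPT_LEN = 1024

def chunked_prefill(prompt_ids, run_prefill_blob):
    past = []                                       # KV state accumulated across rounds
    for start in range(0, len(prompt_ids), MAX_PROMPT_LEN):
        chunk = prompt_ids[start:start + MAX_PROMPT_LEN]
        present = run_prefill_blob(chunk, past)     # real blob attends over past + chunk
        past += present                             # append this round's present to past
    return past                                     # handed to copy_kvcache() afterwards

toy_blob = lambda chunk, past: [("kv", t) for t in chunk]
print(len(chunked_prefill(list(range(2048)), toy_blob)))  # 2048 accumulated KV rows
```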
The Compiler: Turning Stateful into Stateless
The Stateful Mechanism in OpenVINO IR
In OpenVINO’s Intermediate Representation (IR), KV cache implements stateful inference through ReadValue / Assign operation pairs:
ReadValue("kv_k_layer0") -> Read the K cache saved from the previous inference from a "variable"
... compute new K cache ...
Assign("kv_k_layer0", new_value) -> Write the new K cache back to the "variable"
This mechanism makes the model appear to have “memory” — each inference automatically reads the previous state and updates it after computation.
The Compiler’s Transformation: Lambda Lifting
The ConvertAssignReadValueToReturnsAndInputs pass in npu_compiler performs a key transformation:
- ReadValue -> function input parameter (KV cache passed in from outside)
- Assign -> function output value (updated KV cache returned as output)
After transformation, the blob is a pure function: KV cache comes in through the input, goes out through the output after being updated, and the blob itself holds no state.
This transformation has a specific name in compiler theory — lambda lifting: promoting implicitly captured mutable state (the ReadValue/Assign reads/writes to “variables”) into explicit input/output parameters of the function. Unlike SSA (Static Single Assignment), lambda lifting focuses on eliminating implicit dependencies on external mutable state.
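The idea is easy to demonstrate in plain Python. The "before" function hides state in a captured variable (playing the role of ReadValue/Assign); the "after" version is the lifted pure function, which is exactly the shape a blob requires:

```python
# "Before": state hides in a captured variable, like ReadValue/Assign.
def make_stateful_layer():
    kv_cache = []                        # implicit mutable state
    def layer(x):
        kv_cache.append(x)               # the Assign side
        return sum(kv_cache)             # the ReadValue + compute side
    return layer

# "After" lambda lifting: state is an explicit input and an explicit output.
def lifted_layer(x, kv_cache_in):
    kv_cache_out = kv_cache_in + [x]     # former Assign, now a return value
    return sum(kv_cache_out), kv_cache_out

f = make_stateful_layer()
print(f(1), f(2))                        # 1 3  (hidden state between calls)
y1, kv = lifted_layer(1, [])
y2, kv = lifted_layer(2, kv)
print(y1, y2)                            # 1 3  (same results, explicit state)
```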
State Management Moves to the Runtime
The responsibility of “remembering state” shifts from the compiler to NPUW’s ZeroVariableState:
- Holds Level Zero memory (device memory accessible by the NPU), storing the KV cache buffer
- set_state() / get_state(): Reads/writes the KV cache before and after each inference
- reset(): memset(0) zeros the entire buffer when a new conversation begins
The benefit of this separation is that the compiler only needs to handle pure functions, and the runtime only needs to manage memory — each handles its own concern, with no interference.
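A minimal sketch of such a variable-state object, with a NumPy array standing in for the Level Zero device buffer (class and field names here are illustrative, not the actual ZeroVariableState implementation):

```python
import numpy as np

class VariableState:
    def __init__(self, shape, dtype=np.float16):
        self.buffer = np.zeros(shape, dtype=dtype)  # device-memory stand-in

    def get_state(self):
        return self.buffer                          # read KV cache after inference

    def set_state(self, value):
        self.buffer[...] = value                    # write KV cache before inference

    def reset(self):
        self.buffer.fill(0)                         # memset(0) on a new conversation

state = VariableState((1, 32, 1152, 128))           # [batch, heads, capacity, head_dim]
state.reset()                                       # new chat: zero the whole buffer
```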
From Host to NPU: A Single Inference Call
Now that we understand the software stack’s division of labor, let us trace the complete path of a single inference call — from blob loading to result retrieval.
Blob Loading (One-time, Completed at Startup)
- ELF parsing: The ELF Parser in the driver parses the blob file and creates a HostParsedInference (HPI) object
- Memory allocation: Allocates NPU memory with different attributes for different data types:
- Executable code (SHAVE kernels) -> WriteCombineFw memory
- SHAVE data segments -> WriteCombineShave memory
- DMA descriptors -> WriteCombineDma memory
- Static relocation: Patches cross-references inside the blob — replacing relative offsets with actual NPU device addresses
- Metadata extraction: Reads input/output tensor names, shapes, and data types for NPUW to use
NPUW caches compiled blobs to disk via the EXPORT_BLOB mechanism. On subsequent launches, blobs are loaded directly, skipping the time-consuming compilation process.
Per-Inference Path
Every inference (whether prefill or generate) goes through the following steps:
1. Prepare inputs
NPUW prepares four categories of input tensors:
- input_ids: Token ID sequence (padded to 1024 for prefill, only 1 token for generate)
- attention_mask: 0/1 vector marking valid positions
- position_ids: Position encoding indices (corresponding to the number of 1s in the mask)
- past KV cache: The KV cache buffer from the previous round (all zeros for the first prefill)
2. JIT Relocation (applyInputOutput)
The blob does not know the actual address of the KV cache buffer at compile time — this address is determined at runtime. applyInputOutput traverses relocation entries in the blob marked as VPU_SHF_USERINPUT and writes the KV cache buffer’s NPU virtual address into the corresponding positions of DMA task descriptors.
This is like a book whose table of contents has blank page numbers at print time, which are filled in with the actual page numbers after binding.
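A toy model of the patching step; the descriptor fields and entry layout are invented for illustration, while VPU_SHF_USERINPUT is the real section flag mentioned above:

```python
RELOC_ENTRIES = [   # "table of contents with blank page numbers", per the analogy
    {"descriptor": 0, "field": "src_addr", "symbol": "kv_cache_in"},
    {"descriptor": 3, "field": "dst_addr", "symbol": "kv_cache_out"},
]

def apply_input_output(descriptors, runtime_addresses):
    """Patch runtime buffer addresses into DMA task descriptors."""
    for reloc in RELOC_ENTRIES:      # entries flagged VPU_SHF_USERINPUT in the blob
        descriptors[reloc["descriptor"]][reloc["field"]] = \
            runtime_addresses[reloc["symbol"]]

descriptors = [{} for _ in range(4)]
apply_input_output(descriptors, {"kv_cache_in": 0x8000_0000,
                                 "kv_cache_out": 0x8020_0000})
print(hex(descriptors[0]["src_addr"]))   # 0x80000000
```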
3. Submit for Execution
The prepared command list is submitted to the NPU command queue. Under the hood, this is done via DRM_IVPU_CMDQ_SUBMIT ioctl (on Linux) or the Level Zero API.
4. NPU Autonomous Execution
The management core (RISC-V on 40xx) takes over:
- Reads task descriptors one by one
- Checks barrier synchronization conditions (producer/consumer counts)
- Dispatches to DMA / DPU / SHAVE once conditions are met
- Writes the fence value after all tasks complete
5. Host Detects Completion
The host has two wait strategies:
- Interrupt wait (DRM_IVPU_BO_WAIT): CPU sleeps waiting for an interrupt, saving power
- Polling wait (UMONITOR/UMWAIT): CPU actively monitors, achieving lower latency
6. Read Outputs
NPUW reads two categories of outputs:
- logits: The probability distribution over the next token, passed to the genai layer for sampling
- present KV cache: The KV cache produced by this inference; update_kvcache_for() appends the new portion to the past buffer
Mutable Command Lists Optimization
By default, every inference requires recreating the command list. The Level Zero experimental extension ZE_experimental_mutable_command_list (requires Level Zero spec 1.9+) provides an optimization:
- First inference: Creates the complete command list
- Subsequent inferences: Only calls updateMutableCommands() to update the changed tensor pointers
Analogy: On an already-recorded “script,” only a few parameters (tensor addresses) are changed, without re-recording the entire script. This reduces the CPU overhead of command list creation.
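The reuse pattern, reduced to a conceptual Python sketch (this is not the Level Zero C API; the class and method names just mirror the record-once, patch-pointers idea):

```python
class CommandList:
    def __init__(self, tasks):
        self.tasks = tasks                         # recorded once (expensive)

    def update_mutable_commands(self, addresses):
        for task in self.tasks:                    # patch pointers only (cheap)
            task["io_addr"] = addresses[task["tensor"]]

cmdlist = CommandList([{"tensor": "input_ids", "io_addr": None},
                       {"tensor": "logits",    "io_addr": None}])
for step in range(3):                              # the per-token decode loop
    cmdlist.update_mutable_commands({"input_ids": 0x1000 + step,
                                     "logits":    0x2000 + step})
    # the same command list would be resubmitted to the NPU queue here
```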
End-to-End Walkthrough
Let us use a concrete example to connect all the concepts above.
Setup: Input “Hello” (1 token, ID 15496), KV cache total capacity 1152, 2 generate variants compiled (256 and 1152). The model is a 32-layer Transformer.
Step 1: Initialization
NPUW clones the LLM model into two copies:
- Prefill model (input_ids seq_len = 1024)
- Generate model (input_ids seq_len = 1)
npu_compiler compiles 3 blobs: 1 prefill + 2 generate variants (256 and 1152). Each blob undergoes ELF parsing and static relocation.
Step 2: New Conversation Starts
- memset(0) zeros all KV cache buffers
- select_generate_request(1): Needs 1 + 128 = 129 positions -> selects the 256 variant
Step 3: Prefill
Inputs:
- input_ids = [15496, 0, 0, ..., 0] (1 valid token + 1023 padding)
- attention_mask = [1, 0, 0, ..., 0]
The NPU executes the 32-layer Transformer: each layer sequentially performs DMA weights into CMX -> DPU computes QKV projection -> SHAVE executes RoPE -> SHAVE executes SDPA -> DPU computes FFN.
Output: The first token (e.g., ”,”) + KV cache for all 32 layers (present tensors).
Step 4: Prefill to Generate Switchover
copy_kvcache() copies the prefill’s present KV cache into the generate(256) past buffer:
prefill.present[0:1] -> generate_256.past[0:1]
A parallel copy of 64 tensors (32 layers x K + V).
Step 5: Generate Loop
Each iteration:
- input_ids = [0, ..., 0, token] (right-aligned, only the last position contains the real token)
- position_ids = [0, ..., 0, N] (N is the current token's position in the sequence)
- attention_mask: The number of 1s increments (2 ones in round 1, 3 ones in round 2, ...)
The NPU executes the generate blob -> outputs the next token. update_kvcache_for() writes only the newly added row of KV cache.
Step 6: Termination
Upon encountering the EOS token or reaching the 256 capacity limit -> detokenize -> return the complete text.
Summary and Outlook
The core idea of this article can be summarized in one sentence: Turn “dynamic growth” into “moving the write position within a fixed space.”
Behind this deceptively simple idea lies precision-engineered collaboration:
- NPUW splits models, manages buffers, and coordinates KV cache transfer between prefill and generate
- npu_compiler compiles stateful models into pure-function blobs through lambda lifting
- Level Zero / DRM driver provides JIT relocation and task submission mechanisms
- NPU management core executes autonomously according to the task list determined at compile time
Current Limitations
All of these limitations stem from the NPU’s static execution model:
- Fixed KV cache capacity: When capacity is exceeded, the only option is to truncate history and re-prefill; GPU’s PagedAttention has no such limitation
- batch_size = 1: No support for continuous batching; cannot process multiple requests simultaneously
- KV cache transfer overhead: The copy_kvcache from prefill to generate can be on the order of ~512MB (32 layers x 2 x 32 heads x 1024 seq x 128 dim x 2 bytes FP16)
- Multiple compilations: Multiple generate variants = longer cold start times, mitigated by the blob caching mechanism
Next Steps
How exactly do these blobs execute on the NPU hardware? What scheduling decisions does the compiler make to hide memory latency? What implementation paths exist for attention on the NPU? Where are the ceilings of the programming model? These questions are explored in the next article.
Further Reading
- The OpenVINO GenAI Guide provides detailed instructions on using and configuring the StatefulLLMPipeline.
- The ConvertAssignReadValueToReturnsAndInputs pass in the npu_compiler source is the best entry point for understanding the stateful-to-stateless transformation.
- The Level Zero Specification defines the low-level API for NPU device interaction, including command lists, fences, and the mutable command list extension.