LLM Inference on NPU: KV Cache and the Software Stack
Updated 2026-04-15
Introduction
In the previous article, we explored the NPU’s NCE architecture (DPU + SHAVE), the CMX/DDR two-level memory hierarchy, and the role of the management core. This article dives into the central challenge of NPU inference: LLM KV cache naturally grows dynamically, but the NPU can only execute computation graphs with static shapes — how is this contradiction resolved?
Starting from the KV cache contradiction, we will dissect layer by layer how the three-level software stack (openvino.genai, NPUW, npu_compiler) works together, understand the design of the prefill/generate dual-model approach, and trace the complete inference call path from host to NPU. Finally, we tie everything together with an end-to-end example.
KV Cache Recap and the NPU Contradiction
In Transformer autoregressive generation, each new token requires the Key and Value vectors from all previous tokens to compute attention. To avoid redundant computation, these vectors are cached — this is the KV cache (for detailed principles, see the Prefill vs Decode article).
On GPUs, KV cache management is relatively straightforward — memory can be dynamically allocated at runtime, and vLLM’s PagedAttention can even page on demand. But the NPU’s execution model is fundamentally different from the GPU’s:
The NPU’s blob execution model requires everything to be determined at compile time:
- The exact shape of every tensor (specific values for each dimension)
- The memory address of each data block in CMX/DDR
- The timing and ordering of all DMA transfers
- DPU task descriptors, SHAVE kernel machine code, barrier configurations
The compiled artifact is called a blob — its format is standard ELF (Executable and Linkable Format), containing all the information above. At runtime, the NPU management core executes task descriptors from the blob sequentially, making no decisions whatsoever.
This means the seq_len dimension of the KV cache tensor [batch, heads, seq_len, head_dim] must be a compile-time constant. But by nature, the KV cache grows incrementally during generation — at the 1st token seq_len=1, at the 100th token seq_len=100.
The core contradiction: KV cache is inherently dynamic, but the NPU can only execute computation graphs with static shapes.
The only thing that can vary at runtime is the address of input/output tensors (via the ELF relocation mechanism), but shapes can never change.
The Solution: Pre-allocation + Attention Mask
The solution is surprisingly intuitive: since shapes cannot change, pre-allocate a fixed-size buffer and turn “dynamic growth” into “moving the write position within a fixed space.”
NPUW (NPU Wrapper) controls the buffer size with two parameters:
- MAX_PROMPT_LEN: maximum prompt length (default 1024)
- MIN_RESPONSE_LEN: reserved generation space (default 128)
- Total capacity = 1024 + 128 = 1152
The KV cache tensor is always allocated as [batch, heads, 1152, head_dim] — regardless of how many positions are actually in use.
The Role of Attention Mask
Unused positions in the buffer contain zeros or garbage data, and we must ensure this padding does not affect the attention computation. This is achieved through the attention mask — a 0/1 vector of the same length as seq_len:
- 1 indicates a valid data position
- 0 indicates a padding position
During the softmax step of the attention computation, positions where mask=0 are set to negative infinity (-inf). After softmax, the corresponding weights become zero, completely ignoring the padding content.
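To make the mechanics concrete, here is a minimal NumPy sketch of masked softmax (illustrative only; the actual NPU kernels are compiled DPU/SHAVE code, not Python):

```python
import numpy as np

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """scores: attention scores for one query; mask: 0/1 vector of the same length."""
    scores = np.where(mask == 1, scores, -np.inf)  # mask=0 positions become -inf
    scores = scores - scores.max()                 # shift for numerical stability
    weights = np.exp(scores)                       # exp(-inf) == 0
    return weights / weights.sum()

scores = np.array([2.0, 1.0, 0.5, 3.0, 9.9, 9.9])  # last two are garbage padding
mask   = np.array([1,   1,   1,   1,   0,   0])
print(masked_softmax(scores, mask))                # padding gets exactly 0 weight
```

Note that the garbage values in the padding positions can be arbitrarily large; the -inf substitution guarantees they contribute nothing.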
Concrete Example
Suppose the prompt is “What is NPU” (4 tokens) and the buffer capacity is 1152:
When generating the 1st token:
- KV cache: [K1, K2, K3, K4, 0, 0, ..., 0] (4 valid + 1148 padding)
- Mask: [1, 1, 1, 1, 0, 0, ..., 0]
When generating the 100th token:
- KV cache: [K1, K2, ..., K103, 0, ..., 0] (103 valid + 1049 padding)
- Mask: [1, 1, ..., 1, 0, ..., 0] (103 ones)
Key insight: The physical size never changes (always 1152). What changes is only the valid boundary and the number of 1s in the mask. The NPU executes the exact same blob every time — only the input data (input_ids, mask, position_ids, KV cache contents) differs.
The Three-Layer Software Stack
NPU-based LLM inference involves three software layers, each with clearly defined responsibilities:
openvino.genai (Top Layer): Application Framework
The user-facing high-level interface, centered on StatefulLLMPipeline:
- Tokenization: Converts user input text into token ID sequences
- Sampling strategies: Greedy search, Top-K, Top-P, and other decoding strategies
- Chat history management: Context concatenation and truncation for multi-turn conversations
- Flow control: When to prefill, when to decode, when to truncate and re-prefill due to excessive history length
The genai layer does not care whether the underlying device is a GPU or NPU — it simply calls OpenVINO’s inference interface.
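From the application's point of view, the whole stack hides behind a few calls. A minimal usage sketch with the openvino_genai Python bindings follows; the model path is hypothetical, and the MAX_PROMPT_LEN / MIN_RESPONSE_LEN properties follow OpenVINO's NPU documentation and may vary between releases:

```python
import openvino_genai as ov_genai

# Hypothetical path to a model exported to OpenVINO IR; the device string
# "NPU" routes the pipeline through the NPUW layer described below.
pipe = ov_genai.LLMPipeline(
    "TinyLlama-1.1B-Chat-ov",
    "NPU",
    MAX_PROMPT_LEN=1024,    # prefill capacity
    MIN_RESPONSE_LEN=128,   # reserved generation space (total buffer = 1152)
)
print(pipe.generate("What is NPU", max_new_tokens=64))
```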
NPUW (Middle Layer): NPU Wrapper, the Core Scheduler
NPUW is the “brain” of the entire NPU LLM inference pipeline, responsible for translating dynamic LLM inference requirements into static execution that the NPU can understand:
- Model splitting: Splits a single dynamic-shape LLM model into two static-shape sub-models: prefill and generate
- Compilation management: Compiles each sub-model into an NPU blob (via npu_compiler)
- KV cache management: Buffer allocation, zeroing (for new conversations), and KV cache transfer between prefill and generate
- Chunked prefill: Handling long prompts through segmented processing
- Task submission: Submits inference tasks to the NPU via the Level Zero API
npu_compiler (Bottom Layer): The Compiler
The compiler is responsible for compiling OpenVINO IR (Intermediate Representation) into NPU blobs:
- Completely unaware of what KV cache is — it only sees tensors marked as “stateful” (ReadValue/Assign operation pairs)
- Converts stateful operations into ordinary input/output parameters of the blob
- Plans all DMA transfer timing, DPU/SHAVE task scheduling, and barrier synchronization
- Generates the blob in ELF format
In one sentence: genai decides when to infer, NPUW decides how to infer, and the compiler decides what the hardware executes.
NPUW’s Core Design: Two Models, One KV Cache
Why Two Models?
LLM inference has two phases — prefill and decode (see Prefill vs Decode for details) — and the input_ids lengths differ drastically between them:
- Prefill: Processes the entire prompt at once; the input_ids seq_len can range from hundreds to thousands
- Decode (Generate): Processes only 1 new token at a time; the input_ids seq_len is fixed at 1
Since all tensor shapes in an NPU blob must be determined at compile time, a single blob cannot accommodate both seq_len values. NPUW’s solution is to compile two separate blobs:
| | Prefill Model | Generate Model |
|---|---|---|
| input_ids seq_len | 1024 | 1 |
| KV cache output | [batch, heads, 1024, head_dim] | [batch, heads, 1152, head_dim] |
| KV cache input | None (first generation) | [batch, heads, 1152, head_dim] |
| When executed | Upon receiving the prompt | During token-by-token generation |
Inference Flow
The entire inference process is a relay between the prefill and generate blobs:
- User inputs a prompt (e.g., “What is NPU”, 4 tokens)
- Call the prefill blob: Input input_ids = [t1, t2, t3, t4, 0, ..., 0] (padded to 1024); output the first token + KV cache (present tensors)
- copy_kvcache(): Copy the prefill's output KV cache into the generate model's input location. This is a parallel copy of 64 tensors (32 layers x K + V), performing slice alignment: prefill.present[0:N] -> generate.past[0:N]
- Loop calling the generate blob: Each iteration inputs 1 token and outputs the next token + updated KV cache
- Termination: Upon encountering the EOS token or reaching the maximum length
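The relay condenses into a runnable toy sketch. The two "blobs" here are plain Python stand-ins, and copy_kvcache / update_kvcache_for are reduced to list assignments; only the control flow mirrors NPUW:

```python
MAX_PROMPT_LEN, MIN_RESPONSE_LEN = 1024, 128
CAPACITY = MAX_PROMPT_LEN + MIN_RESPONSE_LEN      # 1152 positions, never resized

def toy_prefill(prompt_ids):
    present = [("kv", i) for i in range(len(prompt_ids))]
    return prompt_ids[-1] + 1, present            # (first sampled token, present KV)

def toy_generate(token, past, pos):
    return token + 1, ("kv", pos + 1)             # (next token, one new KV row)

def relay(prompt_ids, eos_id=9):
    n = len(prompt_ids)
    past = [None] * CAPACITY                      # pre-allocated fixed-size buffer
    token, present = toy_prefill(prompt_ids)
    past[:n] = present                            # copy_kvcache(): present[0:N] -> past[0:N]
    out = [token]
    while token != eos_id and n + len(out) < CAPACITY:
        pos = n + len(out) - 1                    # position of the latest token
        token, row = toy_generate(token, past, pos)
        past[pos + 1] = row                       # update_kvcache_for(): one row only
        out.append(token)
    return out

print(relay([1, 2, 3, 4]))                        # [5, 6, 7, 8, 9]
```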
KV Cache Update
During the generate loop, after each inference, update_kvcache_for() only needs to copy the single newly added row into the past KV cache, and the num_stored_tokens counter increments by 1. This is much lighter than copy_kvcache() (which copies the entire prefill output).
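A back-of-envelope comparison shows why this matters, using the same illustrative geometry as the Limitations section below (32 layers, 32 heads, head_dim 128, FP16):

```python
layers, kv, heads, head_dim, fp16 = 32, 2, 32, 128, 2   # kv = K and V tensors

full_copy = layers * kv * heads * 1024 * head_dim * fp16  # copy_kvcache: 1024 rows
per_token = layers * kv * heads * 1    * head_dim * fp16  # update_kvcache_for: 1 row

print(full_copy // 2**20, "MiB vs", per_token // 2**10, "KiB")  # 512 MiB vs 512 KiB
```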
Handling Reality: Generate Variants and Chunked Prefill
Generate Variants
The design above has an efficiency issue: the KV cache capacity is 1152, but if the prompt is only 20 tokens, each decode step still traverses the full 1152-length KV cache — most of which is invalid padding.
NPUW’s solution is to compile multiple generate variants, each with a different KV cache capacity:
| Variant | KV Cache Capacity |
|---|---|
| Variant 1 | 256 |
| Variant 2 | 512 |
| Variant 3 | 1024 |
| Variant 4 | 1152 (maximum) |
At runtime, select_generate_request() chooses the smallest sufficient variant:
- 20 prompt tokens + 128 reserved = 148 -> select 256
- 400 prompt tokens + 128 reserved = 528 -> select 1024 (528 exceeds the 512 variant)
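A sketch of this selection logic, assuming the four variant capacities from the table above (the function name follows the NPUW helper mentioned earlier, but the body is a simplification):

```python
VARIANTS = [256, 512, 1024, 1152]   # compiled generate blobs, ascending capacity
MIN_RESPONSE_LEN = 128

def select_generate_request(prompt_len: int) -> int:
    need = prompt_len + MIN_RESPONSE_LEN
    for capacity in VARIANTS:        # smallest variant that fits prompt + reserve
        if capacity >= need:
            return capacity
    raise ValueError("prompt exceeds maximum KV cache capacity")

print(select_generate_request(20))   # needs 148 -> 256
print(select_generate_request(400))  # needs 528 -> 1024
```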
Memory optimization: All variants share the same memory buffer. The largest variant (1152) allocates the entire block, and smaller variants are prefix slices of that block — no additional allocation needed.
Trade-off: Compile time increases (each variant requires compiling a separate blob). NPUW mitigates this through the EXPORT_BLOB mechanism, which caches compiled blobs to disk. On subsequent launches, blobs are loaded directly, skipping compilation.
(Interactive demo: NPUW generate variant selection. All variants share one contiguous memory buffer; the largest variant (1152) allocates the whole block, smaller variants are prefix slices of it, and the runtime picks the smallest variant that fits prompt + 128 reserved positions.)
Chunked Prefill
When a prompt exceeds MAX_PROMPT_LEN (1024), a single prefill blob cannot fit it, and chunked processing is required:
- Suppose the prompt has 2048 tokens
- First prefill round: Process token[0:1024], write KV cache output to past
- Second prefill round: Process token[1024:2048], read past (from the first round) + write new present
- After each round, present is appended to past, accumulating KV state
- After all rounds complete: Execute copy_kvcache() to transfer the complete KV cache to the generate blob
This chunking mechanism ensures that prompts of any length can be processed (within the total capacity limit).
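A toy sketch of the chunking loop follows. run_prefill_blob stands in for a real prefill blob invocation; the toy version ignores past, whereas the real blob reads it so later chunks can attend to earlier ones:

```python
MAX_PROMPT_LEN = 1024

def chunked_prefill(prompt_ids, run_prefill_blob):
    past = []                                       # KV state accumulated across rounds
    for start in range(0, len(prompt_ids), MAX_PROMPT_LEN):
        chunk = prompt_ids[start:start + MAX_PROMPT_LEN]
        present = run_prefill_blob(chunk, past)     # real blob attends over past + chunk
        past += present                             # append this round's present to past
    return past                                     # handed to copy_kvcache() afterwards

toy_blob = lambda chunk, past: [("kv", t) for t in chunk]
print(len(chunked_prefill(list(range(2048)), toy_blob)))  # 2048 accumulated KV rows
```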
The Compiler: Turning Stateful into Stateless
The Stateful Mechanism in OpenVINO IR
In OpenVINO’s Intermediate Representation (IR), KV cache implements stateful inference through ReadValue / Assign operation pairs:
ReadValue("kv_k_layer0") -> Read the K cache saved from the previous inference from a "variable"
... compute new K cache ...
Assign("kv_k_layer0", new_value) -> Write the new K cache back to the "variable"
This mechanism makes the model appear to have “memory” — each inference automatically reads the previous state and updates it after computation.
The Compiler’s Transformation: Lambda Lifting
The ConvertAssignReadValueToReturnsAndInputs pass in npu_compiler performs a key transformation:
- ReadValue -> function input parameter (KV cache passed in from outside)
- Assign -> function output value (updated KV cache returned as output)
After transformation, the blob is a pure function: KV cache comes in through the input, goes out through the output after being updated, and the blob itself holds no state.
This transformation has a specific name in compiler theory — lambda lifting: promoting implicitly captured mutable state (the ReadValue/Assign reads/writes to “variables”) into explicit input/output parameters of the function. Unlike SSA (Static Single Assignment), lambda lifting focuses on eliminating implicit dependencies on external mutable state.
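The idea is easy to demonstrate in plain Python. The "before" function hides state in a captured variable (playing the role of ReadValue/Assign); the "after" version is the lifted pure function, which is exactly the shape a blob requires:

```python
# "Before": state hides in a captured variable, like ReadValue/Assign.
def make_stateful_layer():
    kv_cache = []                        # implicit mutable state
    def layer(x):
        kv_cache.append(x)               # the Assign side
        return sum(kv_cache)             # the ReadValue + compute side
    return layer

# "After" lambda lifting: state is an explicit input and an explicit output.
def lifted_layer(x, kv_cache_in):
    kv_cache_out = kv_cache_in + [x]     # former Assign, now a return value
    return sum(kv_cache_out), kv_cache_out

f = make_stateful_layer()
print(f(1), f(2))                        # 1 3  (hidden state between calls)
y1, kv = lifted_layer(1, [])
y2, kv = lifted_layer(2, kv)
print(y1, y2)                            # 1 3  (same results, explicit state)
```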
State Management Moves to the Runtime
The responsibility of “remembering state” shifts from the compiler to NPUW’s ZeroVariableState:
- Holds Level Zero memory (device memory accessible by the NPU), storing the KV cache buffer
- set_state() / get_state(): Reads/writes the KV cache before and after each inference
- reset(): memset(0) zeros the entire buffer when a new conversation begins
The benefit of this separation is that the compiler only needs to handle pure functions, and the runtime only needs to manage memory — each handles its own concern, with no interference.
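A minimal sketch of such a variable-state object, with a NumPy array standing in for the Level Zero device buffer (class and field names here are illustrative, not the actual ZeroVariableState implementation):

```python
import numpy as np

class VariableState:
    def __init__(self, shape, dtype=np.float16):
        self.buffer = np.zeros(shape, dtype=dtype)  # device-memory stand-in

    def get_state(self):
        return self.buffer                          # read KV cache after inference

    def set_state(self, value):
        self.buffer[...] = value                    # write KV cache before inference

    def reset(self):
        self.buffer.fill(0)                         # memset(0) on a new conversation

state = VariableState((1, 32, 1152, 128))           # [batch, heads, capacity, head_dim]
state.reset()                                       # new chat: zero the whole buffer
```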
From Host to NPU: A Single Inference Call
Now that we understand the software stack’s division of labor, let us trace the complete path of a single inference call — from blob loading to result retrieval.
Blob Loading (One-time, Completed at Startup)
- ELF parsing: The ELF Parser in the driver parses the blob file and creates a HostParsedInference (HPI) object
- Memory allocation: Allocates NPU memory with different attributes for different data types:
- Executable code (SHAVE kernels) -> WriteCombineFw memory
- SHAVE data segments -> WriteCombineShave memory
- DMA descriptors -> WriteCombineDma memory
- Static relocation: Patches cross-references inside the blob — replacing relative offsets with actual NPU device addresses
- Metadata extraction: Reads input/output tensor names, shapes, and data types for NPUW to use
NPUW caches compiled blobs to disk via the EXPORT_BLOB mechanism. On subsequent launches, blobs are loaded directly, skipping the time-consuming compilation process.
Per-Inference Path
Every inference (whether prefill or generate) goes through the following steps:
1. Prepare inputs
NPUW prepares four categories of input tensors:
- input_ids: Token ID sequence (padded to 1024 for prefill, only 1 token for generate)
- attention_mask: 0/1 vector marking valid positions
- position_ids: Position encoding indices (corresponding to the number of 1s in the mask)
- past KV cache: The KV cache buffer from the previous round (all zeros for the first prefill)
2. JIT Relocation (applyInputOutput)
The blob does not know the actual address of the KV cache buffer at compile time — this address is determined at runtime. applyInputOutput traverses relocation entries in the blob marked as VPU_SHF_USERINPUT and writes the KV cache buffer’s NPU virtual address into the corresponding positions of DMA task descriptors.
This is like a book whose table of contents has blank page numbers at print time, which are filled in with the actual page numbers after binding.
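A toy model of the patching step; the descriptor fields and entry layout are invented for illustration, while VPU_SHF_USERINPUT is the real section flag mentioned above:

```python
RELOC_ENTRIES = [   # "table of contents with blank page numbers", per the analogy
    {"descriptor": 0, "field": "src_addr", "symbol": "kv_cache_in"},
    {"descriptor": 3, "field": "dst_addr", "symbol": "kv_cache_out"},
]

def apply_input_output(descriptors, runtime_addresses):
    """Patch runtime buffer addresses into DMA task descriptors."""
    for reloc in RELOC_ENTRIES:      # entries flagged VPU_SHF_USERINPUT in the blob
        descriptors[reloc["descriptor"]][reloc["field"]] = \
            runtime_addresses[reloc["symbol"]]

descriptors = [{} for _ in range(4)]
apply_input_output(descriptors, {"kv_cache_in": 0x8000_0000,
                                 "kv_cache_out": 0x8020_0000})
print(hex(descriptors[0]["src_addr"]))   # 0x80000000
```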
3. Submit for Execution
The prepared command list is submitted to the NPU command queue. Under the hood, this is done via DRM_IVPU_CMDQ_SUBMIT ioctl (on Linux) or the Level Zero API.
4. NPU Autonomous Execution
The management core (RISC-V on 40xx) takes over:
- Reads task descriptors one by one
- Checks barrier synchronization conditions (producer/consumer counts)
- Dispatches to DMA / DPU / SHAVE once conditions are met
- Writes the fence value after all tasks complete
5. Host Detects Completion
The host has two wait strategies:
- Interrupt wait (DRM_IVPU_BO_WAIT): CPU sleeps waiting for an interrupt, saving power
- Polling wait (UMONITOR/UMWAIT): CPU actively monitors, achieving lower latency
6. Read Outputs
NPUW reads two categories of outputs:
- logits: The probability distribution over the next token, passed to the genai layer for sampling
- present KV cache: The KV cache produced by this inference; update_kvcache_for() appends the new portion to the past buffer
Mutable Command Lists Optimization
By default, every inference requires recreating the command list. The Level Zero experimental extension ZE_experimental_mutable_command_list (requires Level Zero spec 1.9+) provides an optimization:
- First inference: Creates the complete command list
- Subsequent inferences: Only calls updateMutableCommands() to update the changed tensor pointers
Analogy: On an already-recorded “script,” only a few parameters (tensor addresses) are changed, without re-recording the entire script. This reduces the CPU overhead of command list creation.
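The reuse pattern, reduced to a conceptual Python sketch (this is not the Level Zero C API; the class and method names just mirror the record-once, patch-pointers idea):

```python
class CommandList:
    def __init__(self, tasks):
        self.tasks = tasks                         # recorded once (expensive)

    def update_mutable_commands(self, addresses):
        for task in self.tasks:                    # patch pointers only (cheap)
            task["io_addr"] = addresses[task["tensor"]]

cmdlist = CommandList([{"tensor": "input_ids", "io_addr": None},
                       {"tensor": "logits",    "io_addr": None}])
for step in range(3):                              # the per-token decode loop
    cmdlist.update_mutable_commands({"input_ids": 0x1000 + step,
                                     "logits":    0x2000 + step})
    # the same command list would be resubmitted to the NPU queue here
```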
End-to-End Walkthrough
Let us use a concrete example to connect all the concepts above.
Setup: Input “Hello” (1 token, ID 15496), KV cache total capacity 1152, 2 generate variants compiled (256 and 1152). The model is a 32-layer Transformer.
Step 1: Initialization
NPUW clones the LLM model into two copies:
- Prefill model (input_ids seq_len = 1024)
- Generate model (input_ids seq_len = 1)
npu_compiler compiles 3 blobs: 1 prefill + 2 generate variants (256 and 1152). Each blob undergoes ELF parsing and static relocation.
Step 2: New Conversation Starts
- memset(0) zeros all KV cache buffers
- select_generate_request(1): Needs 1 + 128 = 129 positions -> selects the 256 variant
Step 3: Prefill
Inputs:
- input_ids = [15496, 0, 0, ..., 0] (1 valid token + 1023 padding)
- attention_mask = [1, 0, 0, ..., 0]
The NPU executes the 32-layer Transformer: each layer sequentially performs DMA weights into CMX -> DPU computes QKV projection -> SHAVE executes RoPE -> SHAVE executes SDPA -> DPU computes FFN.
Output: The first token (e.g., ”,”) + KV cache for all 32 layers (present tensors).
Step 4: Prefill to Generate Switchover
copy_kvcache() copies the prefill’s present KV cache into the generate(256) past buffer:
prefill.present[0:1] -> generate_256.past[0:1]
A parallel copy of 64 tensors (32 layers x K + V).
Step 5: Generate Loop
Each iteration:
- input_ids = [0, ..., 0, token] (right-aligned, only the last position contains the real token)
- position_ids = [0, ..., 0, N] (N is the current token's position in the sequence)
- attention_mask: The number of 1s increments (2 ones in round 1, 3 ones in round 2, ...)
The NPU executes the generate blob -> outputs the next token. update_kvcache_for() writes only the newly added row of KV cache.
Step 6: Termination
Upon encountering the EOS token or reaching the 256 capacity limit -> detokenize -> return the complete text.
Summary and Outlook
The core idea of this article can be summarized in one sentence: Turn “dynamic growth” into “moving the write position within a fixed space.”
Behind this deceptively simple idea lies precision-engineered collaboration:
- NPUW splits models, manages buffers, and coordinates KV cache transfer between prefill and generate
- npu_compiler compiles stateful models into pure-function blobs through lambda lifting
- Level Zero / DRM driver provides JIT relocation and task submission mechanisms
- NPU management core executes autonomously according to the task list determined at compile time
Current Limitations
All of these limitations stem from the NPU’s static execution model:
- Fixed KV cache capacity: When capacity is exceeded, the only option is to truncate history and re-prefill; GPU’s PagedAttention has no such limitation
- batch_size = 1: No support for continuous batching; cannot process multiple requests simultaneously
- KV cache transfer overhead: The copy_kvcache from prefill to generate can be on the order of ~512MB (32 layers x 2 x 32 heads x 1024 seq x 128 dim x 2 bytes FP16)
- Multiple compilations: Multiple generate variants = longer cold start times, mitigated by the blob caching mechanism
Next Steps
How exactly do these blobs execute on the NPU hardware? What scheduling decisions does the compiler make to hide memory latency? What implementation paths exist for attention on the NPU? Where are the ceilings of the programming model? These questions are explored in the next article.
Further Reading
- The OpenVINO GenAI Guide provides detailed instructions on using and configuring the StatefulLLMPipeline.
- The ConvertAssignReadValueToReturnsAndInputs pass in the npu_compiler source is the best entry point for understanding the stateful-to-stateless transformation.
- The Level Zero Specification defines the low-level API for NPU device interaction, including command lists, fences, and the mutable command list extension.