Prefill vs Decode Phases
Updated 2026-04-06
Introduction: LLM Inference Is Not a Single Step
When you ask ChatGPT a question, you notice two distinctly different phases:
- A brief wait — the model is processing your entire input (prompt)
- Streaming output — the model begins generating tokens one by one
These two phases correspond to two fundamentally different computation processes in LLM inference: Prefill and Decode. They differ not only in function but also in computational characteristics — one is compute-bound, the other is memory-bound.
Understanding the distinction between these two phases is the foundation for understanding LLM serving performance optimization.
Prefill Phase: Processing the Prompt in Parallel
The Prefill phase processes the user’s complete prompt, preparing for the generation phase.
Workflow
Given a prompt with $n$ tokens, the Prefill phase (a code sketch follows this list):
- Feeds all tokens into the model simultaneously
- For each Transformer layer, computes Query, Key, and Value for all tokens
- Performs full Self-Attention: each token attends to all preceding tokens
- Generates and caches the KV Cache — the subsequent Decode phase will reuse these caches
- Outputs the first generated token
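To make this concrete, here is a minimal NumPy sketch of single-head, single-layer Prefill. The shapes, weights, and the `prefill` function itself are illustrative assumptions for this article, not any framework's actual API:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prefill(X, Wq, Wk, Wv):
    """Process all n prompt tokens at once. X: [n, d] prompt embeddings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # three [n,d] x [d,d] GEMMs
    n, d = X.shape
    scores = (Q @ K.T) / np.sqrt(d)           # [n, n] attention scores
    causal = np.triu(np.full((n, n), -np.inf), k=1)  # each token sees only its past
    out = softmax(scores + causal) @ V        # [n, d] attention output
    return out, (K, V)                        # (K, V) is the KV Cache for Decode

# Toy usage: hidden size d = 8, prompt of n = 5 tokens
rng = np.random.default_rng(0)
d = 8
X = rng.standard_normal((5, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out, kv_cache = prefill(X, Wq, Wk, Wv)
```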
Computational Characteristics
The core computation in Prefill is large-scale matrix multiplication. Taking the QKV projection in one Transformer layer as an example:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Where $X$ has shape $[n, d]$ and each weight matrix has shape $[d, d]$. Taking a single projection (e.g., $Q = XW_Q$) as an example, this is a classic matrix-matrix multiplication (GEMM), with computational cost:

$$\text{FLOPs} \approx 2nd^2$$

The amount of data to load (the weight matrix) is approximately:

$$\text{Bytes} \approx d^2 \times \text{bytes per element}$$

Therefore, the Arithmetic Intensity (floating-point operations per byte of data loaded) is:

$$\text{AI} = \frac{2nd^2}{d^2 \times \text{bytes per element}} = \frac{2n}{\text{bytes per element}}$$

Using FP16 as an example (2 bytes per element), with a prompt length of $n = 1024$:

$$\text{AI} = \frac{2 \times 1024}{2} = 1024 \ \text{FLOPs/Byte}$$
This value is extremely high, far exceeding the compute-to-bandwidth balance point of modern GPUs (roughly 156 FLOPs/Byte for an A100), meaning Prefill is compute-bound: GPU compute power is the bottleneck, not memory bandwidth.
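A quick back-of-the-envelope helper makes the arithmetic concrete (the $d = 4096$ below is just an illustrative hidden size):

```python
def gemm_arithmetic_intensity(n, d, bytes_per_elem=2):
    """AI of an [n, d] x [d, d] projection, counting only weight loads."""
    flops = 2 * n * d * d                    # one multiply + one add per MAC
    weight_bytes = d * d * bytes_per_elem    # FP16 weight matrix
    return flops / weight_bytes              # simplifies to 2n / bytes_per_elem

print(gemm_arithmetic_intensity(n=1024, d=4096))  # -> 1024.0 FLOPs/Byte
```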
Prefill performs matrix-matrix multiplication (GEMM), while Decode performs vector-matrix multiplication (GEMV). The key difference is data reuse:
- GEMM (Prefill): each weight column is reused across all $n$ input rows → high data reuse
- GEMV (Decode): each weight column is used only once → loaded and discarded
Key Characteristics
- All tokens processed in parallel: Fully utilizes GPU parallelism
- Large matrix operations: GEMM operations, which GPUs excel at
- High arithmetic intensity: Computation far exceeds data movement
- High GPU utilization: Can typically achieve high MFU (Model FLOPs Utilization)
Decode Phase: Autoregressive Token-by-Token Generation
After Prefill completes, the model enters the Decode phase, generating new tokens one at a time.
Workflow
Each Decode step (a code sketch follows this list):
- Takes the single token generated in the previous step as input
- Computes Query, Key, and Value for that token
- Appends the new K and V to the KV Cache
- Performs Attention using the new token’s Query against the full KV Cache
- Passes through the FFN layer, outputting a probability distribution over the next token
- Samples the next token, and repeats the process
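Continuing the toy NumPy sketch from the Prefill section, one Decode step looks like the following (again, an illustrative sketch, not a real implementation):

```python
import numpy as np

def decode_step(x, Wq, Wk, Wv, kv_cache):
    """Process ONE new token. x: [d] embedding of the last generated token."""
    q = x @ Wq                                # [d] x [d,d]: a GEMV, not a GEMM
    k_new, v_new = x @ Wk, x @ Wv
    K, V = kv_cache
    K = np.vstack([K, k_new])                 # append the new K, V to the cache
    V = np.vstack([V, v_new])
    d = x.shape[-1]
    # The ENTIRE cache (K and V) must be read from memory on every step:
    scores = (K @ q) / np.sqrt(d)             # [t] dot products against all keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the full history
    out = weights @ V                         # [d] context vector
    return out, (K, V)                        # the cache grows by one entry per step
```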
Computational Characteristics
The core computation in Decode degrades to a vector-matrix multiplication (GEMV). Taking the QKV projection as an example:

$$q = xW_Q, \quad k = xW_K, \quad v = xW_V$$

Where $x$ has shape $[1, d]$ and each weight matrix has shape $[d, d]$. The computational cost is:

$$\text{FLOPs} \approx 2d^2$$

The amount of data to load remains the same, since the entire weight matrix must still be loaded:

$$\text{Bytes} \approx d^2 \times \text{bytes per element}$$

Arithmetic Intensity:

$$\text{AI} = \frac{2d^2}{d^2 \times \text{bytes per element}} = \frac{2}{\text{bytes per element}}$$

Using FP16 as an example (2 bytes per element):

$$\text{AI} = \frac{2}{2} = 1 \ \text{FLOP/Byte}$$
Only 1 FLOP/Byte! Far below the GPU's compute-to-bandwidth balance point, meaning Decode is memory-bound: memory bandwidth is the bottleneck, and most of the GPU's compute power sits idle, waiting for data to be loaded.
Attention Is Also Memory-Bound During Decode
The Attention computation during Decode faces the same issue. The new token's query vector must compute dot products with all keys in the KV Cache, then take a weighted sum over all values. For a cache of $t$ tokens:

$$\text{FLOPs} \approx \underbrace{2td}_{q K^\top} + \underbrace{2td}_{\text{weighted sum of } V} = 4td$$

The KV Cache that must be loaded from GPU memory (K and V, FP16):

$$\text{Bytes} \approx 2 \times t \times d \times 2 = 4td$$

Arithmetic Intensity:

$$\text{AI} \approx \frac{4td}{4td} = 1 \ \text{FLOP/Byte}$$
Extremely low arithmetic intensity. As sequence length grows, the KV Cache grows linearly, and loading overhead increases linearly as well.
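In the same back-of-the-envelope style as before, now for the Decode-step Attention (cache length $t$ and hidden size $d$ below are illustrative):

```python
def decode_attention_ai(t, d, bytes_per_elem=2):
    """AI of one Decode step's attention over a KV Cache of t tokens."""
    flops = 2 * t * d + 2 * t * d             # q.K dot products + weighted V sum
    kv_bytes = 2 * t * d * bytes_per_elem     # load K and V (FP16)
    return flops / kv_bytes                   # = 1, no matter how large t gets

print(decode_attention_ai(t=4096, d=4096))    # -> 1.0 FLOPs/Byte
```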
Key Characteristics
- Only 1 token processed per step: Cannot leverage GPU’s massive parallelism
- Vector-matrix operations: GEMV operations
- Low arithmetic intensity: Data movement is on the same order as computation
- Extremely low GPU utilization: Most time is spent waiting for memory reads
Comparison: Computation Flow of the Two Phases
Placed side by side, the core differences between the two phases are:
- Prefill feeds all prompt tokens into the model in parallel, performing large-scale matrix multiplication — a compute-bound operation
- Decode processes only one token per step, performing vector-matrix multiplication and reading the full KV Cache — a memory-bound operation
Compute-bound vs Memory-bound: Arithmetic Intensity Analysis
Roofline Model
To understand why Prefill and Decode have different bottlenecks, we need the Roofline Model. The Roofline Model describes the relationship between two key hardware parameters:
- Peak compute (FLOPs/s): Maximum floating-point operations per second the GPU can execute
- Peak bandwidth (Bytes/s): Maximum data the GPU memory can transfer per second
Their ratio defines the compute-bandwidth balance point:

$$\text{AI}_{\text{balance}} = \frac{\text{Peak FLOPs/s}}{\text{Peak Bytes/s}}$$

For NVIDIA A100 (SXM, FP16 Tensor Core):

$$\text{AI}_{\text{balance}} = \frac{312 \ \text{TFLOPs/s}}{2.0 \ \text{TB/s}} \approx 156 \ \text{FLOPs/Byte}$$
Decision rule:
- If an operation's Arithmetic Intensity satisfies $\text{AI} > \text{AI}_{\text{balance}}$, the operation is compute-bound
- If $\text{AI} < \text{AI}_{\text{balance}}$, the operation is memory-bound
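A sketch of this decision rule with A100-class numbers (the peak figures below are the commonly quoted datasheet values; treat them as approximations):

```python
PEAK_FLOPS = 312e12   # A100 FP16 Tensor Core, dense, FLOPs/s
PEAK_BW    = 2.0e12   # A100 SXM HBM bandwidth, bytes/s
AI_BALANCE = PEAK_FLOPS / PEAK_BW            # ~156 FLOPs/Byte

def bottleneck(ai: float) -> str:
    return "compute-bound" if ai > AI_BALANCE else "memory-bound"

print(bottleneck(1024))  # Prefill linear layer (n=1024) -> compute-bound
print(bottleneck(1))     # Decode linear layer           -> memory-bound
```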
Arithmetic Intensity Comparison Between the Two Phases
| Metric | Prefill ($n$ tokens) | Decode (1 token) |
|---|---|---|
| Linear layer FLOPs | $2nd^2$ | $2d^2$ |
| Linear layer Bytes | $d^2 \times 2$ (FP16) | $d^2 \times 2$ (FP16) |
| AI (linear layer, FP16) | $\approx n$ | $\approx 1$ |
| Attention FLOPs | $O(n^2 d)$ | $O(td)$ |
| Attention Bytes | $O(nd)$ | $O(td)$ |
| AI (Attention, FP16) | $O(n)$ (high) | $\approx 1$ (low) |
| Bottleneck type | Compute-bound | Memory-bound |
Using the A100 as an example ($\text{AI}_{\text{balance}} \approx 156$ FLOPs/Byte), with prompt length $n = 1024$:
- Prefill: $\text{AI} \approx 1024 \gg 156$, compute-bound. GPU compute is fully utilized
- Decode: $\text{AI} \approx 1 \ll 156$, memory-bound. The GPU utilizes only about $1/156 \approx 0.6\%$ of peak compute
This is why Decode is so inefficient — the vast majority of the GPU’s compute power sits idle, waiting for data to be transferred from GPU memory.
Impact of Batch Size
Increasing batch size can improve Decode's arithmetic intensity. When performing Decode for $B$ requests simultaneously:

$$\text{AI} = \frac{2Bd^2}{d^2 \times \text{bytes per element}} = \frac{2B}{\text{bytes per element}} = B \ \text{(FP16)}$$

The weight matrix only needs to be loaded once, but computation is performed for each of the $B$ requests, multiplying the compute by $B$. When $B > \text{AI}_{\text{balance}}$ (i.e., $B \gtrsim 156$ on A100), Decode can also become compute-bound.
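A quick sweep over batch size makes the transition visible (reusing the ~156 FLOPs/Byte A100 balance point from the Roofline section):

```python
AI_BALANCE = 156   # A100 FP16 balance point (see the Roofline section)

def decode_linear_ai(batch_size, bytes_per_elem=2):
    """Weights are loaded once, then reused by every request in the batch."""
    return 2 * batch_size / bytes_per_elem   # = B for FP16

for B in (1, 8, 64, 128, 256):
    ai = decode_linear_ai(B)
    kind = "compute-bound" if ai > AI_BALANCE else "memory-bound"
    print(f"batch={B:4d}  AI={ai:6.1f} FLOPs/Byte  -> {kind}")
```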
However, there are two practical limitations:
- KV Cache memory: Each request’s KV Cache consumes significant GPU memory, limiting the achievable batch size
- Latency constraints: A batch that’s too large increases latency for individual requests
This is precisely why KV Cache compression techniques like GQA/MQA are important — by reducing KV Cache size, they enable larger batch sizes, improving Decode phase efficiency.
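To see why, a rough per-request KV Cache size calculation helps. The layer count and head dimension below are LLaMA-7B-like assumed values, and the GQA-8 variant at this model size is hypothetical, purely for comparison:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-request KV Cache: K and V for every layer and every cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

GiB = 2**30
mha = kv_cache_bytes(32, 32, 128, seq_len=4096)  # full multi-head attention
gqa = kv_cache_bytes(32,  8, 128, seq_len=4096)  # GQA with 8 KV heads
print(f"MHA: {mha/GiB:.1f} GiB/request, GQA-8: {gqa/GiB:.1f} GiB/request")
# MHA: 2.0 GiB/request, GQA-8: 0.5 GiB/request -> ~4x more requests fit in memory
```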
Real-World Performance Impact: TTFT vs TPS
The two phases correspond to different user-facing metrics:
TTFT — Time To First Token
Definition: Time from when the user sends a request to when the first generated token is received.
TTFT is primarily determined by the Prefill phase. Influencing factors:
- Prompt length: Longer prompts mean more Prefill computation and higher TTFT
- GPU compute power: Prefill is compute-bound, so a faster GPU directly reduces TTFT
- Prefill computation scales approximately linearly with prompt length (the Attention portion is quadratic, but typically FFN dominates)
TPS — Tokens Per Second
Definition: Number of tokens generated per second during the Decode phase.
TPS is determined by the Decode phase. Influencing factors:
- Memory bandwidth: Decode is memory-bound, so higher bandwidth means faster TPS
- Model size: More parameters mean more weights to load per step
- KV Cache size: Longer sequences mean more data loaded during the Attention step
Numerical Estimates
Using LLaMA-2 7B (about $7 \times 10^9$ parameters, ~14 GB in FP16) on an A100 as an example:

Decode TPS estimate (memory-bound, ignoring KV Cache):

$$\text{TPS} \approx \frac{\text{Memory bandwidth}}{\text{Model size}} = \frac{2.0 \ \text{TB/s}}{14 \ \text{GB}} \approx 143 \ \text{tokens/s}$$
In practice, due to KV Cache loading, kernel launch overhead, and other factors, real-world values are typically around 100-130 tokens/s (batch size = 1).
Prefill speed estimate (compute-bound, roughly 2 FLOPs per parameter per token, i.e., ~14 GFLOPs/token):

$$\text{Throughput} \approx \frac{312 \ \text{TFLOPs/s}}{14 \ \text{GFLOPs/token}} \approx 22{,}000 \ \text{tokens/s}$$
This means Prefill processes the prompt at a throughput over 100x higher than Decode — which is why you experience “a brief wait followed by fast streaming” rather than “uniformly slow output.”
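The same estimates in code form (datasheet-level approximations that ignore MFU, KV Cache loading, and other overheads):

```python
PARAMS      = 7e9         # LLaMA-2 7B parameter count
MODEL_BYTES = PARAMS * 2  # FP16: ~14 GB of weights
BW          = 2.0e12      # A100 HBM bandwidth, bytes/s
FLOPS       = 312e12      # A100 FP16 Tensor Core peak, FLOPs/s

# Decode ceiling: every step must stream all weights from HBM
decode_tps = BW / MODEL_BYTES              # ~143 tokens/s
# Prefill ceiling: ~2 FLOPs per parameter per token, compute-limited
prefill_tps = FLOPS / (2 * PARAMS)         # ~22,000 tokens/s

print(f"Decode ceiling:  {decode_tps:,.0f} tokens/s")
print(f"Prefill ceiling: {prefill_tps:,.0f} tokens/s")  # >100x Decode
```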
Optimization Directions
Given the different computational characteristics of the two phases, the industry has developed different optimization strategies:
Prefill Optimization
- Flash Attention: Although Prefill is overall compute-bound, the standard Attention implementation repeatedly writes/reads the intermediate $n \times n$ score matrix to/from HBM, causing unnecessary memory traffic. Flash Attention performs softmax and matrix multiplication in tiled blocks within SRAM, avoiding materializing the score matrix and reducing Attention HBM accesses from $O(n^2)$ to $O(n)$
- Tensor Parallelism: Distributes matrix operations across multiple GPUs, increasing throughput for compute-bound operations
- Quantization: Uses INT8/FP8 lower precision to achieve higher effective compute on the same hardware
Decode Optimization
- KV Cache Compression: GQA and MQA reduce KV Cache size, lowering memory bandwidth requirements
- Speculative Decoding: Uses a small model to quickly "guess" multiple tokens, then verifies them with the large model in a single pass, merging multiple Decode steps into one Prefill-like parallel verification (a sketch follows this list)
- Continuous Batching: Dynamically assembles batches to improve GPU utilization
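A minimal sketch of the greedy variant of speculative decoding. The `draft_model` and `target_model` callables and their return shape are hypothetical stand-ins for this illustration, not any real library's API:

```python
import numpy as np

def speculative_step(draft_model, target_model, prefix, k=4):
    """Draft k tokens cheaply, then verify them in ONE target-model pass.
    Both models are assumed to map a non-empty token list to per-position
    next-token distributions, shape [len(tokens), vocab]."""
    # 1) The small draft model guesses k tokens autoregressively (cheap)
    seq = list(prefix)
    for _ in range(k):
        seq.append(int(np.argmax(draft_model(seq)[-1])))
    drafted = seq[len(prefix):]
    # 2) The target model scores the whole drafted run in a single parallel
    #    pass: this is the Prefill-like, compute-bound verification
    dists = target_model(seq)
    # 3) Greedy acceptance: keep drafted tokens while the target agrees
    accepted = []
    for i, tok in enumerate(drafted):
        target_choice = int(np.argmax(dists[len(prefix) + i - 1]))
        if target_choice != tok:
            accepted.append(target_choice)   # substitute the target's token
            break
        accepted.append(tok)
    return accepted   # one target pass yields up to k accepted tokens
```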
Hybrid Optimization
- Chunked Prefill: Splits long prompts into chunks and interleaves Decode steps in between, preventing long-prompt Prefill from blocking other requests’ Decode operations
- Disaggregated Serving: Deploys Prefill and Decode on different hardware — compute-intensive GPUs for Prefill, high-bandwidth devices for Decode
Recommended Learning Resources
If you want to dive deeper into LLM inference optimization, here are our curated resources:
Classic Papers
- Kwon et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention” — The core vLLM paper (SOSP 2023), proposing OS virtual memory-like management for KV Cache with near-zero waste. Essential reading for understanding modern LLM serving memory management.
- Lilian Weng “Large Transformer Model Inference Optimization” — A systematic survey of Transformer inference optimization methods, covering distillation, quantization, pruning, sparsification, and architecture optimization. One of the best overviews in the inference optimization field.
Blog Posts and Tutorials
- kipply “Transformer Inference Arithmetic” — A classic blog post on inference performance analysis. Derives LLM inference latency from first principles, covering KV Cache mechanics, model parallelism, batch size impact, FLOPS calculations, and how to determine memory bandwidth bound vs compute bound.
- Patrick von Platen / Hugging Face “Optimizing your LLM in production” — Covers low-precision inference (8-bit/4-bit), Flash Attention, KV Cache optimization (MQA/GQA), positional encodings, and includes detailed memory calculations and speedup data. Highly practical.
- Anyscale “How continuous batching enables 23x throughput in LLM inference” — Explains the continuous batching mechanism in detail, comparing throughput differences with static batching (up to 23x), benchmarking HF TGI, vLLM, Ray Serve, and other frameworks.
- Finbarr Timbers “How is LLaMa.cpp possible?” — Analyzes why LLM inference on consumer hardware is feasible. Uses mathematical derivation to show that memory bandwidth is the bottleneck and how quantization dramatically reduces memory requirements. Includes performance calculations across different devices (A100/M1/M2).
Summary
| Concept | Description |
|---|---|
| Prefill Phase | Processes the full prompt in parallel, generates KV Cache, compute-bound |
| Decode Phase | Autoregressive token-by-token generation, reads KV Cache, memory-bound |
| Arithmetic Intensity | Prefill: $\approx n$ FLOPs/Byte (high) vs Decode: $\approx 1$ FLOP/Byte (low) |
| Roofline Model | $\text{AI} > \text{AI}_{\text{balance}}$ means compute-bound, otherwise memory-bound |
| TTFT | Time To First Token, determined by Prefill |
| TPS | Tokens Per Second, determined by Decode |
| Core Tension | Decode’s AI is far below the hardware balance point, severely wasting GPU compute |
Core Intuition: The two phases of LLM inference are like “preparing food” and “serving dishes.” Prefill is like a chef processing all ingredients simultaneously (parallel, compute-intensive) — speed depends on the chef’s knife skills (GPU compute). Decode is like a waiter serving dishes one at a time (sequential, bandwidth-intensive) — speed depends on the conveyor belt from kitchen to table (memory bandwidth). Understanding this distinction is the starting point for understanding all LLM inference optimization techniques.