PagedAttention and Continuous Batching
Updated 2026-04-06
The Memory Dilemma of KV Cache
In KV Cache Fundamentals, we learned that during inference, the Key and Value tensors at each layer must be cached to avoid redundant computation. However, the traditional implementation has a serious problem: for every request it pre-allocates a contiguous region sized for max_seq_len tokens.
Suppose max_seq_len = 2048, but user requests only use 512 tokens on average. This means 75% of GPU memory is pre-allocated but never used. Worse yet, when KV Caches from multiple requests are allocated contiguously in GPU memory, the memory freed by completed requests creates fragmentation (external fragmentation), potentially preventing new requests from starting because no sufficiently large contiguous block can be found.
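The waste is easy to quantify. The sketch below uses illustrative figures for a Llama-2-7B-like model (32 layers, 32 KV heads, head dimension 128, fp16); the exact numbers are assumptions, but the 75% figure follows directly from the 2048-vs-512 ratio above.

```python
# Back-of-envelope KV cache sizing (model dimensions are illustrative
# assumptions for a Llama-2-7B-like configuration).
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_elem = 2  # fp16

# Key + Value, per token, summed across all layers.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token)  # 524288 bytes = 0.5 MiB per token

max_seq_len = 2048
avg_used = 512
prealloc_mib = max_seq_len * kv_bytes_per_token / 2**20  # reserved per request
wasted_frac = 1 - avg_used / max_seq_len
print(f"{prealloc_mib:.0f} MiB reserved per request, {wasted_frac:.0%} never used")
# 1024 MiB reserved per request, 75% never used
```

At 0.5 MiB per token, each request reserves a full gigabyte of KV Cache up front even though, on average, three quarters of it is never touched.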
Measurements by Kwon et al. (2023) show that in existing serving systems, only 20-40% of KV Cache memory actually stores token states — 60-80% is wasted on pre-allocation and fragmentation.
Inspiration from Operating Systems
This problem closely resembles memory management in operating systems. Early operating systems allocated contiguous physical memory for each process, leading to the same fragmentation issues. The solution was virtual memory + paging: each process sees a contiguous virtual address space, while the OS maps virtual pages to scattered physical page frames behind the scenes.
PagedAttention brings this idea to KV Cache management:
| OS Concept | PagedAttention Equivalent |
|---|---|
| Virtual page | Logical block (KV for a group of contiguous tokens) |
| Physical page frame | Physical block (fixed-size region in GPU memory) |
| Page table | Block Table (logical block → physical block mapping) |
| On-demand allocation | New physical blocks allocated only when new tokens arrive |
| Page size | Block size (typically 16 tokens) |
Core Mechanism of PagedAttention
Each request’s KV Cache is divided into logical blocks, each storing Key/Value for a fixed number of tokens (e.g., 16). Logical blocks are mapped to physical blocks in GPU memory via a Block Table, and physical blocks need not be contiguous.
Key design choices:
- On-demand allocation: Instead of pre-allocating max_seq_len, the system checks whether the current block has room each time a new token is generated. A new physical block is allocated only when the current one is full
- No external fragmentation: Physical blocks are fixed-size and uniform, so any free block can be used by any request
- Minimal internal fragmentation: Waste occurs only in the last block of each request (the unfilled portion), averaging < block_size / 2 tokens
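The bookkeeping behind these design choices can be sketched in a few lines. This is an illustrative toy, not vLLM's actual block manager (which also handles swapping, CoW, and prefix sharing); all names here are hypothetical.

```python
# Minimal sketch of paged KV cache bookkeeping (illustrative only).
BLOCK_SIZE = 16  # tokens per physical block

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # request_id -> list of physical block ids
        self.num_tokens = {}    # request_id -> tokens written so far

    def append_token(self, request_id):
        """Reserve space for one new token; allocate a block only on demand."""
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        if n % BLOCK_SIZE == 0:  # current block full (or no block yet)
            table.append(self.free_blocks.pop())  # any free block will do
        self.num_tokens[request_id] = n + 1

    def release(self, request_id):
        """A completed request returns all of its blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

mgr = BlockManager(num_physical_blocks=8)
for _ in range(17):  # 17 tokens span two blocks: 16 + 1
    mgr.append_token("req-0")
print(len(mgr.block_tables["req-0"]))  # 2
```

Because any free block satisfies any allocation, a request's blocks can be scattered across GPU memory, and freeing them never strands an unusable hole — exactly the property that eliminates external fragmentation.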
Copy-on-Write
In beam search and parallel sampling scenarios, multiple candidate sequences share the same prefix. PagedAttention uses Copy-on-Write (CoW) optimization: multiple sequences’ Block Tables can point to the same physical block, and the block is copied only when a sequence needs to modify its contents.
Each physical block maintains a ref_count (reference count); a write to a block with ref_count > 1 triggers a copy first. This mirrors the fork() + CoW mechanism in Linux.
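The reference-counting logic can be sketched as follows — a simplified illustration of the idea, not vLLM's implementation (which copies the block's actual KV data; here the data copy is omitted).

```python
# Sketch of copy-on-write block sharing via reference counts (illustrative).
class PhysicalBlock:
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 1

def fork(block):
    """A forked sequence shares the parent's block: just bump ref_count."""
    block.ref_count += 1
    return block

def write(block, allocate_block):
    """Before writing, copy the block if another sequence still references it."""
    if block.ref_count > 1:
        block.ref_count -= 1      # detach from the shared copy
        block = allocate_block()  # private copy (KV data copy omitted here)
    return block

shared = PhysicalBlock(0)
child = fork(shared)                            # beam fork: ref_count == 2
child = write(child, lambda: PhysicalBlock(1))  # write triggers the copy
print(shared.ref_count, child.block_id)         # 1 1
```

Until a divergent token is actually written, the beam candidates pay nothing for sharing the prefix — the copy cost is deferred to the moment it becomes unavoidable.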
Continuous Batching
With PagedAttention efficiently managing memory, the next step is optimizing request scheduling.
Traditional Static Batching (the approach before Orca): a batch of requests is grouped together, and the next batch is processed only after all requests in the current batch have completed. Short requests are held back by long ones, leaving the GPU largely idle during the wait.
Continuous Batching (iteration-level scheduling, proposed by Orca): after each decode iteration, completed requests immediately release their slots, and new requests from the waiting queue immediately fill in. No idle waiting — GPU utilization is maximized.
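The difference between the two policies comes down to when slots are reassigned. The toy scheduler below illustrates iteration-level scheduling; it is a sketch of the idea, not Orca's or vLLM's actual scheduler, and models each request simply as a count of tokens left to generate.

```python
# Toy iteration-level scheduler (illustrative sketch of continuous batching).
from collections import deque

def continuous_batching(requests, max_batch_size):
    """requests: list of (request_id, tokens_to_generate). Returns total steps."""
    waiting = deque(requests)
    running = {}  # request_id -> tokens still to generate
    steps = 0
    while waiting or running:
        # Admit new requests the moment slots free up -- no waiting for
        # the whole batch to drain, unlike static batching.
        while waiting and len(running) < max_batch_size:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decode iteration across the current batch.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # finished: its slot frees immediately
        steps += 1
    return steps

# Short requests no longer wait for the long one to finish.
print(continuous_batching([("long", 8), ("a", 2), ("b", 2), ("c", 2)],
                          max_batch_size=2))  # 8
```

With batch size 2, static batching would need 10 iterations here (the first batch runs max(8, 2) = 8 steps before b and c can even start); the iteration-level scheduler finishes in 8, because each short request slips into the slot the previous one vacates.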
Performance Analysis
Comparing throughput across batching strategies at varying concurrency levels:
Key takeaways:
- Low concurrency (1-4 requests): the strategies differ little, since the GPU is not fully utilized anyway
- High concurrency (16+ requests): continuous batching's advantage widens sharply, since long-tail requests no longer block the entire batch
- Empirical results: with PagedAttention + continuous batching, vLLM reports 2-4x higher throughput than HuggingFace Text Generation Inference, and up to 24x higher than naive HuggingFace Transformers serving
Summary
PagedAttention solves the memory waste problem of KV Cache through the “virtual memory paging” concept, while Continuous Batching eliminates GPU idleness between requests through “iteration-level scheduling.” Together, they enable vLLM to serve more requests simultaneously with higher GPU efficiency — which is why vLLM has become the de facto standard for cloud-based LLM serving.
Further Reading
- Want to learn about scheduling strategies and preemption mechanisms? Read Scheduling and Preemption
- Want to learn about prefix caching optimizations? Read Prefix Caching and RadixAttention