PagedAttention and Continuous Batching

Updated 2026-04-06

The Memory Dilemma of KV Cache

In KV Cache Fundamentals, we learned that during inference, the Key and Value tensors at each layer must be cached to avoid redundant computation. However, the traditional implementation has a serious problem: it pre-allocates a contiguous region sized for max_seq_len tokens for every request.

Suppose max_seq_len = 2048, but user requests only use 512 tokens on average. Then 75% of the reserved KV Cache memory is never used. Worse yet, when the KV Caches of multiple requests are laid out contiguously in GPU memory, the memory freed by completed requests leaves gaps (external fragmentation), potentially preventing new requests from starting because no sufficiently large contiguous block can be found.

[Interactive figure: memory allocation, pre-allocated vs PagedAttention. Dragging the sequence-length slider shows that at 512 of 2048 tokens, pre-allocation wastes 75% of the reserved memory, while PagedAttention with 16-token blocks uses 32 blocks and wastes roughly 0%, saving about 75% of the memory.]
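To make the waste concrete, here is a back-of-the-envelope sizing sketch in Python. The model dimensions (32 layers, 32 KV heads, head dimension 128, fp16) are illustrative assumptions, not figures from the article:

```python
# Back-of-the-envelope KV Cache sizing (illustrative 7B-class dimensions, fp16).
layers, kv_heads, head_dim, bytes_per_elem = 32, 32, 128, 2

# Every token stores one Key and one Value vector per layer.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(f"KV per token: {kv_bytes_per_token / 1024:.0f} KiB")          # 512 KiB

max_seq_len, actual_len = 2048, 512
reserved = max_seq_len * kv_bytes_per_token   # contiguous pre-allocation
used = actual_len * kv_bytes_per_token        # what the request actually fills
print(f"reserved {reserved / 2**20:.0f} MiB, used {used / 2**20:.0f} MiB, "
      f"wasted {1 - used / reserved:.0%}")    # reserved 1024 MiB, used 256 MiB, wasted 75%
```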

Empirical data from Kwon et al. (2023) shows that in real serving scenarios the KV Cache accounts for a large share of GPU memory, yet existing systems waste 60-80% of it on pre-allocation and fragmentation.

Inspiration from Operating Systems

This problem closely resembles memory management in operating systems. Early operating systems allocated contiguous physical memory for each process, leading to the same fragmentation issues. The solution was virtual memory + paging: each process sees a contiguous virtual address space, while the OS maps virtual pages to scattered physical page frames behind the scenes.

PagedAttention brings this idea to KV Cache management:

OS Concept → PagedAttention Equivalent
Virtual page → Logical block (KV for a group of contiguous tokens)
Physical page frame → Physical block (fixed-size region in GPU memory)
Page table → Block Table (maps logical blocks to physical blocks)
On-demand allocation → New physical blocks allocated only when new tokens arrive
Page size → Block size (typically 16 tokens)

Core Mechanism of PagedAttention

Each request’s KV Cache is divided into logical blocks, each storing Key/Value for a fixed number of tokens (e.g., 16). Logical blocks are mapped to physical blocks in GPU memory via a Block Table, and physical blocks need not be contiguous.

[Interactive figure: Block Table mapping from logical to physical blocks. The sequence "The cat sat on the mat and slept for hours" is split into three logical blocks that map to scattered physical blocks (L0 → P3, L1 → P7, L2 → P1); physical blocks in GPU memory need not be contiguous, just like OS virtual memory paging.]
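The lookup itself is simple arithmetic. The sketch below uses hypothetical helper names (vLLM performs this translation inside its attention kernels) to show how a token's logical position is turned into a physical block id plus an offset within that block:

```python
# Sketch: translating a token's logical position into its physical KV slot.
# Hypothetical helper; vLLM does this translation inside its CUDA kernels.
BLOCK_SIZE = 16

def physical_slot(block_table: list[int], token_pos: int) -> tuple[int, int]:
    """Return (physical_block_id, offset_within_block) for a token position."""
    return block_table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

# A request whose three logical blocks landed on scattered physical blocks.
block_table = [3, 7, 1]
print(physical_slot(block_table, 0))    # (3, 0)  first token, logical block 0
print(physical_slot(block_table, 17))   # (7, 1)  second slot of logical block 1
print(physical_slot(block_table, 35))   # (1, 3)  fourth slot of logical block 2
```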

Key design choices:

  1. On-demand allocation: Instead of pre-allocating max_seq_len, the system checks whether the current block has room each time a new token is generated. A new physical block is allocated only when the current one is full (a minimal allocator sketch follows the figure below)
  2. No external fragmentation: Physical blocks are fixed-size and uniform, so any free block can be used by any request
  3. Minimal internal fragmentation: Waste occurs only in the last block of each request (the unfilled portion), averaging < block_size / 2 tokens
[Interactive figure: memory fragmentation comparison across 4/16/64 concurrent requests. At the default setting, contiguous pre-allocation shows roughly 40% internal and 30% external fragmentation, about 70% total waste; PagedAttention shows roughly 3% internal and 0% external fragmentation, since its only waste is the unfilled tail of each request's last block.]
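A minimal sketch of this bookkeeping, using hypothetical class and method names rather than vLLM's actual API, might look like this:

```python
# Minimal sketch of block-table bookkeeping (hypothetical names, not vLLM's API).
BLOCK_SIZE = 16  # tokens per physical block

class BlockManager:
    def __init__(self, num_physical_blocks: int):
        # Free-list of physical block ids; any free block can serve any request.
        self.free_blocks = list(range(num_physical_blocks))
        # request id -> block table (logical block index -> physical block id)
        self.block_tables: dict[int, list[int]] = {}

    def add_request(self, request_id: int) -> None:
        self.block_tables[request_id] = []

    def append_token(self, request_id: int, num_tokens_so_far: int) -> None:
        # Allocate a new physical block only when the previous one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.block_tables[request_id].append(self.free_blocks.pop())

    def free_request(self, request_id: int) -> None:
        # Completed requests return their blocks to the free list immediately.
        self.free_blocks.extend(self.block_tables.pop(request_id))
```

Internal fragmentation is confined to the last entry of each block table: only the final block can be partially filled, so the per-request waste is bounded by BLOCK_SIZE - 1 token slots.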

Copy-on-Write

In beam search and parallel sampling scenarios, multiple candidate sequences share the same prefix. PagedAttention uses Copy-on-Write (CoW) optimization: multiple sequences’ Block Tables can point to the same physical block, and the block is copied only when a sequence needs to modify its contents.

Each physical block maintains a ref_count (reference count). A write to a block whose ref_count > 1 triggers a copy. This is the same idea as the fork() + CoW mechanism in Linux.

[Interactive figure: prompt phase with physical block sharing. Beam 1 and Beam 2 initially point to the same physical blocks 0 and 1, each with ref_count = 2.]
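A toy illustration of the reference-counting logic (purely illustrative, not vLLM's actual code; the real copy happens on GPU memory inside the serving engine):

```python
# Toy illustration of copy-on-write with reference counts (not vLLM's actual code).
import itertools

class PhysicalBlock:
    _ids = itertools.count()

    def __init__(self):
        self.block_id = next(PhysicalBlock._ids)
        self.ref_count = 1             # owned by the sequence that allocated it

def fork(parent_table: list[PhysicalBlock]) -> list[PhysicalBlock]:
    """Fork a beam: the child shares every parent block; only ref_counts change."""
    for block in parent_table:
        block.ref_count += 1
    return list(parent_table)          # shallow copy of the block table

def prepare_write(table: list[PhysicalBlock]) -> None:
    """Copy-on-write: duplicate the last block before writing if it is shared."""
    block = table[-1]
    if block.ref_count > 1:
        block.ref_count -= 1
        table[-1] = PhysicalBlock()    # private copy; the K/V data would be copied on GPU

# Two beams share a two-block prefix; the first divergent write triggers one copy.
beam1 = [PhysicalBlock(), PhysicalBlock()]   # blocks 0 and 1
beam2 = fork(beam1)
prepare_write(beam2)
print([b.block_id for b in beam1], [b.block_id for b in beam2])   # [0, 1] [0, 2]
```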

Continuous Batching

With PagedAttention efficiently managing memory, the next step is optimizing request scheduling.

Traditional Static Batching (the approach before Orca): a batch of requests is grouped together, and the next batch is processed only after all requests in the current batch have completed. Short requests are held back by long ones, leaving the GPU largely idle during the wait.
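For contrast, here is a static-batching sketch in Python. It is illustrative only, with requests reduced to "decode steps remaining": the batch only refills once every member has finished, so short requests block their slots while the longest one drains.

```python
# Static batching sketch (illustrative): requests are just "decode steps remaining".
from collections import deque

def static_batching(waiting: deque[int], max_slots: int = 3) -> int:
    """Return the number of GPU iterations needed to drain the queue."""
    iterations = 0
    while waiting:
        batch = [waiting.popleft() for _ in range(min(max_slots, len(waiting)))]
        # The whole batch runs until its LONGEST member finishes; short requests
        # finish early but their slots stay blocked until the batch drains.
        iterations += max(batch)
        # Only now are the freed slots given to the next batch.
    return iterations

print(static_batching(deque([3, 6, 2, 4, 3])))   # 6 + 4 = 10 iterations
```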

Continuous Batching (iteration-level scheduling, proposed by Orca): after each decode iteration, completed requests immediately release their slots, and new requests from the waiting queue immediately fill in. No idle waiting — GPU utilization is maximized.

[Interactive figure: continuous batching timeline over 3 GPU slots, showing prefill and decode phases. When requests A and C complete, their slots are immediately filled by newly arrived requests D and E, so the GPU never waits idle.]
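The scheduling loop itself can be sketched in a few lines. Everything below (the Request class, the serve function, the 3-slot limit) is a simplified assumption for illustration; a real scheduler also has to respect KV Cache memory limits, prefill scheduling, and preemption:

```python
# Sketch of iteration-level (continuous) batching; simplified, ignores KV-memory limits.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    tokens_left: int               # decode steps remaining

    def decode_one_token(self) -> None:
        self.tokens_left -= 1

    @property
    def finished(self) -> bool:
        return self.tokens_left == 0

def serve(waiting: deque[Request], max_slots: int = 3) -> None:
    running: list[Request] = []
    step = 0
    while waiting or running:
        # Admit new requests whenever a slot is free; never wait for the whole batch.
        while waiting and len(running) < max_slots:
            running.append(waiting.popleft())
        for req in running:        # one decode iteration across the current batch
            req.decode_one_token()
        done = [r.name for r in running if r.finished]
        running = [r for r in running if not r.finished]
        step += 1
        if done:
            print(f"step {step}: {done} finished, slots freed immediately")

serve(deque([Request("A", 3), Request("B", 6), Request("C", 2),
             Request("D", 4), Request("E", 3)]))
```

On the same toy workload as the static sketch above (requests needing 3, 6, 2, 4, and 3 decode steps, with 3 slots), this loop finishes in 6 iterations instead of 10.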

Performance Analysis

Throughput differences across three batching strategies at varying concurrency levels:

[Interactive figure: throughput in req/s versus concurrency (1 to 64 requests) for static, dynamic, and continuous batching.]

Key takeaways:

  • Low concurrency (1-4 requests): minimal difference between the three strategies, since the GPU is not fully utilized anyway
  • High concurrency (16+ requests): the advantage of continuous batching grows sharply, because it avoids long-tail requests blocking the entire batch
  • Empirical results: with PagedAttention + continuous batching, vLLM achieves 2-4x higher throughput than HuggingFace Text Generation Inference, and up to 24x higher than naive generation with the HuggingFace transformers library

Summary

PagedAttention solves the memory waste problem of KV Cache through the “virtual memory paging” concept, while Continuous Batching eliminates GPU idleness between requests through “iteration-level scheduling.” Together, they enable vLLM to serve more requests simultaneously with higher GPU efficiency — which is why vLLM has become the de facto standard for cloud-based LLM serving.

Further Reading