Prefix Caching and RadixAttention

Updated 2026-04-06

The Motivation for Prefix Reuse

With PagedAttention, we solved the memory management problem of the KV Cache. But an often overlooked performance bottleneck remains: redundant computation.

Consider multi-turn conversations: each turn must process the full system prompt + all historical messages + the new message. As conversation turns increase, redundant prefill computation grows linearly — by the 10th turn, 90% of the tokens may have already been computed in the previous 9 turns.

[Figure: prefill compute for a multi-turn input of 250 tokens (system prompt 50 + conversation history 180 + new message 20). Without caching, all 250 tokens must be prefilled; with prefix caching, the system prompt and history are cached and skipped, leaving only the 20 new tokens to compute.]

Similar situations are common in few-shot prompting (all requests share the same examples), self-consistency (the same prompt sampled multiple times in parallel), and tree-of-thought (sharing reasoning prefixes).

The core idea is straightforward: if two requests share the same prefix, don’t discard the already-computed KV Cache — cache it for reuse by subsequent requests.

vLLM’s Automatic Prefix Caching

vLLM’s approach is elegant and simple — borrowing from content-addressable storage in file systems:

Blocking
Step 1: token sequence blocking (block_size = 4): Block 0 = [You, are, a, helpful], Block 1 = [assistant, ., What, is], Block 2 = [Paged, Attention, ?]

Specifically:

  1. The token sequence is divided into blocks of block_size tokens (typically 16)
  2. A hash (e.g. SHA-256) is computed for each block, chained on its predecessor's hash: hash(block_i) = SHA256(hash(block_{i-1}), tokens of block i)
  3. A global hash table is consulted: on a hit, the corresponding physical KV block is reused directly; on a miss, the block is computed during prefill and inserted (see the sketch below)
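
A minimal sketch of this chained-hash scheme follows. It is illustrative only, not vLLM's actual code: the function name chained_block_hashes, the string-based payload, and the plain dict lookup table are all assumptions.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block; a typical default

def chained_block_hashes(token_ids: list[int]) -> list[str]:
    """Hash every full block, chaining in the previous block's hash so
    that only exactly identical prefixes produce identical keys."""
    hashes: list[str] = []
    prefix_hash = ""
    num_full_blocks = len(token_ids) // BLOCK_SIZE
    for i in range(num_full_blocks):
        block = token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        payload = prefix_hash + "|" + ",".join(map(str, block))
        prefix_hash = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prefix_hash)
    return hashes

# Global lookup table: block hash -> physical KV block id.
# On a hit the physical block is reused; on a miss the block is
# computed during prefill and then registered here.
kv_block_table: dict[str, int] = {}
```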

Key design: the hash value incorporates information from all preceding blocks (chained via prefix_hash), ensuring that only exactly identical prefixes match. This prevents false matches, but also means:

  • If two sequences differ in even just the first token, the hashes of all subsequent blocks will differ (demonstrated below)
  • Non-prefix sharing cannot be matched (e.g., sequences with the same middle section but different beginnings)
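
Continuing the sketch above, a quick check makes the first point concrete: changing only the first token changes every downstream block hash.

```python
a = chained_block_hashes(list(range(64)))             # tokens 0, 1, ..., 63
b = chained_block_hashes([999] + list(range(1, 64)))  # only the first token differs
assert all(x != y for x, y in zip(a, b))  # all four block hashes diverge
```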

Core Idea of RadixAttention

SGLang takes a different path — using a Radix Tree to manage all cached KV prefixes.

A Radix Tree is a compressed version of a Trie: paths with single-child nodes are merged into a single node, greatly reducing tree depth. In RadixAttention, each node stores KV Cache blocks corresponding to a segment of the token sequence.
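
As a concrete picture, a radix-tree node might look like this. This is a minimal Python sketch; the field names are assumptions for illustration, not SGLang's actual types.

```python
from dataclasses import dataclass, field

@dataclass
class RadixNode:
    # Token ids covered by this node; a compressed edge can hold a whole
    # segment (e.g. the full system prompt) rather than a single token.
    tokens: list[int]
    kv_blocks: list[int] = field(default_factory=list)  # physical KV block ids
    children: dict[int, "RadixNode"] = field(default_factory=dict)  # keyed by each child's first token
    ref_count: int = 0         # running requests currently pinning this prefix
    last_access: float = 0.0   # timestamp used by LRU eviction
```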

[Figure: a radix tree of cached prefixes. Root: [SYS] "You are helpfu..." (4 KV blocks, ref=1); child: [USER] "What is Paged..." (2 KV blocks, ref=1); leaf: [ASST] "PagedAttentio..." (3 KV blocks, ref=0). Nodes with ref > 1 are shared prefix nodes; nodes with ref = 0 are exclusive.]

The tree structure naturally expresses sharing relationships between sequences:

  • The root node is typically the system prompt’s KV Cache — reused by nearly all requests
  • Internal nodes are segments of conversation history — multi-turn conversations share earlier turns
  • Leaf nodes are the unique new portions of each request

The lookup process is a longest prefix match starting from the root: token sequences are compared node by node along the tree until a fork or the end of the match is reached. The matched portion directly reuses the KV Cache, and only the unmatched tail needs to be computed.
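
Building on the RadixNode sketch above, longest prefix match can be written roughly as follows. This is illustrative, not SGLang's implementation; a production version would also split nodes and pin ref counts.

```python
import time

def match_prefix(root: RadixNode, tokens: list[int]) -> tuple[list[int], int]:
    """Walk down from the root along the longest matching prefix.
    Returns (reusable KV block ids, number of matched tokens);
    only tokens[matched:] still need to be prefilled."""
    node, matched, blocks = root, 0, []
    while matched < len(tokens):
        child = node.children.get(tokens[matched])
        if child is None:
            break  # fork: no child starts with the next token
        # Compare token by token along this compressed edge.
        n = 0
        while (n < len(child.tokens) and matched + n < len(tokens)
               and child.tokens[n] == tokens[matched + n]):
            n += 1
        if n < len(child.tokens):
            # Divergence inside an edge: a full implementation splits the
            # node here to reuse the matched part; this sketch stops early.
            break
        matched += n
        blocks.extend(child.kv_blocks)
        child.last_access = time.time()
        node = child
    return blocks, matched
```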

LRU Eviction

GPU memory is limited, so it is impossible to cache all historical KV. When memory runs low, RadixAttention uses an LRU strategy, evicting from leaf nodes first:

[Figure: LRU eviction example with memory at 16/20 blocks. Root [SYS] prompt (4 blocks, accessed 0s ago) has children [USER] question A (2 blocks, 30s ago) and [USER] question B (2 blocks, 5s ago); under question A are leaves [ASST] answer A1 (3 blocks, 60s ago) and [ASST] answer A2 (2 blocks, 45s ago); under question B is leaf [ASST] answer B1 (3 blocks, 10s ago). The least recently used leaf, answer A1, is evicted first.]

Why start from leaf nodes? Because leaf nodes are typically the most specialized sequence tails with the lowest reuse probability. Root and internal nodes are shared prefixes referenced by multiple sequences — evicting them would affect more subsequent requests.

The LRU implementation is also natural: each node records the timestamp of its last access, and eviction selects the least recently accessed leaf node. When a leaf node is evicted, if its parent becomes a new leaf (no other children), the parent may be evicted in the next round — forming a bottom-up progressive eviction pattern.
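
A sketch of that bottom-up LRU eviction, again building on the RadixNode above. It is simplified for illustration; real engines batch this work and maintain explicit free lists.

```python
def evict_lru(root: RadixNode, blocks_needed: int) -> int:
    """Free KV blocks by repeatedly evicting the least recently used
    leaf that no running request still references."""
    freed = 0
    while freed < blocks_needed:
        # Collect evictable leaves: childless, unreferenced, not the root.
        candidates = []
        stack = [root]
        while stack:
            node = stack.pop()
            for child in node.children.values():
                if not child.children and child.ref_count == 0:
                    candidates.append((node, child))
                stack.append(child)
        if not candidates:
            break  # everything left is shared or in use
        parent, victim = min(candidates, key=lambda pc: pc[1].last_access)
        freed += len(victim.kv_blocks)
        # Removing the child may turn `parent` into a leaf, making it
        # eligible next round: bottom-up progressive eviction.
        del parent.children[victim.tokens[0]]
    return freed
```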

Reuse Effectiveness Across Scenarios

Prefix caching effectiveness varies greatly across different usage patterns:

Cache hit rate by dialogue pattern (R1–R10 = request index):

| Pattern | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | R10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Single-turn | 0% | 0% | 0% | 5% | 0% | 0% | 2% | 0% | 0% | 0% |
| Multi-turn | 0% | 40% | 55% | 65% | 72% | 76% | 79% | 82% | 84% | 86% |
| Few-shot | 0% | 70% | 70% | 70% | 71% | 70% | 72% | 70% | 71% | 70% |
| Tree-of-Thought | 0% | 50% | 50% | 35% | 60% | 45% | 55% | 40% | 50% | 65% |

Observed patterns:

  • Single-turn independent requests: nearly zero cache hits (each request has a different prompt)
  • Multi-turn conversations: hit rate increases with each turn (the accumulated shared prefix grows longer)
  • Few-shot: stable high hit rate starting from the 2nd request (all requests share the same examples)
  • Tree-of-thought: fluctuating hit rate (depends on branch point positions and exploration paths)

This demonstrates that prefix caching is not a silver bullet — its benefits depend entirely on the degree of prefix overlap between requests.

vLLM vs SGLang Caching Comparison

Each approach has its strengths and weaknesses:

| Dimension | vLLM (Hash-based) | SGLang (RadixAttention) |
| --- | --- | --- |
| Data structure | Hash Table | Radix Tree |
| Matching method | Exact prefix matching | Longest prefix matching |
| Match granularity | Block-level | Token-level |
| Eviction policy | Reference counting + LRU | LRU, starting from leaf nodes |
| Non-prefix sharing | Not supported | Limited support |
| Lookup complexity | O(1) per block | O(L), where L is the prefix length |

vLLM has also experimentally supported tree-based caching since v0.6.

The practical choice depends on the scenario:

  • High-concurrency API serving (many independent requests): vLLM’s hash-based approach is simpler and more efficient
  • Multi-turn conversations / complex pipelines (extensive prefix sharing): SGLang’s RadixAttention offers more flexible reuse
  • Mixed scenarios: the performance gap between the two is narrowing, and vLLM is also experimenting with tree-based caching

Summary

The core of prefix caching is a simple intuition: avoid recomputing KV Cache that has already been computed. vLLM uses a hash table for clean, efficient exact matching, while SGLang uses a radix tree for flexible longest prefix matching. Both use LRU eviction to manage limited GPU memory.

From a broader perspective, prefix caching is the key step in inference engines moving from “single-request optimization” to “cross-request optimization” — it lets the engine exploit computational redundancy across requests, which yields far greater benefits in real deployments (multi-turn conversations, batch processing, pipelines) than optimizing PagedAttention alone.

Further Reading