Speculative Decoding — Accelerating LLM Inference via Guessing

Updated 2026-04-06

The decode phase of LLM inference is memory-bound — each step generates only 1 token, leaving vast GPU compute capacity idle. The core bottleneck of autoregressive generation is sequential dependency: generating each token depends on the result of the previous one.

The central idea behind Speculative Decoding is: use a fast but less accurate method to “guess” multiple future tokens, then have the target large model verify all guesses in a single parallel forward pass. If the guesses are correct, a single round produces multiple tokens; if wrong, resample from the error position — while guaranteeing that the final output distribution is exactly identical to generating with the large model alone.

Motivation — Why Decode Is Slow

Recall the conclusions from prefill-vs-decode:

  • Prefill phase: Processes all prompt tokens in parallel, is compute-bound (GPU compute fully utilized)
  • Decode phase: Generates tokens one by one, each step performs GEMV (matrix-vector multiplication), is memory-bound

Each forward pass during decode requires loading the entire model weights from HBM to the compute units, yet only performs a minimal amount of computation (GEMV for a single token). GPU compute utilization is extremely low.

The core question: Can we generate tokens “in batch”? Direct batching is impossible (due to autoregressive dependency), but we can guess first, then verify — this is Speculative Decoding.

Draft-then-Verify — The Classic Approach

Draft Phase

Use a much smaller draft model (e.g., a 68M model paired with a 7B target model) to autoregressively generate $K$ candidate tokens. The small model is fast but less accurate.

Verify Phase

The target large model runs a single forward pass over the $K$ candidate tokens — just like the prefill phase, this is parallel, obtaining probability distributions for all $K$ positions at once.

[Figure] Draft model (68M) autoregressively generates K=4 candidate tokens, fast but less accurate — e.g., "The" (q=0.40) → "quick" (q=0.30) → "brown" (q=0.25) → "fox" (q=0.35).
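In code, the two phases might look like the following minimal sketch. The `target` and `draft` callables are hypothetical stand-ins for real model wrappers, not a specific library API:

```python
import torch

@torch.no_grad()
def draft_then_verify_round(target, draft, prefix, K=4):
    """One speculative round. `target` and `draft` are assumed to be
    callables mapping a 1-D token tensor to [seq, vocab] probabilities."""
    tokens, qs = list(prefix), []
    # Draft phase: K cheap autoregressive steps with the small model.
    for _ in range(K):
        q = draft(torch.tensor(tokens))[-1]            # next-token distribution
        qs.append(q)
        tokens.append(torch.multinomial(q, 1).item())
    # Verify phase: ONE parallel forward pass of the large model over
    # all K candidates -- compute-bound, like prefill.
    p = target(torch.tensor(tokens))
    # p[len(prefix)-1+i] is the target's distribution at the position
    # where candidate i was sampled; verification uses these (see below).
    return tokens[len(prefix):], qs, p
```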

Rejection Sampling — Guaranteeing Distribution Consistency

This is the most elegant part of Speculative Decoding — rejection sampling guarantees that the output distribution is exactly identical to generating with the large model alone:

For each position $i$, compare the draft probability $q(x_i)$ with the target probability $p(x_i)$:

  • If $q(x) \leq p(x)$: Accept — the draft’s conservative guess is safe
  • If $q(x) > p(x)$: Reject with probability $1 - p(x)/q(x)$

After a rejection, resample 1 token from the corrected distribution $\mathrm{norm}(\max(0, p(x) - q(x)))$.
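A minimal sketch of this accept/reject loop, assuming `qs[i]` and `p_dists[i]` are the aligned draft/target distributions for candidate position i (as produced by a round like the sketch above):

```python
import torch

def verify(candidates, qs, p_dists):
    """Rejection-sampling verification (minimal sketch). qs[i] and
    p_dists[i] are the draft and target distributions at candidate i."""
    out = []
    for i, x in enumerate(candidates):
        p, q = p_dists[i], qs[i]
        # Accept x with probability min(1, p(x)/q(x)):
        # q(x) <= p(x) always accepts, matching the rule above.
        if torch.rand(()) < torch.clamp(p[x] / q[x], max=1.0):
            out.append(x)
        else:
            # Rejected: resample once from norm(max(0, p - q)) and stop.
            residual = torch.clamp(p - q, min=0.0)
            out.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    return out
```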

Key guarantee: No matter how poor the draft model is, the final output distribution is exactly the same as generating with the target model alone. A poor draft model only reduces the speedup (more rejections), without affecting quality.

Speedup Analysis

Define the acceptance rate $\alpha$ — the average probability that a draft token is accepted.

The expected number of tokens generated per round is:

$$E[\text{tokens}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

The higher $\alpha$ (a more accurate draft) and the larger $K$ (more guesses per round), the greater the speedup:

[Figure] Expected tokens per round vs. draft length $K$, for $\alpha \in \{0.5, 0.7, 0.8, 0.9, 0.95\}$. Example: $\alpha = 0.8$, $K = 5$ gives $(1 - 0.8^6)/(1 - 0.8) \approx 3.69$ tokens per round.
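As a quick sanity check on the formula (assuming each draft token is accepted independently with probability α):

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected accepted tokens per round: (1 - alpha^(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(round(expected_tokens(0.8, 5), 2))  # 3.69 -- matches the example above
```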

Practical speedup is typically 2-3x (depending on draft model quality and target model size).

Medusa — Adding Multiple Heads at Inference Time

Cai et al. (2024) proposed an approach that does not require a separate draft model.

Core Improvement

Attach multiple lightweight prediction heads on top of the target model’s last-layer hidden state:

  • Head 1 predicts the next token
  • Head 2 predicts the token after that
  • Head $K$ predicts the $K$-th token ahead — a minimal head sketch follows
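As a rough sketch of what one such head can look like (the residual-MLP shape follows the Medusa paper's description; the dimensions here are illustrative assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class MedusaHead(nn.Module):
    """One lightweight Medusa-style head: a residual MLP on the target
    model's last hidden state, followed by its own vocabulary projection."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.lm = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, h):
        # h: [batch, d_model] last-layer hidden state at the current step
        return self.lm(h + F.silu(self.proj(h)))  # logits for one future slot

# K = 4 heads; head k guesses the token k positions ahead.
heads = nn.ModuleList(MedusaHead(4096, 32000) for _ in range(4))
```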

Tree Attention Verification

Predictions from multiple heads are combined into a candidate tree (rather than a single sequence), and a tree attention mask verifies all candidate paths in a single forward pass, selecting the longest accepted path:

[Figure] Medusa candidate tree: from the current token "cat" (conf 1.00), Head 1 proposes {"sat" (0.72), "is" (0.61)} and Head 2 proposes {"on" (0.68), "by" (0.31), "a" (0.55), "very" (0.22)}. Tree attention verifies all paths at once and selects the longest accepted branch.

Training

The target model is frozen, and only the additional prediction heads are trained. Training cost is low, requiring only a small amount of data and few parameters.

Advantages: No extra model needed, simple deployment, low training cost

Limitations: Heads are not trained jointly with the backbone (they are added post hoc on a frozen model), so prediction quality is limited — the acceptance rate is lower than Eagle’s (discussed below)

Multi-Token Prediction (MTP) — Adding Multiple Heads at Training Time

Core Idea

The key difference from Medusa: the model is trained to simultaneously predict future tokens 1, 2, …, K. Each prediction head shares the backbone network and has its own independent output layer, with training loss = sum (or weighted sum) of all head losses.
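A minimal sketch of this joint objective (PyTorch-style; the `hidden`/`heads` shapes and the weighting scheme are illustrative assumptions, not a specific model's recipe):

```python
import torch
import torch.nn.functional as F

def mtp_loss(hidden, heads, tokens, weights=None):
    """Sum of per-head losses: head k predicts the token k steps ahead
    from the SHARED backbone's hidden states (joint optimization).

    hidden: backbone output, [seq, d_model]; tokens: token ids, [seq];
    heads:  K output layers, e.g. nn.Linear(d_model, vocab).
    """
    weights = weights or [1.0] * len(heads)
    loss = torch.tensor(0.0)
    for k, head in enumerate(heads, start=1):
        logits = head(hidden[:-k])   # positions that have a k-ahead target
        targets = tokens[k:]         # the token k steps ahead
        loss = loss + weights[k - 1] * F.cross_entropy(logits, targets)
    return loss  # backprop updates heads AND backbone together
```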

Key Difference from Medusa

| | Medusa | MTP |
| --- | --- | --- |
| Training | Main model frozen; heads trained separately after the fact | Joint training; heads and backbone optimized together |
| Head quality | Never saw the main model’s training process | Co-optimized with the backbone |
| Analogy | An aftermarket guessing add-on | Factory-built multi-step prediction capability |
[Figure] MTP training vs. inference: a shared backbone feeds Heads 1–3. During training, each head carries its own loss (next, next+1, next+2); during inference, Head 1’s output is verified as the target while Heads 2–3 supply draft tokens. The heads jointly optimized during training are reused directly as the speculative draft at inference — unlike Medusa, heads and backbone are trained together, yielding better prediction quality.

Bridge from Training to Inference

Multi-head prediction during training translates directly to speculative drafting at inference time. No separate draft model is needed — the model itself is both drafter and verifier (self-speculative).

Practical Applications

  • DeepSeek-V3: Trains with MTP, then leverages MTP heads for speculative decoding at inference
  • Meta 2024: “Better & Faster LLMs via Multi-Token Prediction” systematically validated MTP’s effectiveness

Eagle — Feature-Level Drafting

Li et al. (2024) proposed the approach with the highest acceptance rate to date.

Core Insight

Drafting does not need to start from token embeddings — the target model’s last-layer hidden state already encodes rich contextual semantic information. Drafting directly from hidden state features provides far more information than token-level drafting, naturally yielding higher accuracy.

[Figure] Traditional draft model vs. Eagle. Traditional: token embedding (less info) → independent small model (68M) → draft tokens. Eagle: the target’s top-layer hidden states (rich info) → a 1-layer lightweight decoder → draft tokens. Token embeddings carry only the token’s own information, while hidden states encode full context, semantic relations, and syntactic patterns — Eagle’s acceptance rate is ~10-15% higher than Medusa’s.

Architecture

Eagle’s “auto-regression head” is a lightweight decoder layer:

  • Input: Target model’s top-layer hidden state + current token embedding
  • Output: Feature vector for the next position, mapped to token probabilities via the target model’s LM head
  • Autoregressive: Can use its own output to continue predicting further positions — a minimal sketch follows
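A minimal sketch of such a head (one fused linear layer plus a single self-attention layer; the exact fusion and layer shape are assumptions for illustration, not EAGLE's released code):

```python
import torch
import torch.nn as nn

class EagleDraftHead(nn.Module):
    """EAGLE-style autoregression head: fuses the target's top-layer
    feature with the current token embedding, then runs ONE lightweight
    transformer layer to predict the next position's feature."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                                batch_first=True)

    def forward(self, hidden, token_emb, causal_mask=None):
        # hidden, token_emb: [batch, seq, d_model]
        x = self.fuse(torch.cat([hidden, token_emb], dim=-1))
        feat = self.layer(x, src_mask=causal_mask)
        # `feat` is mapped to token logits via the TARGET model's LM head,
        # and can be fed back in as `hidden` for the next draft step.
        return feat
```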

Why Acceptance Rate Is Higher

  • Feature-level information >> token-level: Hidden states contain complete context, semantic relationships, and syntactic patterns
  • Token embeddings only carry the token’s own information, losing context
  • Experiments show Eagle’s acceptance rate is approximately 10-15% higher than Medusa’s

Eagle-2 Improvements

Introduces a context-aware dynamic draft tree:

  • Dynamically adjusts tree structure based on each node’s confidence
  • High-confidence branches expand deeper (more candidates)
  • Low-confidence branches are pruned early (saving verification compute)
  • Result: further speedup improvement over Eagle-1

Training

The target model is frozen, and only the lightweight decoder is trained (similar to Medusa’s training approach), but since the input is features rather than tokens, performance is better for the same amount of training.

Draft Tree — Tree-Structured Speculation

Tree attention came up above in both Medusa and Eagle-2 — here we elaborate on the structure and verification mechanism of draft trees.

Why Trees Beat Sequences

Classic Draft-then-Verify generates a chain (sequence): once a position is rejected, all subsequent tokens are discarded. A draft tree maintains multiple candidate paths, verifying all paths in a single forward pass — rejection only affects a single branch, leaving other paths unaffected.

[Figure] Tree-based draft with a token budget of 9: 9 tokens verified, 4 accepted (44%). Tree vs. chain under the same budget: a tree verifies multiple candidate paths in parallel, so a rejection affects only a single branch and other paths are untouched; a chain is a single sequence, where one rejection voids all subsequent tokens and budget utilization is poor.

Tree Attention Mask

To verify the entire tree in a single forward pass, a special attention mask must be constructed: each node can only attend to nodes along its ancestor path (not all preceding nodes). This is sparser than a standard causal mask:

[Figure] Standard causal mask (every token sees all preceding tokens) vs. tree attention mask (every token sees only its ancestor path). The tree mask lets different branches be verified in parallel: "is" does not attend to "sat"/"on", and "dog" does not attend to the "cat" branch — one forward pass verifies all paths, and the longest accepted path is selected.
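A minimal sketch of how such a mask can be built from the tree's parent pointers (the boolean convention and flattened node layout are assumptions for illustration):

```python
import torch

def tree_attention_mask(parents):
    """Build a tree attention mask from parent indices: parents[i] is
    node i's parent in the flattened tree (-1 for the root). Node i may
    attend only to its ancestor path. True = attention allowed."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:       # walk from node i up to the root
            mask[i, j] = True
            j = parents[j]
    return mask

# "The" -> {"cat" -> "sat", "is"}: node 3 ("is") never sees nodes 1-2.
print(tree_attention_mask([-1, 0, 1, 0]).int())
```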

After verification, the longest accepted path in the tree is selected as output. Under a fixed token budget, tree structures yield significantly more expected accepted tokens than chain structures.

EAGLE-3 — Scaling Up Speculative Decoding

The EAGLE series currently achieves the highest acceptance rate among speculative decoding methods. The core evolution from EAGLE-1 to EAGLE-3:

EAGLE-1: Feature-Level Drafting
[Figure] EAGLE-1: target model forward pass → top-layer hidden state + token embedding → lightweight draft head → draft tokens T+1, T+2, .... Core insight: feature-level information > token-level — the hidden state encodes full context semantics, giving an acceptance rate 10-15% higher than Medusa. Limitation: drafting depends on the target model’s hidden state, so it must wait for the target forward pass.

Key Improvements in EAGLE-3

EAGLE-1/2 predict feature vectors (the target model’s hidden states), then map them to tokens. EAGLE-3 switches to direct token prediction — predicting the next token directly while incorporating multi-layer feature fusion from the target model, leveraging Training-Time Test techniques for better training data scaling.

EAGLE 1/2: Feature Prediction Pipeline
[Figure] EAGLE-1/2 pipeline: target forward → extract hidden state → draft head (feature → token) → verify. Serial dependency: the draft must wait for the target’s hidden state, so rounds cannot be pipelined — each round alternates a slow target forward with a fast draft. EAGLE-1 reaches ~3.5x, EAGLE-2 ~4.5x (the dynamic tree improves budget allocation).

In benchmarks, EAGLE-3 achieves 6.5x speedup, approximately 1.4x better than EAGLE-2. Within the SGLang inference framework, throughput improves by 1.38x at batch=64.

Lookahead Decoding — No Draft Model

Fu et al. (2024) proposed a fundamentally different approach.

Based on Jacobi Iteration

No draft model or extra heads required — based on Jacobi iteration, multiple future token positions are guessed simultaneously, verified in parallel, and iteratively refined until convergence.

How it works:

  1. Initialize: Randomly guess tokens at positions $t+1, t+2, \ldots, t+K$
  2. Each step: Run the target model’s forward pass on all positions in parallel
  3. If a position’s prediction matches the guess, that position has “converged”
  4. Unconverged positions are replaced with new predictions, and iteration continues
  5. Typically converges in 2-3 iterations (see the sketch below)
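A minimal sketch of the pure Jacobi fixed-point loop (greedy decoding; `model` is a hypothetical callable returning [seq, vocab] logits — the real Lookahead method adds n-gram pools and a lookahead branch on top of this):

```python
import torch

@torch.no_grad()
def jacobi_step(model, prefix, K=4, max_iters=10):
    """Guess K future tokens, then iterate parallel refinement until
    the guesses reach a fixed point of greedy decoding."""
    guesses = [0] * K                      # arbitrary initialization
    for _ in range(max_iters):
        ids = torch.tensor(prefix + guesses)
        logits = model(ids)
        # logits[i] predicts position i+1, so the guessed positions'
        # predictions live at indices len(prefix)-1 .. len(prefix)+K-2.
        preds = logits[len(prefix) - 1:-1].argmax(dim=-1).tolist()
        if preds == guesses:               # every position converged
            break
        guesses = preds                    # replace and iterate again
    return guesses
```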

Advantages: Zero extra parameters, zero training cost, plug-and-play with any model

Limitations: Speedup is typically lower than Medusa/Eagle (number of iterations is unpredictable), practically around 1.5-2x

Comparison Summary

| Method | Extra Params | Training Cost | Speedup | Use Case |
| --- | --- | --- | --- | --- |
| Draft-then-Verify | Separate draft model | Train draft model | 2-3x | Already have a paired small model |
| Medusa | Multiple lightweight heads | Low | 2-3x | Quick deployment |
| MTP | Built-in at training time | High (pretraining) | 2-3x | Training a new model |
| Eagle | Lightweight decoder | Low | 3-4x | Max acceleration |
| Eagle-3 | Lightweight draft model | Low | ~6.5x | Max acceleration |
| Lookahead | None | Zero | 1.5-2x | Plug-and-play |

Selection Guide

  • Already have a paired small model -> Draft-then-Verify
  • Training a new model from scratch -> MTP (build multi-head prediction into pretraining)
  • Have an existing large model, want quick acceleration -> Eagle > Medusa > Lookahead
  • Don’t want to train anything -> Lookahead (plug-and-play)

All methods guarantee distribution consistency through rejection sampling — the acceleration is lossless, affecting only speed, not quality. The core difference lies in the source and quality of the draft, which determines the acceptance rate and final speedup.