Speculative Decoding — Accelerating LLM Inference via Guessing
Updated 2026-04-06
The decode phase of LLM inference is memory-bound — each step generates only 1 token, leaving vast GPU compute capacity idle. The core bottleneck of autoregressive generation is sequential dependency: generating each token depends on the result of the previous one.
The central idea behind Speculative Decoding is: use a fast but less accurate method to “guess” multiple future tokens, then have the target large model verify all guesses in a single parallel forward pass. If the guesses are correct, a single round produces multiple tokens; if wrong, resample from the error position — while guaranteeing that the final output distribution is exactly identical to generating with the large model alone.
Motivation — Why Decode Is Slow
Recall the conclusions from prefill-vs-decode:
- Prefill phase: Processes all prompt tokens in parallel, is compute-bound (GPU compute fully utilized)
- Decode phase: Generates tokens one by one, each step performs GEMV (matrix-vector multiplication), is memory-bound
Each forward pass during decode requires loading the entire model weights from HBM to the compute units, yet only performs a minimal amount of computation (GEMV for a single token). GPU compute utilization is extremely low.
The core question: Can we generate tokens “in batch”? Direct batching is impossible (due to autoregressive dependency), but we can guess first, then verify — this is Speculative Decoding.
Draft-then-Verify — The Classic Approach
Draft Phase
Use a much smaller draft model (e.g., a 68M model paired with a 7B target model) to autoregressively generate candidate tokens. The small model is fast but less accurate.
Verify Phase
The target large model runs a single forward pass over the candidate tokens — just like the prefill phase, this is parallel, obtaining probability distributions for all positions at once.
Draft model autoregressively generates K=4 candidate tokens (fast but less accurate)
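The two phases can be sketched end to end. This is a toy greedy-verification variant in Python/NumPy: `target_logits_fn` and `draft_next_token` are hypothetical stand-ins for real model calls, not any library's API, and the sampling-based rejection rule is covered in the next section.

```python
import numpy as np

def draft_then_verify(target_logits_fn, draft_next_token, prefix, K=4):
    """One round of draft-then-verify with greedy verification (sketch).

    target_logits_fn(tokens) -> logits for every position in one parallel pass;
    draft_next_token(tokens) -> the draft model's next-token guess.
    Both are toy stand-ins, not a real model API.
    """
    # Draft phase: the small model guesses K tokens autoregressively.
    candidates = list(prefix)
    for _ in range(K):
        candidates.append(draft_next_token(candidates))

    # Verify phase: one parallel forward pass of the target model,
    # exactly like prefill -- logits for all positions at once.
    logits = target_logits_fn(candidates)

    accepted = list(prefix)
    for i in range(len(prefix), len(candidates)):
        target_choice = int(np.argmax(logits[i - 1]))  # target's prediction for position i
        if target_choice == candidates[i]:
            accepted.append(candidates[i])   # guess matches: keep it, move on
        else:
            accepted.append(target_choice)   # first mismatch: take target's token, stop
            break
    else:
        # All K guesses accepted; the logits at the last position
        # yield one extra "free" token from the target.
        accepted.append(int(np.argmax(logits[-1])))
    return accepted
```

Note that even a total rejection still produces one valid token per round, so a round never wastes the target's forward pass.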
Rejection Sampling — Guaranteeing Distribution Consistency
This is the most elegant part of Speculative Decoding — rejection sampling guarantees that the output distribution is exactly identical to generating with the large model alone:
For each position i, compare the draft probability q(x) with the target probability p(x):
- If q(x) <= p(x): Accept — the draft’s conservative guess is safe
- If q(x) > p(x): Reject with probability 1 - p(x)/q(x)
After rejection, resample 1 token from the corrected distribution norm(max(0, p(x) - q(x))).
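The accept/reject/resample rule is short enough to state directly in code. A minimal NumPy sketch for a single position, where `p` and `q` are the target's and draft's full distributions over the vocabulary:

```python
import numpy as np

def speculative_accept(p, q, x, rng):
    """Accept or reject one drafted token x, preserving the target distribution.

    p, q: target and draft probability vectors over the vocab at this position;
    x: the token the draft sampled from q.
    Accept with probability min(1, p[x]/q[x]); on rejection, resample from
    the corrected residual distribution norm(max(0, p - q)).
    Returns (token, accepted_flag).
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    residual = np.maximum(p - q, 0.0)   # mass the draft under-covered
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False
```

Sampling x from q and then applying this rule yields tokens distributed exactly according to p, which is the lossless guarantee the section describes.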
Key guarantee: No matter how poor the draft model is, the final output distribution is exactly the same as generating with the target model alone. A poor draft model only reduces the speedup (more rejections), without affecting quality.
Speedup Analysis
Define the acceptance rate α — the average probability that a single draft token is accepted.
With K draft tokens per round, the expected number of tokens generated per round is (1 - α^(K+1)) / (1 - α).
The higher α is (more accurate draft) and the larger K is (more guesses per round), the higher the speedup.
Practical speedup is typically 2-3x (depending on draft model quality and target model size).
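The expected-tokens formula, under the standard simplifying assumption that each draft token is accepted independently with probability α, is a one-liner:

```python
def expected_tokens(alpha: float, K: int) -> float:
    """Expected tokens per verification round: (1 - alpha^(K+1)) / (1 - alpha).

    Assumes each of the K draft tokens is accepted independently with
    probability alpha; the +1 in the exponent accounts for the one token
    (resampled or bonus) the target model always contributes.
    """
    return (1 - alpha ** (K + 1)) / (1 - alpha)

# e.g. alpha = 0.8, K = 4 gives about 3.36 tokens per round,
# versus exactly 1 token per step for plain autoregressive decode.
```

Since the verification pass costs roughly one target-model step, this expectation is a first-order estimate of the speedup, before accounting for the draft model's own latency.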
Medusa — Adding Multiple Heads at Inference Time
Cai et al. (2024) proposed an approach that does not require a separate draft model.
Core Improvement
Attach multiple lightweight prediction heads on top of the target model’s last-layer hidden state:
- Head 1 predicts the next token
- Head 2 predicts the token after that
- Head K predicts the K-th token ahead
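The head structure can be sketched as extra projections hanging off the frozen backbone. This is illustrative only: real Medusa heads are small residual MLPs, whereas here each head is a bare linear layer with made-up shapes.

```python
import numpy as np

class MedusaHeads:
    """Sketch: K extra lightweight heads on the target model's
    last-layer hidden state (real Medusa heads use a residual MLP;
    shapes and init here are purely illustrative)."""

    def __init__(self, d_model, vocab, K, rng=None):
        rng = rng or np.random.default_rng(0)
        # one weight matrix per head; the backbone itself stays frozen
        self.W = [rng.standard_normal((d_model, vocab)) * 0.02 for _ in range(K)]

    def __call__(self, h):
        # h: last-layer hidden state at the current position, shape (d_model,)
        # head k (1-indexed, as in the list above) guesses the token k steps ahead
        return [h @ Wk for Wk in self.W]   # K logit vectors over the vocab
```

Because every head reads the same hidden state, all K guesses come from a single backbone forward pass, which is what makes the drafting essentially free.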
Tree Attention Verification
Predictions from multiple heads are combined into a candidate tree (rather than a single sequence), and a tree attention mask verifies all candidate paths in a single forward pass, selecting the longest accepted path:
Medusa combines multiple heads into candidate tree, Tree Attention verifies all paths at once, selecting the longest accepted branch
Training
The target model is frozen, and only the additional prediction heads are trained. Training cost is low, requiring only a small amount of data and few parameters.
Advantages: No extra model needed, simple deployment, low training cost
Limitations: Heads are not jointly trained (added only at inference time), so prediction quality is limited — acceptance rate is lower than Eagle (discussed below)
Multi-Token Prediction (MTP) — Adding Multiple Heads at Training Time
Core Idea
The key difference from Medusa: the model is trained to simultaneously predict future tokens 1, 2, …, K. Each prediction head shares the backbone network and has its own independent output layer, with training loss = sum (or weighted sum) of all head losses.
Key Difference from Medusa
| | Medusa | MTP |
|---|---|---|
| Training | Freeze main model at inference, train heads separately | Joint training, heads and backbone optimized together |
| Head quality | Never saw the main model’s training process | Co-optimized with the backbone |
| Analogy | An aftermarket guessing add-on | Factory-built multi-step prediction capability |
MTP Core: Multiple prediction heads jointly optimized during training → directly reused as speculative draft during inference
Unlike Medusa: heads and backbone trained together, better prediction quality
Bridge from Training to Inference
Multi-head prediction during training translates directly to speculative drafting at inference time. No separate draft model is needed — the model itself is both drafter and verifier (self-speculative).
Practical Applications
- DeepSeek-V3: Trains with MTP, then leverages MTP heads for speculative decoding at inference
- Meta 2024: “Better & Faster LLMs via Multi-Token Prediction” systematically validated MTP’s effectiveness
Eagle — Feature-Level Drafting
Li et al. (2024) proposed the approach with the highest acceptance rate to date.
Core Insight
Drafting does not need to start from token embeddings — the target model’s last-layer hidden state already encodes rich contextual semantic information. Drafting directly from hidden state features provides far more information than token-level drafting, naturally yielding higher accuracy.
Key insight: Token embedding only has token info, while hidden state encodes full context, semantic relations, syntactic patterns
Architecture
Eagle’s “auto-regression head” is a lightweight decoder layer:
- Input: Target model’s top-layer hidden state + current token embedding
- Output: Feature vector for the next position, mapped to token probabilities via the target model’s LM head
- Autoregressive: Can use its own output to continue predicting further positions
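One draft step of this architecture can be sketched with toy shapes. `W_fuse`, `decoder_layer`, and `lm_head` are hypothetical placeholders for the fusion projection, the single lightweight decoder layer, and the target model's frozen LM head:

```python
import numpy as np

def eagle_draft_step(hidden, tok_emb, W_fuse, decoder_layer, lm_head):
    """One EAGLE-style draft step (shapes illustrative).

    hidden: target model's top-layer hidden state, (d_model,)
    tok_emb: current token's embedding, (d_model,)
    Concatenate, fuse down to d_model, run one lightweight decoder layer,
    then reuse the frozen target LM head to score tokens. The output
    feature feeds the next draft step (autoregression at the feature level).
    """
    x = np.concatenate([hidden, tok_emb])   # (2 * d_model,)
    feat = decoder_layer(x @ W_fuse)        # next position's feature, (d_model,)
    logits = feat @ lm_head                 # token scores via the target's LM head
    return feat, logits
```

Chaining `feat` back in as the next step's `hidden` is what lets the draft head run ahead for several positions without touching the full target model.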
Why Acceptance Rate Is Higher
- Feature-level information >> token-level: Hidden states contain complete context, semantic relationships, and syntactic patterns
- Token embeddings only carry the token’s own information, losing context
- Experiments show Eagle’s acceptance rate is approximately 10-15% higher than Medusa’s
Eagle-2 Improvements
Introduces a context-aware dynamic draft tree:
- Dynamically adjusts tree structure based on each node’s confidence
- High-confidence branches expand deeper (more candidates)
- Low-confidence branches are pruned early (saving verification compute)
- Result: further speedup improvement over Eagle-1
Training
The target model is frozen, and only the lightweight decoder is trained (similar to Medusa’s training approach), but since the input is features rather than tokens, performance is better for the same amount of training.
Draft Tree — Tree-Structured Speculation
Both Medusa and Eagle-2 mentioned tree attention above — here we elaborate on the structure and verification mechanism of draft trees.
Why Trees Beat Sequences
Classic Draft-then-Verify generates a chain (sequence): once a position is rejected, all subsequent tokens are discarded. A draft tree maintains multiple candidate paths, verifying all paths in a single forward pass — rejection only affects a single branch, leaving other paths unaffected.
Tree Attention Mask
To verify the entire tree in a single forward pass, a special attention mask must be constructed: each node can only attend to nodes along its ancestor path (not all preceding nodes). This is sparser than a standard causal mask:
After verification, the longest accepted path in the tree is selected as output. Under a fixed token budget, tree structures yield significantly more expected accepted tokens than chain structures.
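The ancestor-path mask is straightforward to build once the tree is given as a parent array. A minimal sketch, assuming nodes are indexed in the flattened verification order:

```python
import numpy as np

def tree_attention_mask(parents):
    """Build a tree attention mask from a parent array.

    parents[i] is node i's parent index (-1 for the root).
    mask[i, j] is True iff node j lies on node i's ancestor path
    (including i itself) -- strictly sparser than a causal mask,
    so sibling branches never attend to each other.
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:        # walk from node i up to the root
            mask[i, j] = True
            j = parents[j]
    return mask
```

Feeding all tree nodes through the target model with this mask makes each node's logits depend only on its own path, so every root-to-leaf candidate is verified simultaneously in one forward pass.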
EAGLE-3 — Scaling Up Speculative Decoding
The EAGLE series currently achieves the highest acceptance rate among speculative decoding methods. The core evolution from EAGLE-1 to EAGLE-3:
Key Improvements in EAGLE-3
EAGLE-1/2 predict feature vectors (the target model’s hidden states), then map them to tokens. EAGLE-3 switches to direct token prediction — predicting the next token directly while incorporating multi-layer feature fusion from the target model, leveraging Training-Time Test techniques for better training data scaling.
In benchmarks, EAGLE-3 achieves 6.5x speedup, approximately 1.4x better than EAGLE-2. Within the SGLang inference framework, throughput improves by 1.38x at batch=64.
Lookahead Decoding — No Draft Model
Fu et al. (2024) proposed a fundamentally different approach.
Based on Jacobi Iteration
No draft model or extra heads required — based on Jacobi iteration, multiple future token positions are guessed simultaneously, verified in parallel, and iteratively refined until convergence.
How it works:
- Initialize: Randomly guess tokens at the next K positions
- Each step: Run the target model’s forward pass on all positions in parallel
- If a position’s prediction matches the guess, that position has “converged”
- Unconverged positions are replaced with new predictions, and iteration continues
- Typically converges in 2-3 iterations
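The iteration loop above can be sketched in a few lines. `step_fn` is a hypothetical stand-in for the target model's parallel greedy forward pass, and the convergence behavior here is for the toy model only:

```python
import numpy as np

def jacobi_decode(step_fn, prefix, K, vocab, max_iters=10, rng=None):
    """Jacobi-style parallel decoding sketch.

    step_fn(tokens) returns the greedy next-token prediction for every
    position in one parallel pass -- a stand-in for the target model.
    Guesses for K future positions are refined simultaneously until
    they reach a fixed point.  Returns (tokens, iterations_used).
    """
    rng = rng or np.random.default_rng(0)
    guess = rng.integers(0, vocab, size=K).tolist()     # random initial guesses
    for it in range(1, max_iters + 1):
        seq = list(prefix) + guess
        preds = step_fn(seq)                            # parallel verify of all positions
        new_guess = [preds[len(prefix) - 1 + i] for i in range(K)]
        if new_guess == guess:                          # fixed point: all positions converged
            return guess, it
        guess = new_guess                               # keep refining unconverged positions
    return guess, max_iters
```

Each iteration costs one target forward pass, so the method only wins when the guesses settle in fewer iterations than the K sequential steps they replace.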
Advantages: Zero extra parameters, zero training cost, plug-and-play with any model
Limitations: Speedup is typically lower than Medusa/Eagle (number of iterations is unpredictable), practically around 1.5-2x
Comparison Summary
| Method | Extra Params | Training Cost | Speedup | Use Case |
|---|---|---|---|---|
| Draft-then-Verify | Separate draft model | Train draft model | 2-3x | With small model |
| Medusa | Multiple lightweight heads | Low | 2-3x | Quick deployment |
| MTP | Built-in at training | High (pretrain) | 2-3x | Train new model |
| Eagle | Lightweight decoder | Low | 3-4x | Max acceleration |
| Eagle-3 | Lightweight draft model | Low | ~6.5x | Max acceleration |
| Lookahead | None | Zero | 1.5-2x | Plug-and-play |
Selection Guide
- Already have a paired small model -> Draft-then-Verify
- Training a new model from scratch -> MTP (build multi-head prediction into pretraining)
- Have an existing large model, want quick acceleration -> Eagle > Medusa > Lookahead
- Don’t want to train anything -> Lookahead (plug-and-play)
All methods guarantee distribution consistency through rejection sampling — the acceleration is lossless, affecting only speed, not quality. The core difference lies in the source and quality of the draft, which determines the acceptance rate and final speedup.