Content on this site is AI-generated and may contain errors. If you find issues, please report them at GitHub Issues.

Sampling & Decoding — From Probabilities to Text


Updated 2026-04-06

A language model’s output is a probability distribution — predicted probabilities for every token in the vocabulary. But what we ultimately need is a concrete piece of text. The process of selecting actual tokens from a probability distribution is Sampling & Decoding.

Different sampling strategies produce vastly different text: deterministic Greedy suits code generation, while high-temperature Top-p is ideal for creative writing. Perplexity, meanwhile, is the core metric for measuring model prediction quality: it tells us how "confused" the model is at each position.

Perplexity — A Language Model’s “Confusion”

Information Theory Foundations

Before understanding Perplexity, we need two information theory concepts:

Entropy: Measures the uncertainty of a random variable

H(p) = -\sum_{x} p(x) \log_2 p(x)

Cross-Entropy: Measures how well model q approximates the true distribution p

H(p, q) = -\sum_{x} p(x) \log_2 q(x)

The better the model, the lower the cross-entropy.

Perplexity Definition

\text{PPL} = 2^{H(p,q)}

Intuition: On average, how many tokens the model has to choose from at each position. PPL = 1 means complete certainty (the model always knows the next token), while PPL = |V| (vocabulary size) means completely random guessing.

For a text sequence w_1, w_2, \ldots, w_N, the practical computation formula is:

\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \ln p(w_i | w_{<i})\right)
(Figure: per-token predictions for "The cat sat on the mat", with probabilities 45%, 52%, 60%, 70%, 65%, 55%, giving PPL = 1.75.)

Good model: high-confidence predictions → low perplexity (on average, choose from ~2 tokens per step)
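The formula can be checked directly on the example sequence; a minimal sketch in Python, using the per-token probabilities from the figure above:

```python
import math

# Per-token probabilities for "The cat sat on the mat" (from the example above)
probs = [0.45, 0.52, 0.60, 0.70, 0.65, 0.55]

# PPL = exp(-(1/N) * sum of log-probabilities)
ppl = math.exp(-sum(math.log(p) for p in probs) / len(probs))
print(round(ppl, 2))  # → 1.75
```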

Limitations of PPL

  • Low PPL does not equal high generation quality: A model can achieve low PPL by conservatively predicting high-frequency words, yet the generated text may lack diversity and creativity
  • Different tokenizers are incomparable: PPL values depend on how the vocabulary segments text; BPE and SentencePiece PPL cannot be directly compared
  • Length normalization matters: Without normalization, longer sentences will have systematically higher PPL

So, once the model outputs a probability distribution, how should we select the next token?

Greedy Decoding

The simplest strategy — pick the highest-probability token at each step:

w_t = \arg\max_{w} p(w | w_{<t})

Pros: Deterministic, fast, simple to implement

Problems:

  • Local optimum trap: Greedy choices don’t necessarily yield the globally optimal sequence
  • Degenerate repetition: Tends to produce repetitive loops like “the the the…”
  • Lack of diversity: The same input always produces the same output

Use cases: Classification, information extraction, and other tasks that don’t require creativity.
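As a concrete sketch, greedy decoding is just an argmax at every step; the logits below are illustrative toy values:

```python
def greedy_decode(step_logits):
    """Pick the highest-logit token id at each step (deterministic)."""
    return [max(range(len(z)), key=z.__getitem__) for z in step_logits]

# Toy logits for three steps over a 4-token vocabulary
logits = [[2.0, 0.5, 0.1, -1.0],
          [0.3, 1.8, 0.2, 0.0],
          [0.1, 0.1, 2.5, 0.4]]
print(greedy_decode(logits))  # → [0, 1, 2]
```

Because there is no randomness, the same logits always yield the same token sequence, which is exactly why greedy suits deterministic tasks and fails for creative ones.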

Temperature Scaling

Divide logits by a temperature parameter T before softmax:

p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
  • T < 1: Distribution becomes sharper → more deterministic, approaches Greedy
  • T > 1: Distribution becomes flatter → more random, increases diversity
  • T → 0: Degenerates to Greedy; T → ∞: Degenerates to a uniform distribution
(Figure: interactive temperature demo, from T→0 (Greedy) through T=1 (Standard) to T→∞ (Uniform). At T=1: Paris 65.5% (greedy choice), London 16.2%, Berlin 8.0%, Tokyo 4.4%, Rome 2.7%, Madrid 1.6%, Oslo 0.8%, Lima 0.5%, Baku 0.2%, Fiji 0.1%; entropy H ≈ 1.67 bits, perplexity 2ᴴ ≈ 3.19.)

T < 1 → Sharper distribution (high certainty); T > 1 → Flatter distribution (high diversity)

Relationship between Temperature and Perplexity: higher T → more dispersed selections → higher PPL of generated text.
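A minimal sketch of temperature-scaled softmax (the logits are illustrative toy values):

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax of logits / T, computed stably by subtracting the max."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.6, 0.9, 0.3]  # toy values
for T in (0.5, 1.0, 2.0):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
# Lower T concentrates probability on the argmax; higher T flattens it.
```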

Top-k Sampling

Fan et al. (2018) proposed keeping only the top k most probable tokens, setting all other probabilities to 0, re-normalizing, and then sampling.

The dilemma of choosing k:

  • k too small → Misses reasonable options (“I ate a ___” — there are many possible foods)
  • k too large → Includes extremely low-probability noise tokens

The core problem: k is fixed and cannot adapt to different contexts' levels of certainty. A high-certainty context ("the capital of France is") only needs to keep 1–2 tokens, while a low-certainty context needs to keep many more.
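A minimal Top-k sketch over raw logits (toy values; `top_k_sample` is an illustrative helper, not a library function):

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Keep the k highest-logit tokens, renormalize, then sample one."""
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return rng.choices(top, weights=[e / total for e in exps], k=1)[0]

random.seed(0)
# Toy logits over a 5-token vocabulary; only tokens 0-2 can ever be chosen
print([top_k_sample([2.0, 1.0, 0.5, -3.0, -5.0], k=3) for _ in range(5)])
```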

Top-p / Nucleus Sampling

Holtzman et al. (2020) proposed the solution — dynamic truncation:

Select the smallest set of tokens V_p such that the cumulative probability satisfies \sum_{w \in V_p} p(w) \geq p.

  • High-certainty context (“the capital of France is”) → Small set (1-2 tokens suffice)
  • Uncertain context (“I like to eat”) → Larger set (more tokens needed to reach the threshold)
(Figure: three truncation strategies on the same distribution. Greedy keeps 1 token: Paris at 100%. Top-k with k=5 keeps a fixed 5 tokens: Paris 34.4%, London 23.4%, Berlin 17.2%, Tokyo 14.1%, Rome 10.9%. Top-p with p=0.90 dynamically keeps 12 tokens (Σ ≥ 0.9): Paris 23.9%, London 16.3%, Berlin 12.0%, Tokyo 9.8%, Rome 7.6%, Madrid 6.5%, Seoul 5.4%, Oslo 4.3%, Lima 4.3%, Cairo 3.3%, Baku 3.3%, Kiev 3.3%.)

Top-p dynamically adjusts retained count based on distribution certainty — sharp distributions keep few, flat distributions keep many

Top-p is more adaptive than Top-k and works better in practice. Typical settings are p = 0.9–0.95.
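The dynamic truncation is easy to sketch; the `nucleus` helper and the toy distributions below are illustrative:

```python
import random

def nucleus(probs, p):
    """Smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept

def top_p_sample(probs, p=0.9, rng=random):
    """Renormalize over the nucleus and sample one token id."""
    kept = nucleus(probs, p)
    mass = sum(probs[i] for i in kept)
    return rng.choices(kept, weights=[probs[i] / mass for i in kept], k=1)[0]

sharp = [0.95, 0.03, 0.01, 0.01]   # confident context → tiny nucleus
flat = [0.25, 0.25, 0.25, 0.25]    # uncertain context → large nucleus
print(len(nucleus(sharp, 0.9)), len(nucleus(flat, 0.9)))  # → 1 4
```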

Beam Search

Unlike the sampling strategies above, Beam Search is a search algorithm. It maintains B candidate sequences (beams), expanding all possible next tokens at each step and keeping the top B sequences by total score:

\text{score}(w_{1:t}) = \sum_{i=1}^{t} \log p(w_i | w_{<i})

A length penalty is usually added to avoid biasing toward shorter sequences.

(Figure: beam-search animation. Starting from the initial token [START], the top-2 candidate sequences are expanded and kept at each step.)

Pros: Global search produces better results than Greedy

Cons: Computation is B times that of Greedy; generated text tends to be "safe" (lacking surprise and creativity); not suitable for open-ended generation

Use cases: Machine translation, text summarization, and other tasks requiring high accuracy.
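The search can be sketched on a toy model whose per-step distributions are fixed (context-independent, a simplification for illustration, so the top beam here coincides with the greedy path):

```python
import math

def beam_search(step_probs, beam_width=2):
    """Keep the beam_width best partial sequences by total log-probability."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for probs in step_probs:
        candidates = [(seq + [tok], score + math.log(p))
                      for seq, score in beams
                      for tok, p in enumerate(probs)]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

steps = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.7, 0.2, 0.1]]
best_seq, best_score = beam_search(steps)[0]
print(best_seq)  # → [0, 0, 0]
```

A real implementation would condition each step's distribution on the partial sequence and add the length penalty mentioned above.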

Repetition Penalty and Other Techniques

In practice, sampling strategies are usually not used in isolation:

  • Frequency Penalty: Reduce logits of already-generated tokens proportionally to their occurrence count
  • Presence Penalty: Subtract a fixed value from logits of all tokens that have appeared
  • Min-P Sampling: Set a minimum probability threshold; tokens with probability below p_min × p_max are filtered out

Example combination: Temperature + Top-p + Repetition Penalty together is the standard configuration for current LLM APIs (such as OpenAI and Anthropic).
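A sketch of such a combined pipeline (the function name, defaults, and toy logits are all illustrative, not any particular API):

```python
import math
import random

def sample_next(logits, generated, temperature=0.7, top_p=0.9,
                presence_penalty=0.5, rng=random):
    """One decoding step: presence penalty → temperature → top-p → sample."""
    # Presence penalty: subtract a fixed value from already-seen tokens
    seen = set(generated)
    logits = [z - presence_penalty if i in seen else z
              for i, z in enumerate(logits)]
    # Temperature-scaled softmax (max-subtracted for numerical stability)
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p truncation: smallest top set with cumulative mass >= top_p
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] / mass for i in kept], k=1)[0]

random.seed(0)
print(sample_next([3.0, 1.0, 0.5, 0.2], generated=[0, 0]))
```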

Strategy Selection Guide

| Scenario | Recommended Strategy | Parameter Reference |
| --- | --- | --- |
| Code generation | Greedy or low-temp Top-p | T=0.2, p=0.9 |
| Creative writing | High-temp Top-p | T=0.9, p=0.95 |
| Translation/Summarization | Beam Search | B=4, length_penalty=0.6 |
| Dialogue | Mid-temp Top-p + Repetition Penalty | T=0.7, p=0.9, rep=1.1 |

Summary

  • Perplexity measures model prediction quality and is the core metric for evaluating language models
  • Greedy is simple but degenerates; suited for deterministic tasks
  • Temperature controls distribution sharpness and serves as the foundation for other strategies
  • Top-k uses fixed truncation — simple but not adaptive enough
  • Top-p uses dynamic truncation and is the most commonly used sampling strategy today
  • Beam Search performs global search, suited for precision tasks like translation
  • In practice, multiple strategies are typically combined