Execution, Sampling & Context Management
Updated 2026-04-15
Series context: This is article #7 (the final one) in the llama.cpp source code walkthrough series, covering four topics: Execution (Ch.11), Sampling (Ch.12), Speculative Decoding (Ch.13), and Context Management (Ch.14). With computation graph construction, scheduling, and memory allocation all in place, we finally enter the actual computation and output stage. If you haven't read #6 Backend Scheduling, Op Fusion & Memory Allocation, it's recommended to read that first to understand how the graph is split across backends and how memory is allocated.
Part A: Execution
The previous chapters covered how the graph is built, how backends are assigned, and how memory is allocated. Now everything is ready, and we enter the actual execution stage.
process_ubatch() Overview
process_ubatch() is the complete execution entry point for a single ubatch, chaining together all the previous steps. The subsections below walk through its main stages: writing input data, executing the scheduled splits, and dispatching compute kernels by dtype.
Writing Input Data: set_inputs()
The input tensors created during graph construction only have shape information, with no actual data. set_inputs() is responsible for writing the ubatch data into these tensors:
```cpp
// src/llama-graph.cpp
void llm_graph_result::set_inputs(const llama_ubatch * ubatch) {
    for (auto & input : inputs) {
        input->set_input(ubatch); // Each input type has its own write logic
    }
}
```
Different input types have specialized handlers:
| Input Type | Written Content |
|---|---|
| llm_graph_input_embd | Token ID array or embedding vectors |
| llm_graph_input_pos | Position IDs (plain 1D or M-RoPE multi-dimensional positions) |
| llm_graph_input_attn_kv | KV cache indices, attention mask, K rotation positions |
| llm_graph_input_attn_no_cache | Full attention mask (encoder scenarios) |
| llm_graph_input_out_ids | Token indices that need logits output |
The most complex among these is the attention mask construction — it needs to fill each element with 0 (allow attention) or -∞ (mask out) based on sequence IDs, causality, sliding window, and other conditions.
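As a rough illustration of that rule, here is a standalone sketch of the per-element decision (illustrative only — the real logic lives in the `llm_graph_input_attn_*` `set_input()` implementations and works on whole tensors; the names and the sliding-window convention are simplified):

```cpp
// Standalone sketch of the masking rule: 0.0f = attention allowed, -INF = masked out.
#include <cmath>
#include <cstdint>

float mask_value(int32_t seq_i, int32_t pos_i,     // query token: sequence + position
                 int32_t seq_j, int32_t pos_j,     // cache cell:  sequence + position
                 bool causal, int32_t n_swa) {     // n_swa = 0 means no sliding window
    if (seq_i != seq_j)                      return -INFINITY; // belongs to another sequence
    if (causal && pos_j > pos_i)             return -INFINITY; // future token (causal mask)
    if (n_swa > 0 && pos_i - pos_j >= n_swa) return -INFINITY; // outside the sliding window
    return 0.0f;                                               // attention allowed
}
```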
Computation Graph Execution: compute_splits()
graph_compute() ultimately calls ggml_backend_sched_graph_compute_async(), which iterates through all splits and executes them:
```cpp
// ggml/src/ggml-backend.cpp (simplified)
for (int i = 0; i < sched->n_splits; i++) {
    struct ggml_backend_sched_split * split = &sched->splits[i];
    ggml_backend_t backend = sched->backends[split->backend_id];

    // Step 1: Copy the inputs required by this split
    for (int j = 0; j < split->n_inputs; j++) {
        ggml_tensor * input     = split->inputs[j];
        ggml_tensor * input_cpy = tensor_copy(input, split->backend_id, cur_copy);
        // MoE optimization: only copy activated experts (see below)
        // Regular tensors: try async copy, fall back to sync copy on failure
        ggml_backend_tensor_copy_async(input_backend, backend, input, input_cpy);
    }

    // Step 2: Execute this split's subgraph
    ggml_backend_graph_compute_async(backend, &split->graph);

    // Step 3: Record event (for the next split to wait on)
    if (split->n_inputs > 0) {
        ggml_backend_event_record(events[split->backend_id][cur_copy], backend);
    }
}
```
Each split’s execution is asynchronous — computation is initiated without waiting for completion, using an event mechanism that lets downstream splits wait when needed. This is the foundation of pipeline parallelism.
MoE Expert On-Demand Loading
For MoE (Mixture of Experts) models, copying expert weights across backends is the main bottleneck — for example, Mixtral 8x7B has 8 experts per layer, but each token only uses 2 of them. Fully copying all 8 experts’ weights is extremely wasteful.
llama.cpp implements selective expert copying during split execution:
Implementation details (ggml/src/ggml-backend.cpp):
- Detects a `GGML_OP_MUL_MAT_ID` operation with weights in a CPU host buffer
- Reads expert indices from the gate's output tensor
- Uses a bitset to mark which experts are selected
- Groups consecutive activated experts, copying one group per `ggml_backend_tensor_set_async()` call
- Adds a small amount of padding (512 bytes) after each group to satisfy CUDA MMQ kernel alignment requirements
In an 8-choose-2 MoE model, this optimization can reduce CPU-to-GPU weight transfer by approximately 75%.
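A standalone sketch of the selection step (the real logic lives in the split-execution path in ggml/src/ggml-backend.cpp; the function and variable names here are illustrative):

```cpp
// Illustrative sketch: collect the set of experts any token in the ubatch routed to.
#include <bitset>
#include <cstdint>
#include <vector>

constexpr int N_EXPERT = 8;

// ids: the gate's output, one row of selected expert indices per token
std::bitset<N_EXPERT> collect_active_experts(const std::vector<std::vector<int32_t>> & ids) {
    std::bitset<N_EXPERT> active;
    for (const auto & row : ids) {
        for (int32_t e : row) {
            active.set(e);      // mark every expert that is actually used
        }
    }
    return active;              // e.g. only 2 of 8 bits set for a single-token decode
}

// Only weight rows whose bit is set are copied to the GPU, grouped into
// contiguous ranges so each range becomes one async copy (plus alignment padding).
```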
dtype Dispatch
When an operation in the computation graph (e.g., MUL_MAT) is actually executed, the correct compute kernel must be selected based on the tensor’s data type. For the CPU backend, this is done through the type_traits_cpu lookup table:
```cpp
// ggml/src/ggml-cpu/ggml-cpu.c (simplified)
// Each quantization type registers its own vec_dot function and pairing type
type_traits_cpu[GGML_TYPE_Q4_0] = {
    .vec_dot      = ggml_vec_dot_q4_0_q8_0, // Q4_0 x Q8_0 dot product
    .vec_dot_type = GGML_TYPE_Q8_0,         // Input needs conversion to Q8_0
    .from_float   = quantize_row_q4_0,
};
type_traits_cpu[GGML_TYPE_Q4_K] = {
    .vec_dot      = ggml_vec_dot_q4_K_q8_K, // Q4_K x Q8_K
    .vec_dot_type = GGML_TYPE_Q8_K,
    .nrows        = 2,                      // ARM MATMUL_INT8 can parallelize 2 rows
};
type_traits_cpu[GGML_TYPE_F16] = {
    .vec_dot      = ggml_vec_dot_f16,       // F16 x F16
    .vec_dot_type = GGML_TYPE_F16,
};
```
The dtype dispatch flow when executing MUL_MAT is the actual execution path of the dot product pairing mentioned in #1 GGUF Binary Parsing: weights stay in their original quantized format (Q4_0), inputs are quantized at runtime to the pairing type (Q8_0), and then a specially optimized dot product kernel for that pair is called.
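To make the pattern concrete, here is a toy, self-contained version of the dispatch-table idea (not ggml code — real kernels operate on packed quantized blocks rather than plain floats):

```cpp
// Toy illustration of the dispatch-table pattern behind type_traits_cpu.
#include <cstddef>
#include <cstdio>

using from_float_t = void  (*)(const float * src, float * dst, size_t n);
using vec_dot_t    = float (*)(const float * a, const float * b, size_t n);

// Stand-in for quantize_row_q8_0: here just a copy, in ggml a real quantization.
static void convert_row(const float * src, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = src[i];
}
// Stand-in for ggml_vec_dot_q4_0_q8_0.
static float dot_row(const float * a, const float * b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

struct traits { from_float_t from_float; vec_dot_t vec_dot; };
static const traits traits_q4_0 = { convert_row, dot_row };   // keyed by the weight dtype

int main() {
    const float weights[4] = { 1, 2, 3, 4 };   // stays in its storage format
    const float input[4]   = { 1, 1, 1, 1 };   // F32 activations from the previous op
    float converted[4];

    const traits & t = traits_q4_0;            // 1. table lookup on the weight dtype
    t.from_float(input, converted, 4);         // 2. convert input to the pairing type
    std::printf("%.1f\n", t.vec_dot(weights, converted, 4)); // 3. matched kernel -> 10.0
    return 0;
}
```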
Prefill vs Decode Threading Differences
graph_compute() uses different threading strategies depending on whether it’s prefill (processing multiple tokens) or decode (generating a single token):
```cpp
// src/llama-context.cpp
ggml_status llama_context::graph_compute(ggml_cgraph * gf, bool batched) {
    int n_threads        = batched ? cparams.n_threads_batch : cparams.n_threads;
    ggml_threadpool_t tp = batched ? threadpool_batch : threadpool;

    ggml_backend_cpu_set_threadpool(backend_cpu, tp);
    return ggml_backend_sched_graph_compute_async(sched.get(), gf);
}
```
- Prefill (`batched=true`): uses `n_threads_batch` (typically more threads), because large-batch matrix operations can fully utilize multi-threaded parallelism
- Decode (`batched=false`): uses `n_threads` (typically fewer threads), because single-token computation is small and too many threads add synchronization overhead
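These two counts come from the context parameters. A minimal sketch of setting them at context creation (the field names are from llama.h; the values are illustrative, and `model` is assumed to be a model loaded earlier):

```cpp
// Configure separate prefill/decode thread counts when creating the context.
llama_context_params cparams = llama_context_default_params();
cparams.n_threads       = 8;    // decode: single-token steps, modest parallelism
cparams.n_threads_batch = 16;   // prefill: large-batch matmuls, use more cores

llama_context * ctx = llama_init_from_model(model, cparams);
```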
Part B: Sampling
The model forward pass outputs a logits vector of length n_vocab — the unnormalized log-probability for each token. Sampling’s task is to select the next token from this enormous probability distribution. llama.cpp uses a configurable sampler chain to accomplish this.
Default Sampler Chain Construction Order
common_sampler_init() (common/sampling.cpp) builds the sampler chain in the following order. This order is critical — each sampler modifies logits/probabilities before passing them to the next:
logit_bias -> penalties -> DRY -> top-n-sigma -> top-k -> typical-p -> top-p -> min-p -> XTC -> temperature -> dist
The default order is defined in common/common.h:
```cpp
std::vector<enum common_sampler_type> samplers = {
    COMMON_SAMPLER_TYPE_PENALTIES,
    COMMON_SAMPLER_TYPE_DRY,
    COMMON_SAMPLER_TYPE_TOP_N_SIGMA,
    COMMON_SAMPLER_TYPE_TOP_K,
    COMMON_SAMPLER_TYPE_TYPICAL_P,
    COMMON_SAMPLER_TYPE_TOP_P,
    COMMON_SAMPLER_TYPE_MIN_P,
    COMMON_SAMPLER_TYPE_XTC,
    COMMON_SAMPLER_TYPE_TEMPERATURE,
};
// dist (sample from distribution) is implicitly appended at the end
```
Here, logit_bias is fixed at the front (added separately outside the chain), and dist (final random sampling) is fixed at the end. Users can rearrange the middle portion’s order via the --samplers parameter.
Core Samplers Overview
| Sampler | Function | Key Parameter |
|---|---|---|
| logit_bias | Add/subtract logit bias for specified tokens | --logit-bias TOKEN+BIAS |
| penalties | Repetition/frequency/presence penalties | --repeat-penalty, --frequency-penalty, --presence-penalty |
| DRY | Detect repeating sequences, apply exponential penalty | --dry-multiplier, --dry-base |
| top-k | Keep only the k tokens with highest logits | --top-k |
| top-p | Truncate when cumulative probability reaches p | --top-p |
| min-p | Keep only tokens with probability >= p x max_prob | --min-p |
| temperature | Scale logits: logit /= temp | --temp |
| dist | Randomly sample from the final probability distribution | --seed |
The effect of temperature is intuitive: temp < 1 sharpens the distribution (more deterministic), temp > 1 flattens it (more random), and temp = 0 degenerates to greedy (pick the highest-probability token).
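As a quick worked example of that scaling (self-contained, with made-up logits for three tokens):

```cpp
// Softmax with temperature over three illustrative logits.
#include <cmath>
#include <cstdio>

int main() {
    const float logits[3] = { 2.0f, 1.0f, 0.0f };
    for (float temp : { 0.5f, 1.0f, 2.0f }) {
        float p[3], sum = 0.0f;
        for (int i = 0; i < 3; ++i) { p[i] = std::exp(logits[i] / temp); sum += p[i]; }
        std::printf("temp=%.1f :", temp);
        for (int i = 0; i < 3; ++i) std::printf(" %.2f", p[i] / sum);
        std::printf("\n");
    }
    // temp=0.5 : 0.87 0.12 0.02   (sharper, more deterministic)
    // temp=1.0 : 0.67 0.24 0.09   (unchanged)
    // temp=2.0 : 0.51 0.31 0.19   (flatter, more random)
    return 0;
}
```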
Interactive Demo
The component below shows how 5 core samplers in the chain (top-k -> top-p -> min-p -> temperature -> dist) progressively affect the probability distribution. The full chain also includes logit_bias, penalties, DRY, top-n-sigma, typical-p, XTC, and other samplers that work similarly but target more specialized scenarios — they are omitted here to highlight the core truncation and scaling logic:
*Interactive demo: llama.cpp Sampler Chain Visualizer — raw (unnormalized) logits for 12 representative tokens pass through each sampler and are progressively filtered and transformed.*
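For readers following along in code rather than in the demo, here is a minimal sketch of the same five-stage chain built through the public llama.h sampler API (parameter values are illustrative, and `ctx` is assumed to be an existing context that has just decoded a batch):

```cpp
// Build top-k -> top-p -> min-p -> temperature -> dist via the llama.h sampler API.
llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, 1));  // min_keep = 1
llama_sampler_chain_add(chain, llama_sampler_init_min_p(0.05f, 1));
llama_sampler_chain_add(chain, llama_sampler_init_temp(0.80f));
llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

llama_token id = llama_sampler_sample(chain, ctx, -1);  // sample from the last logits
llama_sampler_free(chain);
```

common_sampler_init() builds essentially this chain (plus the more specialized samplers listed above) from the command-line parameters.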
Sampling Execution Flow
common_sampler_sample() is the entry function for sampling:
```cpp
// common/sampling.cpp (simplified)
llama_token common_sampler_sample(common_sampler * gsmpl, llama_context * ctx, int idx,
                                  bool grammar_first) {
    // 1. Get logits from context
    gsmpl->set_logits(ctx, idx);

    // 2. Apply reasoning budget (if enabled)
    llama_sampler_apply(rbudget, &cur_p);

    // 3. Grammar-first mode: apply grammar constraints first
    if (grammar_first && grammar_should_apply(gsmpl)) {
        llama_sampler_apply(grmr, &cur_p);
    }

    // 4. Execute the entire sampler chain
    llama_sampler_apply(chain, &cur_p);

    // 5. Get the selected token
    id = cur_p.data[cur_p.selected].id;

    // 6. Rejection sampling (non grammar-first mode)
    if (!grammar_first && grammar_should_apply(gsmpl)) {
        // Check if the selected token conforms to grammar
        // If not -> resample (grammar first, then chain)
    }

    return id;
}
```
Grammar Constraints
Grammar sampling is one of llama.cpp’s most powerful features — it constrains the model to only output text conforming to a specified grammar.
How it works: The grammar maintains a set of parsing stacks, tracking the current position within a BNF grammar. For each candidate token, the grammar checks whether that token can advance at least one parsing stack. Tokens that cannot advance any stack have their logit set to -∞, ensuring they are never selected.
Two application strategies:
- Grammar-first: Filter then sample, guarantees success on the first try, but needs to check grammar for all tokens (slower)
- Rejection sampling (default): Sample then check, usually passes on the first attempt (faster), only falls back when it doesn’t
JSON Schema to BNF Conversion
The most common grammar use case in practice is constraining output to JSON. llama.cpp provides json_schema_to_grammar() (common/json-schema-to-grammar.cpp) to automatically convert a JSON Schema into BNF grammar rules:
JSON Schema:

```json
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" }
  },
  "required": ["name", "age"]
}
```

BNF Grammar:

```
root    ::= "{" ws "name" ws ":" ws string "," ws "age" ws ":" ws integer "}" ws
string  ::= "\"" [^"\\]* "\""
integer ::= [0-9]+
ws      ::= [ \t\n]*
```
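A sketch of how the generated grammar is typically hooked into a sampler chain (hedged: the exact json_schema_to_grammar() signature in common/json-schema-to-grammar.h takes an nlohmann JSON value, and `schema`, `vocab`, and `chain` are assumed to exist):

```cpp
// Convert the schema to BNF text, then add a grammar sampler to the chain.
std::string gbnf = json_schema_to_grammar(schema);                  // BNF text like the above
llama_sampler * grmr = llama_sampler_init_grammar(vocab, gbnf.c_str(), "root");
llama_sampler_chain_add(chain, grmr);  // tokens that violate the grammar get logit -inf
```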
Lazy Grammar: In some scenarios, the grammar doesn’t need to take effect from the very first token — for example, the model might first output some chain-of-thought reasoning before generating structured output. Lazy grammar waits for a trigger pattern to appear before activating the constraint.
Output: Token to Text
After sampling yields a token ID, it needs to be converted back into human-readable text:
```cpp
// common/common.cpp
std::string common_token_to_piece(const llama_vocab * vocab, llama_token token, bool special) {
    std::string piece;
    piece.resize(piece.capacity());
    const int n_chars = llama_token_to_piece(vocab, token, &piece[0], piece.size(), 0, special);
    if (n_chars < 0) {
        piece.resize(-n_chars); // Buffer too small, resize and retry
        llama_token_to_piece(vocab, token, &piece[0], piece.size(), 0, special);
    } else {
        piece.resize(n_chars);
    }
    return piece;
}
```
A single token may correspond to a complete word, a subword, or even half a UTF-8 character. During streaming output, care must be taken to handle incomplete multibyte characters — wait until a complete character is assembled before outputting.
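A minimal sketch of that buffering (illustrative, not the llama.cpp implementation): scan back from the end of the pending bytes and hold back anything that is not yet a complete UTF-8 character.

```cpp
// Return the length of the longest prefix of `s` made of complete UTF-8 characters;
// bytes beyond it should be carried over and merged with the next token's piece.
#include <cstddef>
#include <string>

static size_t utf8_complete_prefix(const std::string & s) {
    size_t i = s.size(), back = 0;
    // Walk back over up to 3 trailing continuation bytes (10xxxxxx)
    while (i > 0 && back < 3 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i; ++back;
    }
    if (i == 0) return 0;                               // only continuation bytes so far
    unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    size_t need = (lead & 0x80) == 0x00 ? 1             // ASCII
                : (lead & 0xE0) == 0xC0 ? 2
                : (lead & 0xF0) == 0xE0 ? 3
                : (lead & 0xF8) == 0xF0 ? 4 : 1;
    // If the last character is complete, emit everything; otherwise hold it back.
    return (s.size() - (i - 1) >= need) ? s.size() : i - 1;
}
```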
Part C: Speculative Decoding
The bottleneck of autoregressive decoding is that only one token can be generated at a time, and each token requires a full forward pass. Speculative decoding breaks this bottleneck through a “small model drafts quickly, large model verifies at once” approach.
Core Idea
The key insight: the target model can verify N candidate tokens in a single forward pass (since their positions are known), while the draft model’s N forward passes are much faster than the target model’s. If most guesses are accepted, it’s equivalent to generating multiple tokens with a single target forward pass.
Draft-Verify Loop
The complete speculative decoding loop (examples/speculative-simple/speculative-simple.cpp):
```cpp
while (true) {
    // 1. Draft phase: small model generates N candidates
    llama_tokens draft = common_speculative_draft(spec, params_spec, prompt_tgt, id_last);

    // 2. Build target batch: [last token, candidate_0, candidate_1, ..., candidate_N-1]
    common_batch_add(batch_tgt, id_last, n_past++, {0}, true);
    for (size_t i = 0; i < draft.size(); ++i) {
        common_batch_add(batch_tgt, draft[i], n_past + i, {0}, true);
    }

    // 3. Target model: one forward pass
    llama_decode(ctx_tgt, batch_tgt);

    // 4. Verify: compare target sampling results against draft one by one
    const auto ids = common_sampler_sample_and_accept_n(smpl, ctx_tgt, draft);
    // ids contains accepted tokens (at least 1, at most N+1)

    // 5. Clean up unaccepted KV cache entries
    llama_memory_seq_rm(llama_get_memory(ctx_tgt), 0, n_past + ids.size(), -1);

    // 6. Output accepted tokens, update state
    for (const auto & id : ids) {
        output(id);
    }
}
```
Verification Algorithm
Verification has two modes:
Greedy verification (temperature = 0): Directly compare target sampling results with draft tokens; reject on mismatch.
Stochastic verification (temperature > 0): Uses the standard speculative decoding acceptance test:
```
For each candidate token t_i:
    r = random(0, 1)
    if r <= p_target(t_i) / p_draft(t_i):
        accept t_i
    else:
        reject t_i, resample from the residual distribution:
            p_residual(t) = max(0, p_target(t) - p_draft(t))
            normalize and sample
```
This formula guarantees that regardless of draft model quality, the final output distribution is strictly equal to the target model’s distribution — speculative decoding does not change output quality, only generation speed.
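A self-contained sketch of that acceptance rule (illustrative; it operates on plain probability vectors rather than llama.cpp's token-data arrays):

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Accept or reject one drafted token given both models' probabilities for it.
bool accept_token(float p_target, float p_draft, std::mt19937 & rng) {
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    return u(rng) <= p_target / p_draft;          // i.e. accept with prob min(1, p_t / p_d)
}

// On rejection, resample one token from the residual distribution max(0, p_target - p_draft).
int resample_residual(const std::vector<float> & p_target,
                      const std::vector<float> & p_draft, std::mt19937 & rng) {
    std::vector<float> residual(p_target.size());
    for (size_t i = 0; i < residual.size(); ++i) {
        residual[i] = std::max(0.0f, p_target[i] - p_draft[i]);
    }
    std::discrete_distribution<int> dist(residual.begin(), residual.end()); // normalizes weights
    return dist(rng);
}
```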
Draft Strategies
llama.cpp supports multiple draft generation strategies:
| Strategy | Principle | Use Case |
|---|---|---|
| draft model | Independent small model generates candidates | Highest quality, requires extra model |
| eagle3 | EAGLE-3 speculative head (attached to target model) | Experimental |
| ngram-simple | Find matching n-gram patterns in context | Zero extra cost, good for repetitive text |
| ngram-map-k | Use hash map to index n-grams in context | More efficient lookup |
| ngram-map-k4v | Each key stores up to 4 different m-gram values | Higher acceptance rate |
| ngram-mod | Rolling hash lookup, shared memory pool (~16MB) | Constant complexity |
| ngram-cache | Three-level n-gram cache (context/dynamic/static) | Most flexible, can load external statistics |
Multiple strategies can be enabled simultaneously — they are tried in priority order, the first strategy that produces a non-empty draft is used, and the rest serve as fallback.
Draft Model Confidence Control
The draft model doesn’t blindly generate a fixed N tokens — it monitors each generated token’s confidence (top probability):
```
When generating candidate tokens:
    if top_1_prob < p_min (default 0.75):
        stop generating (confidence too low, continuing to guess only wastes time)
```
This adaptive mechanism lets the draft guess more when confident and stop early when uncertain.
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| --draft N | 16 | Maximum draft token count |
| --draft-min N | 0 | Minimum draft token count |
| --draft-p-min P | 0.75 | Draft confidence threshold (stop drafting below this) |
| --model-draft PATH | — | GGUF file path for the draft model |
| --gpu-layers-draft N | — | GPU layer count for the draft model |
Performance Gains
The speedup from speculative decoding depends on:
- Draft acceptance rate: The closer the draft model’s distribution is to the target model’s, the higher the acceptance rate
- Draft speed ratio: The faster the draft model (relative to target), the greater the benefit
- Draft length: An appropriate N balances draft cost against verification gains
In typical scenarios (Llama-3-8B target + Llama-3-1B draft), post-prefill decode speed can improve by 2-3x.
Part D: Context Management
A core constraint of LLM inference is the context window — the maximum number of tokens the model can “see.” The KV cache stores Key/Value vectors for all processed tokens and is the core data structure for context management.
KV Cache Internal Structure
Each Transformer layer’s KV cache consists of two 3D tensors:
```
K cache: [n_embd_k_gqa, kv_size, n_stream]
V cache: [n_embd_v_gqa, kv_size, n_stream]
```
Here kv_size is the cache capacity (total number of slots), and n_stream controls multi-sequence isolation.
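A back-of-envelope size estimate follows directly from those shapes (worked example assuming Llama-3-8B-like dimensions: 32 layers, 8 KV heads x 128 head dim, f16 cache, single stream):

```cpp
#include <cstdio>

int main() {
    const long n_layer      = 32;
    const long n_embd_k_gqa = 8 * 128;   // = n_embd_v_gqa = 1024
    const long kv_size      = 8192;      // cache slots (context length)
    const long bytes_f16    = 2;

    long per_token = n_layer * 2 /* K and V */ * n_embd_k_gqa * bytes_f16;
    std::printf("per token: %ld KiB, total: %ld MiB\n",
                per_token / 1024, per_token * kv_size / (1024 * 1024));
    // per token: 128 KiB, total: 1024 MiB
    return 0;
}
```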
The cache metadata is managed by llama_kv_cells:
```cpp
// src/llama-kv-cells.h (core fields)
class llama_kv_cells {
    vector<llama_pos>              pos;   // Position of each cell (-1 = free)
    vector<llama_pos>              shift; // Accumulated position offset (pending application)
    vector<bitset<LLAMA_MAX_SEQ>>  seq;   // Which sequences each cell belongs to
    set<uint32_t>                  used;  // Index set of occupied cells
};
```
Each cell is a slot in the KV cache — it records the sequence membership and position information of the token stored at that location. During attention computation, the model uses each cell’s pos and seq information to determine which historical tokens the current token can “see.”
KV Cache Quantization
The KV cache uses f16 storage by default, but can be configured to quantized formats to save VRAM:
```
--cache-type-k TYPE    K cache type (default f16)
--cache-type-v TYPE    V cache type (default f16)
```
Supported types include f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1. Using q4_0 can reduce KV cache memory usage by approximately 75%, but introduces quantization error. Note: quantizing the V cache requires Flash Attention to be enabled.
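The ~75% figure follows from the q4_0 block layout: each block of 32 values stores 16 bytes of 4-bit nibbles plus a 2-byte f16 scale, i.e. 18 bytes versus 64 bytes for f16 — roughly a 72% reduction before any padding.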
Context Shift: Emergency Mechanism When Context Is Full
When the KV cache is full, no new tokens can be written. The simplest approach is to discard the entire cache and start over, but this would lose all context. Context shift is a compromise — discard the middle portion, keep the head and tail:
Implementation code (tools/completion/completion.cpp):
```cpp
// 1. Remove KV cache in range [n_keep, n_keep + n_discard)
llama_memory_seq_rm (mem, 0, params.n_keep,             params.n_keep + n_discard);

// 2. Shift positions of [n_keep + n_discard, n_past) backward by n_discard
llama_memory_seq_add(mem, 0, params.n_keep + n_discard, n_past, -n_discard);
```
n_keep is typically set to the length of the system prompt — ensuring the system prompt is never discarded. Information from the discarded middle tokens is permanently lost.
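With concrete numbers (illustrative: a 4-token system prompt, 16 tokens processed so far, discarding 6), the two calls look like this:

```cpp
// Keep positions [0, 4), drop [4, 10), renumber [10, 16) down to [4, 10).
llama_memory_t mem = llama_get_memory(ctx);
const int n_keep = 4, n_discard = 6, n_past = 16;

llama_memory_seq_rm (mem, 0, n_keep,             n_keep + n_discard);   // drop pos [4, 10)
llama_memory_seq_add(mem, 0, n_keep + n_discard, n_past, -n_discard);   // pos 10..15 -> 4..9
// n_past becomes 10; the next token is written at position 10.
```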
For models using RoPE positional encoding, shifting positions is not just a metadata change — it requires performing inverse-rotation and re-rotation (K-shift) on the RoPE encodings stored in the quantized K cache to ensure positional information remains correct.
Prompt Cache: Saving and Restoring State
If every conversation prefills the system prompt from scratch, that’s a significant waste. Prompt cache allows saving the KV cache state to a file and loading it directly next time:
```cpp
// Save: KV cache + token sequence -> file
llama_state_save_file(ctx, "cache.bin", tokens, n_tokens);

// Restore: file -> KV cache + token sequence
llama_state_load_file(ctx, "cache.bin", tokens_out, capacity, &n_tokens_out);
```
The file format is binary: cell metadata (position, sequence IDs) and per-layer K/V tensor data are stored by stream. During restoration, data type and dimension compatibility are validated.
Typical usage: process the system prompt once, save the cache -> for each subsequent conversation, load the cache and start generating from the user input, skipping the system prompt’s prefill time.
Sequence Operation API
llama.cpp provides a flexible set of sequence operation APIs to manage the KV cache:
| API | Function |
|---|---|
| seq_rm(seq, p0, p1) | Remove tokens in range [p0, p1) from a sequence |
| seq_cp(src, dst, p0, p1) | Copy a sequence's KV cache (for prefix sharing in parallel decoding) |
| seq_keep(seq) | Keep only the specified sequence, clear everything else |
| seq_add(seq, p0, p1, delta) | Position offset (the core operation of context shift) |
These operations only modify metadata (cell pos/seq information), without copying actual tensor data — making them very fast. The only exception is cross-stream seq_cp, which requires actually copying tensor data.
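A typical use of seq_cp is prefix sharing in parallel decoding: decode a shared prompt once into sequence 0, then fork it to the other client sequences (a minimal sketch using the llama_memory_* API shown earlier; the sizes are illustrative):

```cpp
// Fork a shared 128-token prompt from sequence 0 into three more client sequences.
llama_memory_t mem = llama_get_memory(ctx);
const int n_prompt  = 128;   // tokens already decoded into sequence 0
const int n_clients = 4;

for (llama_seq_id s = 1; s < n_clients; ++s) {
    llama_memory_seq_cp(mem, 0, s, 0, n_prompt);   // metadata-only copy of the prefix
}
// Each sequence s now continues independently from position n_prompt.
```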
Summary
This article covered the final four stages of the llama.cpp inference pipeline:
| Stage | Core Mechanism | Key Code Entry Point |
|---|---|---|
| Execution | Per-split async computation + MoE selective copying + dtype dispatch | process_ubatch() / compute_splits() |
| Sampling | Configurable sampler chain + Grammar constraints + JSON Schema to BNF | common_sampler_sample() |
| Speculative Decoding | Small model draft -> large model single-pass verify -> acceptance rate control | common_speculative_draft() |
| Context Management | KV cache quantization + context shift + prompt cache + sequence operation API | llama_kv_cells / seq_rm() / seq_add() |
This concludes all 8 articles in the llama.cpp source code walkthrough series, covering the complete journey from GGUF file parsing to final text output. Return to the series overview to see the full map.