Execution, Sampling & Context Management
Updated 2026-04-15
Series context: This is article #7 (the final one) in the llama.cpp source code walkthrough series, covering four topics: Execution (Ch.11), Sampling (Ch.12), Speculative Decoding (Ch.13), and Context Management (Ch.14). With computation graph construction, scheduling, and memory allocation all in place, we finally enter the actual computation and output stage. If you haven't read #6 Backend Scheduling, Op Fusion & Memory Allocation, it's recommended to read that first to understand how the graph is split across backends and how memory is allocated.
Part A: Execution
The previous chapters covered how the graph is built, how backends are assigned, and how memory is allocated. Now everything is ready, and we enter the actual execution stage.
process_ubatch() Overview
process_ubatch() is the complete execution entry point for a single ubatch, chaining together all the previous steps. The subsections below walk through its main stages: writing input data, executing the scheduled splits, and dispatching compute kernels by dtype.
Writing Input Data: set_inputs()
The input tensors created during graph construction only have shape information, with no actual data. set_inputs() is responsible for writing the ubatch data into these tensors:
```cpp
// src/llama-graph.cpp
void llm_graph_result::set_inputs(const llama_ubatch * ubatch) {
    for (auto & input : inputs) {
        input->set_input(ubatch); // Each input type has its own write logic
    }
}
```
Different input types have specialized handlers:
| Input Type | Written Content |
|---|---|
| llm_graph_input_embd | Token ID array or embedding vectors |
| llm_graph_input_pos | Position IDs (plain 1D or M-RoPE multi-dimensional positions) |
| llm_graph_input_attn_kv | KV cache indices, attention mask, K rotation positions |
| llm_graph_input_attn_no_cache | Full attention mask (encoder scenarios) |
| llm_graph_input_out_ids | Token indices that need logits output |
The most complex among these is the attention mask construction — it needs to fill each element with 0 (allow attention) or -∞ (mask out) based on sequence IDs, causality, sliding window, and other conditions.
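As a rough illustration of that rule, here is a standalone sketch of the per-element decision (illustrative only — the real logic lives in the `llm_graph_input_attn_*` `set_input()` implementations and works on whole tensors; the names and the sliding-window convention are simplified):

```cpp
// Standalone sketch of the masking rule: 0.0f = attention allowed, -INF = masked out.
#include <cmath>
#include <cstdint>

float mask_value(int32_t seq_i, int32_t pos_i,     // query token: sequence + position
                 int32_t seq_j, int32_t pos_j,     // cache cell:  sequence + position
                 bool causal, int32_t n_swa) {     // n_swa = 0 means no sliding window
    if (seq_i != seq_j)                      return -INFINITY; // belongs to another sequence
    if (causal && pos_j > pos_i)             return -INFINITY; // future token (causal mask)
    if (n_swa > 0 && pos_i - pos_j >= n_swa) return -INFINITY; // outside the sliding window
    return 0.0f;                                               // attention allowed
}
```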
Computation Graph Execution: compute_splits()
graph_compute() ultimately calls ggml_backend_sched_graph_compute_async(), which iterates through all splits and executes them:
```cpp
// ggml/src/ggml-backend.cpp (simplified)
for (int i = 0; i < sched->n_splits; i++) {
    struct ggml_backend_sched_split * split = &sched->splits[i];
    ggml_backend_t backend = sched->backends[split->backend_id];

    // Step 1: Copy the inputs required by this split
    for (int j = 0; j < split->n_inputs; j++) {
        ggml_tensor * input     = split->inputs[j];
        ggml_tensor * input_cpy = tensor_copy(input, split->backend_id, cur_copy);
        // MoE optimization: only copy activated experts (see below)
        // Regular tensors: try async copy, fall back to sync copy on failure
        ggml_backend_tensor_copy_async(input_backend, backend, input, input_cpy);
    }

    // Step 2: Execute this split's subgraph
    ggml_backend_graph_compute_async(backend, &split->graph);

    // Step 3: Record event (for the next split to wait on)
    if (split->n_inputs > 0) {
        ggml_backend_event_record(events[split->backend_id][cur_copy], backend);
    }
}
```
Each split’s execution is asynchronous — computation is initiated without waiting for completion, using an event mechanism that lets downstream splits wait when needed. This is the foundation of pipeline parallelism.
MoE Expert On-Demand Loading
For MoE (Mixture of Experts) models, copying expert weights across backends is the main bottleneck — for example, Mixtral 8x7B has 8 experts per layer, but each token only uses 2 of them. Fully copying all 8 experts’ weights is extremely wasteful.
llama.cpp implements selective expert copying during split execution:
Implementation details (ggml/src/ggml-backend.cpp):
- Detects a `GGML_OP_MUL_MAT_ID` operation with weights in a CPU host buffer
- Reads expert indices from the gate's output tensor
- Uses a bitset to mark which experts are selected
- Groups consecutive activated experts, copying one group per `ggml_backend_tensor_set_async()` call
- Adds a small amount of padding (512 bytes) after each group to satisfy CUDA MMQ kernel alignment requirements
In an 8-choose-2 MoE model, this optimization can reduce CPU-to-GPU weight transfer by approximately 75%.
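A standalone sketch of the selection step (the real logic lives in the split-execution path in ggml/src/ggml-backend.cpp; the function and variable names here are illustrative):

```cpp
// Illustrative sketch: collect the set of experts any token in the ubatch routed to.
#include <bitset>
#include <cstdint>
#include <vector>

constexpr int N_EXPERT = 8;

// ids: the gate's output, one row of selected expert indices per token
std::bitset<N_EXPERT> collect_active_experts(const std::vector<std::vector<int32_t>> & ids) {
    std::bitset<N_EXPERT> active;
    for (const auto & row : ids) {
        for (int32_t e : row) {
            active.set(e);      // mark every expert that is actually used
        }
    }
    return active;              // e.g. only 2 of 8 bits set for a single-token decode
}

// Only weight rows whose bit is set are copied to the GPU, grouped into
// contiguous ranges so each range becomes one async copy (plus alignment padding).
```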
dtype Dispatch
When an operation in the computation graph (e.g., MUL_MAT) is actually executed, the correct compute kernel must be selected based on the tensor’s data type. For the CPU backend, this is done through the type_traits_cpu lookup table:
```cpp
// ggml/src/ggml-cpu/ggml-cpu.c (simplified)
// Each quantization type registers its own vec_dot function and pairing type
type_traits_cpu[GGML_TYPE_Q4_0] = {
    .vec_dot      = ggml_vec_dot_q4_0_q8_0, // Q4_0 x Q8_0 dot product
    .vec_dot_type = GGML_TYPE_Q8_0,         // Input needs conversion to Q8_0
    .from_float   = quantize_row_q4_0,
};
type_traits_cpu[GGML_TYPE_Q4_K] = {
    .vec_dot      = ggml_vec_dot_q4_K_q8_K, // Q4_K x Q8_K
    .vec_dot_type = GGML_TYPE_Q8_K,
    .nrows        = 2,                      // ARM MATMUL_INT8 can parallelize 2 rows
};
type_traits_cpu[GGML_TYPE_F16] = {
    .vec_dot      = ggml_vec_dot_f16,       // F16 x F16
    .vec_dot_type = GGML_TYPE_F16,
};
```
The dtype dispatch flow when executing MUL_MAT is the actual execution path of the dot product pairing mentioned in #1 GGUF Binary Parsing: weights stay in their original quantized format (Q4_0), inputs are quantized at runtime to the pairing type (Q8_0), and then a specially optimized dot product kernel for that pair is called.
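To make the pattern concrete, here is a toy, self-contained version of the dispatch-table idea (not ggml code — real kernels operate on packed quantized blocks rather than plain floats):

```cpp
// Toy illustration of the dispatch-table pattern behind type_traits_cpu.
#include <cstddef>
#include <cstdio>

using from_float_t = void  (*)(const float * src, float * dst, size_t n);
using vec_dot_t    = float (*)(const float * a, const float * b, size_t n);

// Stand-in for quantize_row_q8_0: here just a copy, in ggml a real quantization.
static void convert_row(const float * src, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = src[i];
}
// Stand-in for ggml_vec_dot_q4_0_q8_0.
static float dot_row(const float * a, const float * b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

struct traits { from_float_t from_float; vec_dot_t vec_dot; };
static const traits traits_q4_0 = { convert_row, dot_row };   // keyed by the weight dtype

int main() {
    const float weights[4] = { 1, 2, 3, 4 };   // stays in its storage format
    const float input[4]   = { 1, 1, 1, 1 };   // F32 activations from the previous op
    float converted[4];

    const traits & t = traits_q4_0;            // 1. table lookup on the weight dtype
    t.from_float(input, converted, 4);         // 2. convert input to the pairing type
    std::printf("%.1f\n", t.vec_dot(weights, converted, 4)); // 3. matched kernel -> 10.0
    return 0;
}
```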
Prefill vs Decode Threading Differences
graph_compute() uses different threading strategies depending on whether it’s prefill (processing multiple tokens) or decode (generating a single token):
```cpp
// src/llama-context.cpp
ggml_status llama_context::graph_compute(ggml_cgraph * gf, bool batched) {
    int n_threads        = batched ? cparams.n_threads_batch : cparams.n_threads;
    ggml_threadpool_t tp = batched ? threadpool_batch : threadpool;

    ggml_backend_cpu_set_threadpool(backend_cpu, tp);
    return ggml_backend_sched_graph_compute_async(sched.get(), gf);
}
```
- Prefill (`batched=true`): uses `n_threads_batch` (typically more threads), because large-batch matrix operations can fully utilize multi-threaded parallelism
- Decode (`batched=false`): uses `n_threads` (typically fewer threads), because single-token computation is small and too many threads add synchronization overhead
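These two counts come from the context parameters. A minimal sketch of setting them at context creation (the field names are from llama.h; the values are illustrative, and `model` is assumed to be a model loaded earlier):

```cpp
// Configure separate prefill/decode thread counts when creating the context.
llama_context_params cparams = llama_context_default_params();
cparams.n_threads       = 8;    // decode: single-token steps, modest parallelism
cparams.n_threads_batch = 16;   // prefill: large-batch matmuls, use more cores

llama_context * ctx = llama_init_from_model(model, cparams);
```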
Part B: Sampling
The model forward pass outputs a logits vector of length n_vocab — the unnormalized log-probability for each token. Sampling’s task is to select the next token from this enormous probability distribution. llama.cpp uses a configurable sampler chain to accomplish this.
Default Sampler Chain Construction Order
common_sampler_init() (common/sampling.cpp) builds the sampler chain in the following order. This order is critical — each sampler modifies logits/probabilities before passing them to the next:
logit_bias -> penalties -> DRY -> top-n-sigma -> top-k -> typical-p -> top-p -> min-p -> XTC -> temperature -> dist
The default order is defined in common/common.h:
```cpp
std::vector<enum common_sampler_type> samplers = {
    COMMON_SAMPLER_TYPE_PENALTIES,
    COMMON_SAMPLER_TYPE_DRY,
    COMMON_SAMPLER_TYPE_TOP_N_SIGMA,
    COMMON_SAMPLER_TYPE_TOP_K,
    COMMON_SAMPLER_TYPE_TYPICAL_P,
    COMMON_SAMPLER_TYPE_TOP_P,
    COMMON_SAMPLER_TYPE_MIN_P,
    COMMON_SAMPLER_TYPE_XTC,
    COMMON_SAMPLER_TYPE_TEMPERATURE,
};
// dist (sample from distribution) is implicitly appended at the end
```
Here, logit_bias is fixed at the front (added separately outside the chain), and dist (final random sampling) is fixed at the end. Users can rearrange the middle portion’s order via the --samplers parameter.
Core Samplers Overview
| Sampler | Function | Key Parameter |
|---|---|---|
| logit_bias | Add/subtract logit bias for specified tokens | --logit-bias TOKEN+BIAS |
| penalties | Repetition/frequency/presence penalties | --repeat-penalty, --frequency-penalty, --presence-penalty |
| DRY | Detect repeating sequences, apply exponential penalty | --dry-multiplier, --dry-base |
| top-k | Keep only the k tokens with highest logits | --top-k |
| top-p | Truncate when cumulative probability reaches p | --top-p |
| min-p | Keep only tokens with probability >= p x max_prob | --min-p |
| temperature | Scale logits: logit /= temp | --temp |
| dist | Randomly sample from the final probability distribution | --seed |
The effect of temperature is intuitive: temp < 1 sharpens the distribution (more deterministic), temp > 1 flattens it (more random), and temp = 0 degenerates to greedy (pick the highest-probability token).
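As a quick worked example of that scaling (self-contained, with made-up logits for three tokens):

```cpp
// Softmax with temperature over three illustrative logits.
#include <cmath>
#include <cstdio>

int main() {
    const float logits[3] = { 2.0f, 1.0f, 0.0f };
    for (float temp : { 0.5f, 1.0f, 2.0f }) {
        float p[3], sum = 0.0f;
        for (int i = 0; i < 3; ++i) { p[i] = std::exp(logits[i] / temp); sum += p[i]; }
        std::printf("temp=%.1f :", temp);
        for (int i = 0; i < 3; ++i) std::printf(" %.2f", p[i] / sum);
        std::printf("\n");
    }
    // temp=0.5 : 0.87 0.12 0.02   (sharper, more deterministic)
    // temp=1.0 : 0.67 0.24 0.09   (unchanged)
    // temp=2.0 : 0.51 0.31 0.19   (flatter, more random)
    return 0;
}
```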
Interactive Demo
The component below shows how 5 core samplers in the chain (top-k -> top-p -> min-p -> temperature -> dist) progressively affect the probability distribution. The full chain also includes logit_bias, penalties, DRY, top-n-sigma, typical-p, XTC, and other samplers that work similarly but target more specialized scenarios — they are omitted here to highlight the core truncation and scaling logic:
*Interactive demo: llama.cpp Sampler Chain Visualizer — raw (unnormalized) logits for 12 representative tokens pass through each sampler and are progressively filtered and transformed.*
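For readers following along in code rather than in the demo, here is a minimal sketch of the same five-stage chain built through the public llama.h sampler API (parameter values are illustrative, and `ctx` is assumed to be an existing context that has just decoded a batch):

```cpp
// Build top-k -> top-p -> min-p -> temperature -> dist via the llama.h sampler API.
llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, 1));  // min_keep = 1
llama_sampler_chain_add(chain, llama_sampler_init_min_p(0.05f, 1));
llama_sampler_chain_add(chain, llama_sampler_init_temp(0.80f));
llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

llama_token id = llama_sampler_sample(chain, ctx, -1);  // sample from the last logits
llama_sampler_free(chain);
```

common_sampler_init() builds essentially this chain (plus the more specialized samplers listed above) from the command-line parameters.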
Sampling Execution Flow
common_sampler_sample() is the entry function for sampling:
```cpp
// common/sampling.cpp (simplified)
llama_token common_sampler_sample(common_sampler * gsmpl, llama_context * ctx, int idx,
                                  bool grammar_first) {
    // 1. Get logits from context
    gsmpl->set_logits(ctx, idx);

    // 2. Apply reasoning budget (if enabled)
    llama_sampler_apply(rbudget, &cur_p);

    // 3. Grammar-first mode: apply grammar constraints first
    if (grammar_first && grammar_should_apply(gsmpl)) {
        llama_sampler_apply(grmr, &cur_p);
    }

    // 4. Execute the entire sampler chain
    llama_sampler_apply(chain, &cur_p);

    // 5. Get the selected token
    id = cur_p.data[cur_p.selected].id;

    // 6. Rejection sampling (non grammar-first mode)
    if (!grammar_first && grammar_should_apply(gsmpl)) {
        // Check if the selected token conforms to grammar
        // If not -> resample (grammar first, then chain)
    }

    return id;
}
```
Grammar Constraints
Grammar sampling is one of llama.cpp’s most powerful features — it constrains the model to only output text conforming to a specified grammar.
How it works: The grammar maintains a set of parsing stacks, tracking the current position within a BNF grammar. For each candidate token, the grammar checks whether that token can advance at least one parsing stack. Tokens that cannot advance any stack have their logit set to -∞, ensuring they are never selected.
Two application strategies:
- Grammar-first: Filter then sample, guarantees success on the first try, but needs to check grammar for all tokens (slower)
- Rejection sampling (default): Sample then check, usually passes on the first attempt (faster), only falls back when it doesn’t
JSON Schema to BNF Conversion
The most common grammar use case in practice is constraining output to JSON. llama.cpp provides json_schema_to_grammar() (common/json-schema-to-grammar.cpp) to automatically convert a JSON Schema into BNF grammar rules:
JSON Schema:

```json
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" }
  },
  "required": ["name", "age"]
}
```

BNF Grammar:

```
root    ::= "{" ws "name" ws ":" ws string "," ws "age" ws ":" ws integer "}" ws
string  ::= "\"" [^"\\]* "\""
integer ::= [0-9]+
ws      ::= [ \t\n]*
```
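A sketch of how the generated grammar is typically hooked into a sampler chain (hedged: the exact json_schema_to_grammar() signature in common/json-schema-to-grammar.h takes an nlohmann JSON value, and `schema`, `vocab`, and `chain` are assumed to exist):

```cpp
// Convert the schema to BNF text, then add a grammar sampler to the chain.
std::string gbnf = json_schema_to_grammar(schema);                  // BNF text like the above
llama_sampler * grmr = llama_sampler_init_grammar(vocab, gbnf.c_str(), "root");
llama_sampler_chain_add(chain, grmr);  // tokens that violate the grammar get logit -inf
```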
Lazy Grammar: In some scenarios, the grammar doesn’t need to take effect from the very first token — for example, the model might first output some chain-of-thought reasoning before generating structured output. Lazy grammar waits for a trigger pattern to appear before activating the constraint.
Output: Token to Text
After sampling yields a token ID, it needs to be converted back into human-readable text:
```cpp
// common/common.cpp
std::string common_token_to_piece(const llama_vocab * vocab, llama_token token, bool special) {
    std::string piece;
    piece.resize(piece.capacity());
    const int n_chars = llama_token_to_piece(vocab, token, &piece[0], piece.size(), 0, special);
    if (n_chars < 0) {
        piece.resize(-n_chars); // Buffer too small, resize and retry
        llama_token_to_piece(vocab, token, &piece[0], piece.size(), 0, special);
    } else {
        piece.resize(n_chars);
    }
    return piece;
}
```
A single token may correspond to a complete word, a subword, or even half a UTF-8 character. During streaming output, care must be taken to handle incomplete multibyte characters — wait until a complete character is assembled before outputting.
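A minimal sketch of that buffering (illustrative, not the llama.cpp implementation): scan back from the end of the pending bytes and hold back anything that is not yet a complete UTF-8 character.

```cpp
// Return the length of the longest prefix of `s` made of complete UTF-8 characters;
// bytes beyond it should be carried over and merged with the next token's piece.
#include <cstddef>
#include <string>

static size_t utf8_complete_prefix(const std::string & s) {
    size_t i = s.size(), back = 0;
    // Walk back over up to 3 trailing continuation bytes (10xxxxxx)
    while (i > 0 && back < 3 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i; ++back;
    }
    if (i == 0) return 0;                               // only continuation bytes so far
    unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    size_t need = (lead & 0x80) == 0x00 ? 1             // ASCII
                : (lead & 0xE0) == 0xC0 ? 2
                : (lead & 0xF0) == 0xE0 ? 3
                : (lead & 0xF8) == 0xF0 ? 4 : 1;
    // If the last character is complete, emit everything; otherwise hold it back.
    return (s.size() - (i - 1) >= need) ? s.size() : i - 1;
}
```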
Part C: Speculative Decoding
The bottleneck of autoregressive decoding is that only one token can be generated at a time, and each token requires a full forward pass. Speculative decoding breaks this bottleneck through a “small model drafts quickly, large model verifies at once” approach.
Core Idea
The key insight: the target model can verify N candidate tokens in a single forward pass (since their positions are known), while the draft model’s N forward passes are much faster than the target model’s. If most guesses are accepted, it’s equivalent to generating multiple tokens with a single target forward pass.
Draft-Verify Loop
The complete speculative decoding loop (examples/speculative-simple/speculative-simple.cpp):
```cpp
while (true) {
    // 1. Draft phase: small model generates N candidates
    llama_tokens draft = common_speculative_draft(spec, params_spec, prompt_tgt, id_last);

    // 2. Build target batch: [last token, candidate_0, candidate_1, ..., candidate_N-1]
    common_batch_add(batch_tgt, id_last, n_past++, {0}, true);
    for (size_t i = 0; i < draft.size(); ++i) {
        common_batch_add(batch_tgt, draft[i], n_past + i, {0}, true);
    }

    // 3. Target model: one forward pass
    llama_decode(ctx_tgt, batch_tgt);

    // 4. Verify: compare target sampling results against draft one by one
    const auto ids = common_sampler_sample_and_accept_n(smpl, ctx_tgt, draft);
    // ids contains accepted tokens (at least 1, at most N+1)

    // 5. Clean up unaccepted KV cache entries
    llama_memory_seq_rm(llama_get_memory(ctx_tgt), 0, n_past + ids.size(), -1);

    // 6. Output accepted tokens, update state
    for (const auto & id : ids) {
        output(id);
    }
}
```
Verification Algorithm
Verification has two modes:
Greedy verification (temperature = 0): Directly compare target sampling results with draft tokens; reject on mismatch.
Stochastic verification (temperature > 0): Uses the standard speculative decoding acceptance test:
```
For each candidate token t_i:
    r = random(0, 1)
    if r <= p_target(t_i) / p_draft(t_i):
        accept t_i
    else:
        reject t_i, resample from the residual distribution:
            p_residual(t) = max(0, p_target(t) - p_draft(t))
            normalize and sample
```
This formula guarantees that regardless of draft model quality, the final output distribution is strictly equal to the target model’s distribution — speculative decoding does not change output quality, only generation speed.
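A self-contained sketch of that acceptance rule (illustrative; it operates on plain probability vectors rather than llama.cpp's token-data arrays):

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Accept or reject one drafted token given both models' probabilities for it.
bool accept_token(float p_target, float p_draft, std::mt19937 & rng) {
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    return u(rng) <= p_target / p_draft;          // i.e. accept with prob min(1, p_t / p_d)
}

// On rejection, resample one token from the residual distribution max(0, p_target - p_draft).
int resample_residual(const std::vector<float> & p_target,
                      const std::vector<float> & p_draft, std::mt19937 & rng) {
    std::vector<float> residual(p_target.size());
    for (size_t i = 0; i < residual.size(); ++i) {
        residual[i] = std::max(0.0f, p_target[i] - p_draft[i]);
    }
    std::discrete_distribution<int> dist(residual.begin(), residual.end()); // normalizes weights
    return dist(rng);
}
```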
Draft Strategies
llama.cpp supports multiple draft generation strategies:
| Strategy | Principle | Use Case |
|---|---|---|
| draft model | Independent small model generates candidates | Highest quality, requires extra model |
| eagle3 | EAGLE-3 speculative head (attached to target model) | Experimental |
| ngram-simple | Find matching n-gram patterns in context | Zero extra cost, good for repetitive text |
| ngram-map-k | Use hash map to index n-grams in context | More efficient lookup |
| ngram-map-k4v | Each key stores up to 4 different m-gram values | Higher acceptance rate |
| ngram-mod | Rolling hash lookup, shared memory pool (~16MB) | Constant complexity |
| ngram-cache | Three-level n-gram cache (context/dynamic/static) | Most flexible, can load external statistics |
Multiple strategies can be enabled simultaneously — they are tried in priority order, the first strategy that produces a non-empty draft is used, and the rest serve as fallback.
Draft Model Confidence Control
The draft model doesn’t blindly generate a fixed N tokens — it monitors each generated token’s confidence (top probability):
```
When generating candidate tokens:
    if top_1_prob < p_min (default 0.75):
        stop generating (confidence too low, continuing to guess only wastes time)
```
This adaptive mechanism lets the draft guess more when confident and stop early when uncertain.
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| --draft N | 16 | Maximum draft token count |
| --draft-min N | 0 | Minimum draft token count |
| --draft-p-min P | 0.75 | Draft confidence threshold (stop drafting below this) |
| --model-draft PATH | — | GGUF file path for the draft model |
| --gpu-layers-draft N | — | GPU layer count for the draft model |
Performance Gains
The speedup from speculative decoding depends on:
- Draft acceptance rate: The closer the draft model’s distribution is to the target model’s, the higher the acceptance rate
- Draft speed ratio: The faster the draft model (relative to target), the greater the benefit
- Draft length: An appropriate N balances draft cost against verification gains
In typical scenarios (Llama-3-8B target + Llama-3-1B draft), post-prefill decode speed can improve by 2-3x.
Part D: Context Management
A core constraint of LLM inference is the context window — the maximum number of tokens the model can “see.” The KV cache stores Key/Value vectors for all processed tokens and is the core data structure for context management.
KV Cache Internal Structure
Each Transformer layer’s KV cache consists of two 3D tensors:
```
K cache: [n_embd_k_gqa, kv_size, n_stream]
V cache: [n_embd_v_gqa, kv_size, n_stream]
```
Here kv_size is the cache capacity (total number of slots), and n_stream controls multi-sequence isolation.
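A back-of-envelope size estimate follows directly from those shapes (worked example assuming Llama-3-8B-like dimensions: 32 layers, 8 KV heads x 128 head dim, f16 cache, single stream):

```cpp
#include <cstdio>

int main() {
    const long n_layer      = 32;
    const long n_embd_k_gqa = 8 * 128;   // = n_embd_v_gqa = 1024
    const long kv_size      = 8192;      // cache slots (context length)
    const long bytes_f16    = 2;

    long per_token = n_layer * 2 /* K and V */ * n_embd_k_gqa * bytes_f16;
    std::printf("per token: %ld KiB, total: %ld MiB\n",
                per_token / 1024, per_token * kv_size / (1024 * 1024));
    // per token: 128 KiB, total: 1024 MiB
    return 0;
}
```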
The cache metadata is managed by llama_kv_cells:
```cpp
// src/llama-kv-cells.h (core fields)
class llama_kv_cells {
    vector<llama_pos>              pos;   // Position of each cell (-1 = free)
    vector<llama_pos>              shift; // Accumulated position offset (pending application)
    vector<bitset<LLAMA_MAX_SEQ>>  seq;   // Which sequences each cell belongs to
    set<uint32_t>                  used;  // Index set of occupied cells
};
```
Each cell is a slot in the KV cache — it records the sequence membership and position information of the token stored at that location. During attention computation, the model uses each cell’s pos and seq information to determine which historical tokens the current token can “see.”
KV Cache Quantization
The KV cache uses f16 storage by default, but can be configured to quantized formats to save VRAM:
```
--cache-type-k TYPE    K cache type (default f16)
--cache-type-v TYPE    V cache type (default f16)
```
Supported types include f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1. Using q4_0 can reduce KV cache memory usage by approximately 75%, but introduces quantization error. Note: quantizing the V cache requires Flash Attention to be enabled.
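The ~75% figure follows from the q4_0 block layout: each block of 32 values stores 16 bytes of 4-bit nibbles plus a 2-byte f16 scale, i.e. 18 bytes versus 64 bytes for f16 — roughly a 72% reduction before any padding.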
Context Shift: Emergency Mechanism When Context Is Full
When the KV cache is full, no new tokens can be written. The simplest approach is to discard the entire cache and start over, but this would lose all context. Context shift is a compromise — discard the middle portion, keep the head and tail:
Implementation code (tools/completion/completion.cpp):
```cpp
// 1. Remove KV cache in range [n_keep, n_keep + n_discard)
llama_memory_seq_rm (mem, 0, params.n_keep,             params.n_keep + n_discard);

// 2. Shift positions of [n_keep + n_discard, n_past) backward by n_discard
llama_memory_seq_add(mem, 0, params.n_keep + n_discard, n_past, -n_discard);
```
n_keep is typically set to the length of the system prompt — ensuring the system prompt is never discarded. Information from the discarded middle tokens is permanently lost.
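With concrete numbers (illustrative: a 4-token system prompt, 16 tokens processed so far, discarding 6), the two calls look like this:

```cpp
// Keep positions [0, 4), drop [4, 10), renumber [10, 16) down to [4, 10).
llama_memory_t mem = llama_get_memory(ctx);
const int n_keep = 4, n_discard = 6, n_past = 16;

llama_memory_seq_rm (mem, 0, n_keep,             n_keep + n_discard);   // drop pos [4, 10)
llama_memory_seq_add(mem, 0, n_keep + n_discard, n_past, -n_discard);   // pos 10..15 -> 4..9
// n_past becomes 10; the next token is written at position 10.
```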
For models using RoPE positional encoding, shifting positions is not just a metadata change — it requires performing inverse-rotation and re-rotation (K-shift) on the RoPE encodings stored in the quantized K cache to ensure positional information remains correct.
Prompt Cache: Saving and Restoring State
If every conversation prefills the system prompt from scratch, that’s a significant waste. Prompt cache allows saving the KV cache state to a file and loading it directly next time:
```cpp
// Save: KV cache + token sequence -> file
llama_state_save_file(ctx, "cache.bin", tokens, n_tokens);

// Restore: file -> KV cache + token sequence
llama_state_load_file(ctx, "cache.bin", tokens_out, capacity, &n_tokens_out);
```
The file format is binary: cell metadata (position, sequence IDs) and per-layer K/V tensor data are stored by stream. During restoration, data type and dimension compatibility are validated.
Typical usage: process the system prompt once, save the cache -> for each subsequent conversation, load the cache and start generating from the user input, skipping the system prompt’s prefill time.
Sequence Operation API
llama.cpp provides a flexible set of sequence operation APIs to manage the KV cache:
| API | Function |
|---|---|
| seq_rm(seq, p0, p1) | Remove tokens in range [p0, p1) from a sequence |
| seq_cp(src, dst, p0, p1) | Copy a sequence's KV cache (for prefix sharing in parallel decoding) |
| seq_keep(seq) | Keep only the specified sequence, clear everything else |
| seq_add(seq, p0, p1, delta) | Position offset (the core operation of context shift) |
These operations only modify metadata (cell pos/seq information), without copying actual tensor data — making them very fast. The only exception is cross-stream seq_cp, which requires actually copying tensor data.
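A typical use of seq_cp is prefix sharing in parallel decoding: decode a shared prompt once into sequence 0, then fork it to the other client sequences (a minimal sketch using the llama_memory_* API shown earlier; the sizes are illustrative):

```cpp
// Fork a shared 128-token prompt from sequence 0 into three more client sequences.
llama_memory_t mem = llama_get_memory(ctx);
const int n_prompt  = 128;   // tokens already decoded into sequence 0
const int n_clients = 4;

for (llama_seq_id s = 1; s < n_clients; ++s) {
    llama_memory_seq_cp(mem, 0, s, 0, n_prompt);   // metadata-only copy of the prefix
}
// Each sequence s now continues independently from position n_prompt.
```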
Summary
This article covered the final four stages of the llama.cpp inference pipeline:
| Stage | Core Mechanism | Key Code Entry Point |
|---|---|---|
| Execution | Per-split async computation + MoE selective copying + dtype dispatch | process_ubatch() / compute_splits() |
| Sampling | Configurable sampler chain + Grammar constraints + JSON Schema to BNF | common_sampler_sample() |
| Speculative Decoding | Small model draft -> large model single-pass verify -> acceptance rate control | common_speculative_draft() |
| Context Management | KV cache quantization + context shift + prompt cache + sequence operation API | llama_kv_cells / seq_rm() / seq_add() |
This concludes all 8 articles in the llama.cpp source code walkthrough series, covering the complete journey from GGUF file parsing to final text output. Return to the series overview to see the full map.