Batch, Ubatch & the Decoding Main Loop
Updated 2026-04-15
Series context: This is article #4 in the llama.cpp source code deep-dive series, covering how tokens are organized from user submission to actual computation: batch, ubatch, and the decode main loop. If you haven't read the series overview and #3 Warmup, Tokenization & Chat Template, we recommend starting there to build the big picture before diving into this chapter.
After tokenization, we have a token sequence. But before feeding it into the model for computation, these tokens need to be organized into batches. llama.cpp employs a two-level batching mechanism: a user-submitted logical batch and an internal micro-batch (ubatch) used for actual computation.
Part A: llama_batch — The User-Facing Interface
llama_batch is the core data structure for user interaction with llama.cpp, defined in the public header include/llama.h:
typedef struct llama_batch {
    int32_t n_tokens;

    llama_token  *  token;    // token ID array
    float        *  embd;     // or provide embeddings directly (mutually exclusive with token)
    llama_pos    *  pos;      // position of each token in its sequence
    int32_t      *  n_seq_id; // number of sequences each token belongs to
    llama_seq_id ** seq_id;   // list of sequence IDs for each token
    int8_t       *  logits;   // which tokens need logits output
} llama_batch;
Field descriptions:
| Field | Description |
|---|---|
| token | Token ID array, size n_tokens. Mutually exclusive with embd |
| embd | Embedding vectors provided directly, size n_tokens * n_embd. Used for external-encoder scenarios |
| pos | Position of each token within its sequence. Can be NULL (auto-increment) |
| n_seq_id / seq_id | Each token can belong to multiple sequences simultaneously, enabling prefix sharing |
| logits | Marks which tokens need logits computed (sampling only needs the last token) |
llama.cpp provides two convenience functions for creating batches:
// Simplest usage: a contiguous token sequence, auto-fills pos and seq_id
struct llama_batch llama_batch_get_one(llama_token * tokens, int32_t n_tokens);
// Allocate an empty batch that can hold n_tokens, requires manual population
struct llama_batch llama_batch_init(int32_t n_tokens, int32_t embd, int32_t n_seq_max);
llama_batch_get_one() suits the simplest scenario: given a set of contiguous tokens, it auto-fills position and sequence information. llama_batch_init() allocates an empty batch for the caller to populate manually — essential for advanced scenarios like parallel sequence decoding.
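For the manual path, the sketch below (not llama.cpp code; common/common.h ships an equivalent helper, common_batch_add(), which appears in the example in Part E) shows what populating the fields of a llama_batch_init()-allocated batch looks like. The helper name batch_add_token and the sizes are illustrative; ctx/model setup is assumed elsewhere:

#include "llama.h"

// Append one token to a batch allocated with llama_batch_init().
// This mirrors what the common_batch_add() helper in common/common.h does.
static void batch_add_token(llama_batch & batch, llama_token id, llama_pos pos,
                            llama_seq_id seq, bool want_logits) {
    const int32_t i = batch.n_tokens;
    batch.token   [i]    = id;
    batch.pos     [i]    = pos;
    batch.n_seq_id[i]    = 1;                 // this token belongs to exactly one sequence
    batch.seq_id  [i][0] = seq;
    batch.logits  [i]    = want_logits ? 1 : 0;
    batch.n_tokens++;
}

// Usage sketch: room for 64 tokens, token IDs only (embd = 0), up to 4 sequences.
// llama_batch batch = llama_batch_init(64, /*embd=*/0, /*n_seq_max=*/4);
// batch_add_token(batch, some_token, /*pos=*/0, /*seq=*/0, /*want_logits=*/false);
// ...
// llama_decode(ctx, batch);
// llama_batch_free(batch);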
Part B: llama_ubatch — The Internal Micro-Batch
A user-submitted batch can be very large (e.g., a 2048-token prompt) and cannot be sent to the GPU all at once. Internally, llama.cpp splits it into smaller ubatches (micro-batches), defined in src/llama-batch.h:
struct llama_ubatch {
    uint32_t n_tokens;     // total tokens = n_seq_tokens * n_seqs
    uint32_t n_seq_tokens; // tokens per sequence set
    uint32_t n_seqs;       // number of sequence sets
    uint32_t n_seqs_unq;   // number of unique sequence IDs

    llama_token  *  token;      // [n_tokens]
    llama_pos    *  pos;        // [n_tokens]
    int32_t      *  n_seq_id;   // [n_tokens]
    llama_seq_id ** seq_id;     // [n_tokens]
    llama_seq_id *  seq_id_unq; // [n_seqs_unq] unique sequence IDs
    int32_t      *  seq_idx;    // sequence ID -> index mapping within the ubatch
    int8_t       *  output;     // [n_tokens] output flags
    // ...
};
The key difference from llama_batch is that a ubatch explicitly tracks the structure of its sequence sets: n_seq_tokens and n_seqs define the “how many sequences × how many tokens per sequence” matrix shape, which is critical for the subsequent attention computation and KV cache management.
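As a purely illustrative check (not llama.cpp code), the shape invariant stated in the struct comment can be written out directly:

#include <cassert>
#include <cstdint>

// Invariant from the llama_ubatch comment above: total tokens = n_seq_tokens * n_seqs.
static void check_ubatch_shape(uint32_t n_seqs, uint32_t n_seq_tokens, uint32_t n_tokens) {
    assert(n_tokens == n_seqs * n_seq_tokens);
}

// e.g. a ubatch covering 3 sequence sets of 4 tokens each must report n_tokens == 12:
// check_ubatch_shape(/*n_seqs=*/3, /*n_seq_tokens=*/4, /*n_tokens=*/12);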
Two-Level Batch Parameters
These two levels are controlled by two command-line parameters:
--batch-size N Logical batch size (default 2048) — max tokens a user can submit at once
--ubatch-size N Physical batch size (default 512) — max tokens the GPU processes at once
They correspond to n_batch and n_ubatch in llama_context_params:
// common/common.h defaults
int32_t n_batch = 2048; // logical max batch
int32_t n_ubatch = 512; // physical max batch
Key constraint: n_ubatch <= n_batch. n_batch determines how many new tokens the KV cache can accept at once, while n_ubatch determines the GPU’s actual computation granularity. Reducing n_ubatch lowers peak GPU memory usage but increases the number of splits (more ubatches = more kernel launch overhead).
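As a minimal sketch of where these values end up (using the context-creation names from the public llama.h API; exact function names can differ across versions, and model loading is assumed to happen elsewhere):

#include "llama.h"

// Assumes `model` was loaded earlier, e.g. from a GGUF file.
llama_context_params cparams = llama_context_default_params();
cparams.n_ctx    = 4096;
cparams.n_batch  = 2048; // logical: max tokens accepted by a single llama_decode() call
cparams.n_ubatch = 512;  // physical: max tokens per GPU pass; keep <= n_batch
llama_context * ctx = llama_init_from_model(model, cparams);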
Part C: Three Splitting Algorithms
The llama_batch_allocr class is responsible for splitting a batch into ubatches. It provides three splitting strategies:
1. split_simple(n_ubatch) — Sequential Splitting
The most straightforward approach: take tokens sequentially from start to end, up to n_ubatch tokens at a time. Ignores sequence boundaries.
Use case: single-sequence prompt prefill. For example, a 1000-token prompt with n_ubatch=512 splits into two ubatches (512 + 488).
2. split_equal(n_ubatch, sequential) — Equal-Length Splitting
Ensures that each sequence set within the same ubatch has an equal number of tokens. The algorithm first selects non-overlapping sequence sets, then takes the same number of tokens from each set until the total approaches the n_ubatch limit or a sequence is exhausted.
Use case: multi-sequence parallel prefill, where balanced progress across sequences is needed.
3. split_seq(n_ubatch) — Per-Sequence Splitting
Each ubatch contains tokens from only one sequence set. Different sequences are never mixed into the same ubatch.
Use case: autoregressive decoding phase, where each sequence generates independently.
Strategy Selection
The choice of splitting strategy occurs in the KV cache’s init_batch():
// src/llama-kv-cache.cpp
llama_memory_context_ptr llama_kv_cache::init_batch(
        llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all) {
    balloc.split_reset();

    std::vector<llama_ubatch> ubatches;
    while (true) {
        // Single-stream mode uses split_simple, multi-stream uses split_equal
        auto ubatch = n_stream == 1
            ? balloc.split_simple(n_ubatch)
            : balloc.split_equal(n_ubatch, true);

        if (ubatch.n_tokens == 0) {
            break;
        }

        ubatches.push_back(std::move(ubatch));
    }

    // Reserve KV cache slots for each ubatch
    auto sinfos = prepare(ubatches);
    // ...
}
The logic is clean: single-sequence streams (n_stream == 1) use split_simple, multi-sequence streams use split_equal. split_seq is used in other specific code paths.
Part D: Prompt Prefill Chunking Example
Suppose the user submits a 1500-token prompt with n_batch=2048 and n_ubatch=512. The entire prompt fits in a single logical batch, which is split internally into three ubatches of 512 + 512 + 476 tokens.
Each ubatch independently goes through the full build graph, alloc, and compute flow, but they share the same KV cache — KV data written by earlier ubatches is visible to the attention computation of later ubatches.
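When a prompt exceeds n_batch, the caller additionally has to chunk it before calling llama_decode(). A minimal sketch of that outer loop (similar in spirit to what the bundled examples do; ctx and tokens are assumed to be prepared elsewhere):

#include "llama.h"
#include <algorithm>
#include <vector>

// Feed a long prompt in n_batch-sized chunks; inside llama_decode() each chunk
// is further split into n_ubatch-sized ubatches as described above.
static int prefill(llama_context * ctx, std::vector<llama_token> & tokens, int32_t n_batch) {
    for (int32_t i = 0; i < (int32_t) tokens.size(); i += n_batch) {
        const int32_t n_eval = std::min(n_batch, (int32_t) tokens.size() - i);
        llama_batch batch = llama_batch_get_one(tokens.data() + i, n_eval);
        const int ret = llama_decode(ctx, batch);
        if (ret != 0) {
            return ret; // e.g. 1 when no KV slot could be found for the chunk
        }
    }
    return 0;
}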
Part E: Parallel Sequence Decoding
The seq_id mechanism allows multiple sequences to share a prompt prefix and then decode independently. This is the core principle behind the --parallel N parameter.
seq_id Prefix Sharing Mechanism
During the prefill phase, a single token can belong to multiple sequences simultaneously. For example, marking the tokens of “Hello world” with seq_id=[{0,1,2}] means all three sequences share the same KV cache data. During the decode phase, each sequence generates different tokens and writes to its own independent KV slots.
Prefill vs Decode Phase Comparison
Prefill (shared prompt): add the tokens of “Hello world” with pos=[0,1] and seq_id=[{0,1,2}]. All 3 sequences share the same KV entries.
Decode (independent generation):
- Sequence 0 generates “foo”, pos=2, seq_id=[{0}]
- Sequence 1 generates “bar”, pos=2, seq_id=[{1}]
- Sequence 2 generates “baz”, pos=2, seq_id=[{2}]
Each sequence writes to its own independent KV slot.
batched.cpp Code Example
The concrete code flow (referencing examples/batched/batched.cpp):
// Prefill phase: all sequences share the same prompt
std::vector<llama_seq_id> all_seqs = {0, 1, 2};
for (size_t i = 0; i < prompt_tokens.size(); ++i) {
    common_batch_add(batch, prompt_tokens[i], i, all_seqs, false);
}

// The last token needs logits output (for the first sampling step)
batch.logits[batch.n_tokens - 1] = true;

llama_decode(ctx, batch);

// Each sequence first samples from the logits of the shared last prompt token,
// then from its own token's logits in the previous decode batch
std::vector<int32_t> i_batch(n_parallel, batch.n_tokens - 1);

// Decode phase: each sequence generates independently
int n_cur = batch.n_tokens;
while (n_cur <= n_predict) {
    common_batch_clear(batch);

    for (int i = 0; i < n_parallel; ++i) {
        llama_token new_token = llama_sampler_sample(smpl, ctx, i_batch[i]);
        i_batch[i] = batch.n_tokens; // remember where this sequence's logits will be
        common_batch_add(batch, new_token, n_cur, { i }, true);
    }

    llama_decode(ctx, batch);
    n_cur++;
}
Key point: during prefill, a single token belongs to sequences {0, 1, 2} — the KV cache stores only one copy, but all three sequences can access it. During decoding, each sequence generates different tokens and writes to its own independent KV slot.
Part F: The llama_decode Main Loop
With the batch/ubatch structure understood, let’s see how llama_decode() drives the entire flow:
// src/llama-context.cpp (simplified)
int llama_context::decode(const llama_batch & batch_inp) {
    // 1. Validate and auto-fill missing fields (pos, seq_id, logits)
    balloc->init(batch_inp, vocab, memory.get(), ...);

    // 2. Have the memory module split the batch into ubatches and reserve KV cache slots.
    //    If the KV cache is full, attempt optimization (defragmentation) and retry;
    //    if that still fails, return error code 1 (prompting the caller to do a context shift).
    mctx = memory->init_batch(*balloc, cparams.n_ubatch, output_all);

    // 3. Pre-allocate the output buffer
    output_reserve(n_outputs_all);

    // 4. Process each ubatch
    do {
        const auto & ubatch = mctx->get_ubatch();

        // Build compute graph -> allocate intermediate tensors -> set inputs -> execute computation
        const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx, status);

        // Extract logits and embeddings to the output buffer
        ggml_backend_tensor_get_async(..., res->get_logits(), ...);

        n_outputs_prev += n_outputs;
    } while (mctx->next_ubatch()); // advance to the next ubatch

    return 0;
}
The process_ubatch() here is the complete execution flow for a single micro-batch — it encompasses build graph, alloc, set inputs, and compute steps. We will cover these in detail in subsequent chapters.
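From the caller's perspective, the outcomes described above surface as the return value of llama_decode(). A sketch of typical handling, following the return-code documentation in llama.h (ctx and batch are assumed to be set up as in the earlier examples):

const int ret = llama_decode(ctx, batch);
if (ret == 0) {
    // success: logits for the flagged tokens are available via llama_get_logits_ith()
} else if (ret > 0) {
    // no KV slot found for the batch: free up context (e.g. remove old tokens from
    // the KV cache / perform a context shift) or shrink the batch, then retry
} else {
    // ret < 0: fatal error during graph build or compute
}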
Summary
The design goal of the two-level batching mechanism is clear:
| Level | Parameter | Responsibility |
|---|---|---|
| Batch | --batch-size | User-facing logical unit — determines how many tokens can be submitted at once |
| Ubatch | --ubatch-size | Hardware-facing physical unit — determines how many tokens the GPU processes at once |
Through the combination of seq_id and splitting strategies, llama.cpp achieves flexible batch processing: single-sequence prefill uses split_simple for sequential chunking, multi-sequence parallel processing uses split_equal for balanced allocation, and independent decoding uses split_seq for per-sequence isolation. This mechanism keeps the interface simple while giving the internal scheduler sufficient flexibility.
Next up: #5 Compute Graph Construction & Architecture Dispatch will dive into how llama.cpp builds compute graphs for 125 different architectures, and how the graph reuse mechanism avoids redundant construction.