Batch, Ubatch & the Decoding Main Loop
Updated 2026-04-15
Series context: This is article #4 in the llama.cpp source code deep-dive series, covering how tokens are organized from user submission to actual computation: batch, ubatch, and the decode main loop. If you haven't read the series overview and #3 Warmup, Tokenization & Chat Template, we recommend starting there to build the big picture before diving into this chapter.
After tokenization, we have a token sequence. But before feeding it into the model for computation, these tokens need to be organized into batches. llama.cpp employs a two-level batching mechanism: a user-submitted logical batch and an internal micro-batch (ubatch) used for actual computation.
Part A: llama_batch — The User-Facing Interface
llama_batch is the core data structure for user interaction with llama.cpp, defined in the public header include/llama.h:
typedef struct llama_batch {
    int32_t n_tokens;

    llama_token  *  token;    // token ID array
    float        *  embd;     // or provide embeddings directly (mutually exclusive with token)
    llama_pos    *  pos;      // position of each token in its sequence
    int32_t      *  n_seq_id; // number of sequences each token belongs to
    llama_seq_id ** seq_id;   // list of sequence IDs for each token
    int8_t       *  logits;   // which tokens need logits output
} llama_batch;
Field descriptions:
| Field | Description |
|---|---|
| token | Token ID array, size n_tokens. Mutually exclusive with embd |
| embd | Embedding vectors provided directly, size n_tokens * n_embd. Used for external-encoder scenarios |
| pos | Position of each token within its sequence. Can be NULL (auto-increment) |
| n_seq_id / seq_id | Each token can belong to multiple sequences simultaneously, enabling prefix sharing |
| logits | Marks which tokens need logits computed (sampling only needs the last token) |
llama.cpp provides two convenience functions for creating batches:
// Simplest usage: a contiguous token sequence, auto-fills pos and seq_id
struct llama_batch llama_batch_get_one(llama_token * tokens, int32_t n_tokens);
// Allocate an empty batch that can hold n_tokens, requires manual population
struct llama_batch llama_batch_init(int32_t n_tokens, int32_t embd, int32_t n_seq_max);
llama_batch_get_one() suits the simplest scenario: given a set of contiguous tokens, it auto-fills position and sequence information. llama_batch_init() allocates an empty batch for the caller to populate manually — essential for advanced scenarios like parallel sequence decoding.
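For the manual path, the sketch below (not llama.cpp code; common/common.h ships an equivalent helper, common_batch_add(), which appears in the example in Part E) shows what populating the fields of a llama_batch_init()-allocated batch looks like. The helper name batch_add_token and the sizes are illustrative; ctx/model setup is assumed elsewhere:

#include "llama.h"

// Append one token to a batch allocated with llama_batch_init().
// This mirrors what the common_batch_add() helper in common/common.h does.
static void batch_add_token(llama_batch & batch, llama_token id, llama_pos pos,
                            llama_seq_id seq, bool want_logits) {
    const int32_t i = batch.n_tokens;
    batch.token   [i]    = id;
    batch.pos     [i]    = pos;
    batch.n_seq_id[i]    = 1;                 // this token belongs to exactly one sequence
    batch.seq_id  [i][0] = seq;
    batch.logits  [i]    = want_logits ? 1 : 0;
    batch.n_tokens++;
}

// Usage sketch: room for 64 tokens, token IDs only (embd = 0), up to 4 sequences.
// llama_batch batch = llama_batch_init(64, /*embd=*/0, /*n_seq_max=*/4);
// batch_add_token(batch, some_token, /*pos=*/0, /*seq=*/0, /*want_logits=*/false);
// ...
// llama_decode(ctx, batch);
// llama_batch_free(batch);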
Part B: llama_ubatch — The Internal Micro-Batch
A user-submitted batch can be very large (e.g., a 2048-token prompt) and cannot be sent to the GPU all at once. Internally, llama.cpp splits it into smaller ubatches (micro-batches), defined in src/llama-batch.h:
struct llama_ubatch {
    uint32_t n_tokens;     // total tokens = n_seq_tokens * n_seqs
    uint32_t n_seq_tokens; // tokens per sequence set
    uint32_t n_seqs;       // number of sequence sets
    uint32_t n_seqs_unq;   // number of unique sequence IDs

    llama_token  *  token;      // [n_tokens]
    llama_pos    *  pos;        // [n_tokens]
    int32_t      *  n_seq_id;   // [n_tokens]
    llama_seq_id ** seq_id;     // [n_tokens]
    llama_seq_id *  seq_id_unq; // [n_seqs_unq] unique sequence IDs
    int32_t      *  seq_idx;    // sequence ID -> index mapping within the ubatch
    int8_t       *  output;     // [n_tokens] output flags
    // ...
};
The key difference from llama_batch is that a ubatch explicitly tracks the structure of its sequence sets: n_seq_tokens and n_seqs define the “how many sequences × how many tokens per sequence” matrix shape, which is critical for the subsequent attention computation and KV cache management.
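As a purely illustrative check (not llama.cpp code), the shape invariant stated in the struct comment can be written out directly:

#include <cassert>
#include <cstdint>

// Invariant from the llama_ubatch comment above: total tokens = n_seq_tokens * n_seqs.
static void check_ubatch_shape(uint32_t n_seqs, uint32_t n_seq_tokens, uint32_t n_tokens) {
    assert(n_tokens == n_seqs * n_seq_tokens);
}

// e.g. a ubatch covering 3 sequence sets of 4 tokens each must report n_tokens == 12:
// check_ubatch_shape(/*n_seqs=*/3, /*n_seq_tokens=*/4, /*n_tokens=*/12);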
Two-Level Batch Parameters
These two levels are controlled by two command-line parameters:
--batch-size N Logical batch size (default 2048) — max tokens a user can submit at once
--ubatch-size N Physical batch size (default 512) — max tokens the GPU processes at once
They correspond to n_batch and n_ubatch in llama_context_params:
// common/common.h defaults
int32_t n_batch = 2048; // logical max batch
int32_t n_ubatch = 512; // physical max batch
Key constraint: n_ubatch <= n_batch. n_batch determines how many new tokens the KV cache can accept at once, while n_ubatch determines the GPU’s actual computation granularity. Reducing n_ubatch lowers peak GPU memory usage but increases the number of splits (more ubatches = more kernel launch overhead).
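As a minimal sketch of where these values end up (using the context-creation names from the public llama.h API; exact function names can differ across versions, and model loading is assumed to happen elsewhere):

#include "llama.h"

// Assumes `model` was loaded earlier, e.g. from a GGUF file.
llama_context_params cparams = llama_context_default_params();
cparams.n_ctx    = 4096;
cparams.n_batch  = 2048; // logical: max tokens accepted by a single llama_decode() call
cparams.n_ubatch = 512;  // physical: max tokens per GPU pass; keep <= n_batch
llama_context * ctx = llama_init_from_model(model, cparams);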
Part C: Three Splitting Algorithms
The llama_batch_allocr class is responsible for splitting a batch into ubatches. It provides three splitting strategies:
1. split_simple(n_ubatch) — Sequential Splitting
The most straightforward approach: take tokens sequentially from start to end, up to n_ubatch tokens at a time. Ignores sequence boundaries.
Use case: single-sequence prompt prefill. For example, a 1000-token prompt with n_ubatch=512 splits into two ubatches (512 + 488).
2. split_equal(n_ubatch, sequential) — Equal-Length Splitting
Ensures that each sequence set within the same ubatch has an equal number of tokens. The algorithm first selects non-overlapping sequence sets, then takes the same number of tokens from each set until the total approaches the n_ubatch limit or a sequence is exhausted.
Use case: multi-sequence parallel prefill, where balanced progress across sequences is needed.
3. split_seq(n_ubatch) — Per-Sequence Splitting
Each ubatch contains tokens from only one sequence set. Different sequences are never mixed into the same ubatch.
Use case: autoregressive decoding phase, where each sequence generates independently.
Strategy Selection
The choice of splitting strategy occurs in the KV cache’s init_batch():
// src/llama-kv-cache.cpp
llama_memory_context_ptr llama_kv_cache::init_batch(
        llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all) {
    balloc.split_reset();

    std::vector<llama_ubatch> ubatches;
    while (true) {
        // Single-stream mode uses split_simple, multi-stream uses split_equal
        auto ubatch = n_stream == 1
            ? balloc.split_simple(n_ubatch)
            : balloc.split_equal(n_ubatch, true);

        if (ubatch.n_tokens == 0) {
            break;
        }

        ubatches.push_back(std::move(ubatch));
    }

    // Reserve KV cache slots for each ubatch
    auto sinfos = prepare(ubatches);
    // ...
}
The logic is clean: single-sequence streams (n_stream == 1) use split_simple, multi-sequence streams use split_equal. split_seq is used in other specific code paths.
Part D: Prompt Prefill Chunking Example
Suppose the user submits a 1500-token prompt with n_batch=2048 and n_ubatch=512. The entire prompt fits in a single logical batch, which is split internally into three ubatches of 512 + 512 + 476 tokens.
Each ubatch independently goes through the full build graph, alloc, and compute flow, but they share the same KV cache — KV data written by earlier ubatches is visible to the attention computation of later ubatches.
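When a prompt exceeds n_batch, the caller additionally has to chunk it before calling llama_decode(). A minimal sketch of that outer loop (similar in spirit to what the bundled examples do; ctx and tokens are assumed to be prepared elsewhere):

#include "llama.h"
#include <algorithm>
#include <vector>

// Feed a long prompt in n_batch-sized chunks; inside llama_decode() each chunk
// is further split into n_ubatch-sized ubatches as described above.
static int prefill(llama_context * ctx, std::vector<llama_token> & tokens, int32_t n_batch) {
    for (int32_t i = 0; i < (int32_t) tokens.size(); i += n_batch) {
        const int32_t n_eval = std::min(n_batch, (int32_t) tokens.size() - i);
        llama_batch batch = llama_batch_get_one(tokens.data() + i, n_eval);
        const int ret = llama_decode(ctx, batch);
        if (ret != 0) {
            return ret; // e.g. 1 when no KV slot could be found for the chunk
        }
    }
    return 0;
}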
Part E: Parallel Sequence Decoding
The seq_id mechanism allows multiple sequences to share a prompt prefix and then decode independently. This is the core principle behind the --parallel N parameter.
seq_id Prefix Sharing Mechanism
During the prefill phase, a single token can belong to multiple sequences simultaneously. For example, marking the tokens of “Hello world” with seq_id=[{0,1,2}] means all three sequences share the same KV cache data. During the decode phase, each sequence generates different tokens and writes to its own independent KV slots.
Prefill vs Decode Phase Comparison
Prefill (shared prompt): add the tokens of “Hello world” with pos=[0,1] and seq_id=[{0,1,2}]. All 3 sequences share the same KV entries.
Decode (independent generation):
- Sequence 0 generates “foo”, pos=2, seq_id=[{0}]
- Sequence 1 generates “bar”, pos=2, seq_id=[{1}]
- Sequence 2 generates “baz”, pos=2, seq_id=[{2}]
Each sequence writes to its own independent KV slot.
batched.cpp Code Example
The concrete code flow (referencing examples/batched/batched.cpp):
// Prefill phase: all sequences share the same prompt
std::vector<llama_seq_id> all_seqs = {0, 1, 2};
for (size_t i = 0; i < prompt_tokens.size(); ++i) {
    common_batch_add(batch, prompt_tokens[i], i, all_seqs, false);
}

// The last token needs logits output (for the first sampling step)
batch.logits[batch.n_tokens - 1] = true;

llama_decode(ctx, batch);

// Each sequence first samples from the logits of the shared last prompt token,
// then from its own token's logits in the previous decode batch
std::vector<int32_t> i_batch(n_parallel, batch.n_tokens - 1);

// Decode phase: each sequence generates independently
int n_cur = batch.n_tokens;
while (n_cur <= n_predict) {
    common_batch_clear(batch);

    for (int i = 0; i < n_parallel; ++i) {
        llama_token new_token = llama_sampler_sample(smpl, ctx, i_batch[i]);
        i_batch[i] = batch.n_tokens; // remember where this sequence's logits will be
        common_batch_add(batch, new_token, n_cur, { i }, true);
    }

    llama_decode(ctx, batch);
    n_cur++;
}
Key point: during prefill, a single token belongs to sequences {0, 1, 2} — the KV cache stores only one copy, but all three sequences can access it. During decoding, each sequence generates different tokens and writes to its own independent KV slot.
Part F: The llama_decode Main Loop
With the batch/ubatch structure understood, let’s see how llama_decode() drives the entire flow:
// src/llama-context.cpp (simplified)
int llama_context::decode(const llama_batch & batch_inp) {
    // 1. Validate and auto-fill missing fields (pos, seq_id, logits)
    balloc->init(batch_inp, vocab, memory.get(), ...);

    // 2. Have the memory module split the batch into ubatches and reserve KV cache slots.
    //    If the KV cache is full, attempt optimization (defragmentation) and retry;
    //    if that still fails, return error code 1 (prompting the caller to do a context shift).
    mctx = memory->init_batch(*balloc, cparams.n_ubatch, output_all);

    // 3. Pre-allocate the output buffer
    output_reserve(n_outputs_all);

    // 4. Process each ubatch
    do {
        const auto & ubatch = mctx->get_ubatch();

        // Build compute graph -> allocate intermediate tensors -> set inputs -> execute computation
        const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx, status);

        // Extract logits and embeddings to the output buffer
        ggml_backend_tensor_get_async(..., res->get_logits(), ...);

        n_outputs_prev += n_outputs;
    } while (mctx->next_ubatch()); // advance to the next ubatch

    return 0;
}
The process_ubatch() here is the complete execution flow for a single micro-batch — it encompasses build graph, alloc, set inputs, and compute steps. We will cover these in detail in subsequent chapters.
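From the caller's perspective, the outcomes described above surface as the return value of llama_decode(). A sketch of typical handling, following the return-code documentation in llama.h (ctx and batch are assumed to be set up as in the earlier examples):

const int ret = llama_decode(ctx, batch);
if (ret == 0) {
    // success: logits for the flagged tokens are available via llama_get_logits_ith()
} else if (ret > 0) {
    // no KV slot found for the batch: free up context (e.g. remove old tokens from
    // the KV cache / perform a context shift) or shrink the batch, then retry
} else {
    // ret < 0: fatal error during graph build or compute
}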
Summary
The design goal of the two-level batching mechanism is clear:
| Level | Parameter | Responsibility |
|---|---|---|
| Batch | --batch-size | User-facing logical unit — determines how many tokens can be submitted at once |
| Ubatch | --ubatch-size | Hardware-facing physical unit — determines how many tokens the GPU processes at once |
Through the combination of seq_id and splitting strategies, llama.cpp achieves flexible batch processing: single-sequence prefill uses split_simple for sequential chunking, multi-sequence parallel processing uses split_equal for balanced allocation, and independent decoding uses split_seq for per-sequence isolation. This mechanism keeps the interface simple while giving the internal scheduler sufficient flexibility.
Next up: #5 Compute Graph Construction & Architecture Dispatch will dive into how llama.cpp builds compute graphs for 125 different architectures, and how the graph reuse mechanism avoids redundant construction.