Content on this site is AI-generated and may contain errors. If you find issues, please report them via GitHub Issues.

Batch, Ubatch & the Decoding Main Loop

Updated 2026-04-15

Series context: This is article #4 in the llama.cpp source code deep-dive series, covering how tokens are organized from user submission to actual computation: batch, ubatch, and the decode main loop. If you haven't read the series overview and #3 Warmup, Tokenization & Chat Template, we recommend reading those first to build the big picture before diving into this chapter.

After tokenization, we have a token sequence. But before feeding it into the model for computation, these tokens need to be organized into batches. llama.cpp employs a two-level batching mechanism: a user-submitted logical batch and an internal micro-batch (ubatch) used for actual computation.


Part A: llama_batch — The User-Facing Interface

llama_batch is the core data structure for user interaction with llama.cpp, defined in the public header include/llama.h:

typedef struct llama_batch {
    int32_t n_tokens;

    llama_token  *  token;    // token ID array
    float        *  embd;     // or provide embeddings directly (mutually exclusive with token)
    llama_pos    *  pos;      // position of each token in its sequence
    int32_t      *  n_seq_id; // number of sequences each token belongs to
    llama_seq_id ** seq_id;   // list of sequence IDs for each token
    int8_t       *  logits;   // which tokens need logits output
} llama_batch;

Field descriptions:

  • token: token ID array, size n_tokens; mutually exclusive with embd
  • embd: embedding vectors provided directly, size n_tokens * n_embd; used for external-encoder scenarios
  • pos: position of each token within its sequence; can be NULL (positions then auto-increment)
  • n_seq_id / seq_id: each token can belong to multiple sequences simultaneously, enabling prefix sharing
  • logits: marks which tokens need logits computed (sampling only needs the last token)

llama.cpp provides two convenience functions for creating batches:

// Simplest usage: a contiguous token sequence, auto-fills pos and seq_id
struct llama_batch llama_batch_get_one(llama_token * tokens, int32_t n_tokens);

// Allocate an empty batch with capacity n_tokens; the caller populates it.
// If embd != 0, the embd buffer (n_tokens * embd floats) is allocated instead
// of token; n_seq_max is the max number of sequence IDs per token.
struct llama_batch llama_batch_init(int32_t n_tokens, int32_t embd, int32_t n_seq_max);

llama_batch_get_one() suits the simplest scenario: given a set of contiguous tokens, it auto-fills position and sequence information. llama_batch_init() allocates an empty batch for the caller to populate manually — essential for advanced scenarios like parallel sequence decoding.


Part B: llama_ubatch — The Internal Micro-Batch

A user-submitted batch can be very large (e.g., a 2048-token prompt) and cannot be sent to the GPU all at once. Internally, llama.cpp splits it into smaller ubatches (micro-batches), defined in src/llama-batch.h:

struct llama_ubatch {
    uint32_t n_tokens;     // total tokens = n_seq_tokens * n_seqs
    uint32_t n_seq_tokens; // tokens per sequence set
    uint32_t n_seqs;       // number of sequence sets
    uint32_t n_seqs_unq;   // number of unique sequence IDs

    llama_token  *  token;      // [n_tokens]
    llama_pos    *  pos;        // [n_tokens]
    int32_t      *  n_seq_id;   // [n_tokens]
    llama_seq_id ** seq_id;     // [n_tokens]
    llama_seq_id *  seq_id_unq; // [n_seqs_unq] unique sequence IDs
    int32_t      *  seq_idx;    // sequence ID -> index mapping within the ubatch
    int8_t       *  output;     // [n_tokens] output flags
    // ...
};

The key difference between ubatch and batch is that ubatch tracks sequence set structural information. n_seq_tokens and n_seqs explicitly define the “how many sequences, how many tokens per sequence” matrix shape, which is critical for subsequent attention computation and KV cache management.

Two-Level Batch Parameters

These two levels are controlled by two command-line parameters:

--batch-size  N   Logical batch size (default 2048) — max tokens a user can submit at once
--ubatch-size N   Physical batch size (default 512)  — max tokens the GPU processes at once

They correspond to n_batch and n_ubatch in llama_context_params:

// common/common.h defaults
int32_t n_batch  = 2048;  // logical max batch
int32_t n_ubatch =  512;  // physical max batch

Key constraint: n_ubatch <= n_batch. n_batch caps how many tokens a single llama_decode() call may contain, while n_ubatch sets the GPU's actual computation granularity. Reducing n_ubatch lowers peak GPU memory usage but increases the number of splits (more ubatches means more kernel-launch overhead).


Part C: Three Splitting Algorithms

The llama_batch_allocr class is responsible for splitting a batch into ubatches. It provides three splitting strategies:

Batch-to-ubatch splitting paths:

llama_batch (user-submitted) → llama_batch_allocr → one of split_simple() / split_equal() / split_seq() → ubatch 0, 1, 2, ...

1. split_simple(n_ubatch) — Sequential Splitting

The most straightforward approach: take tokens sequentially from start to end, up to n_ubatch tokens at a time. Ignores sequence boundaries.

Use case: single-sequence prompt prefill. For example, a 1000-token prompt with n_ubatch=512 splits into two ubatches (512 + 488).

2. split_equal(n_ubatch, sequential) — Equal-Length Splitting

Ensures that each sequence set within the same ubatch has an equal number of tokens. The algorithm first selects non-overlapping sequence sets, then takes the same number of tokens from each set until the total approaches the n_ubatch limit or a sequence is exhausted.

Use case: multi-sequence parallel prefill, where balanced progress across sequences is needed.

3. split_seq(n_ubatch) — Per-Sequence Splitting

Each ubatch contains tokens from only one sequence set. Different sequences are never mixed into the same ubatch.

Use case: autoregressive decoding phase, where each sequence generates independently.

Strategy Selection

The choice of splitting strategy occurs in the KV cache’s init_batch():

// src/llama-kv-cache.cpp
llama_memory_context_ptr llama_kv_cache::init_batch(
        llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all) {
    balloc.split_reset();

    std::vector<llama_ubatch> ubatches;
    while (true) {
        // Single-stream mode uses split_simple, multi-stream uses split_equal
        auto ubatch = n_stream == 1
            ? balloc.split_simple(n_ubatch)
            : balloc.split_equal(n_ubatch, true);

        if (ubatch.n_tokens == 0) break;
        ubatches.push_back(std::move(ubatch));
    }

    // Reserve KV cache slots for each ubatch
    auto sinfos = prepare(ubatches);
    // ...
}

The logic is clean: a single KV stream (n_stream == 1) uses split_simple, multiple streams use split_equal. split_seq appears in other code paths, for example the recurrent-state memory used by Mamba-style models.

Interactive Demo

The original page embeds an interactive visualization of the three splitting algorithms; only its default output survives here. For a single 1500-token sequence with n_ubatch=512, split_simple takes tokens front-to-back, ignoring sequence boundaries, and produces 3 ubatches:

  • ubatch 0: 512 tokens (seq 0, positions 0..511)
  • ubatch 1: 512 tokens (seq 0, positions 512..1023)
  • ubatch 2: 476 tokens (seq 0, positions 1024..1499)

Part D: Prompt Prefill Chunking Example

Suppose the user submits a 1500-token prompt with n_batch=2048 and n_ubatch=512:

Prefill chunking flow for a 1500-token prompt:

User prompt (1500 tokens) → llama_batch (n_tokens=1500) → ubatch 0 (tokens 0..511), ubatch 1 (tokens 512..1023), ubatch 2 (tokens 1024..1499) → process_ubatch() per ubatch → KV cache pos 0-511 / 512-1023 / 1024-1499

Each ubatch independently goes through the full build graph, alloc, and compute flow, but they share the same KV cache — KV data written by earlier ubatches is visible to the attention computation of later ubatches.


Part E: Parallel Sequence Decoding

The seq_id mechanism allows multiple sequences to share a prompt prefix and then decode independently. This is the core principle behind the --parallel N parameter.

seq_id Prefix Sharing Mechanism

During the prefill phase, a single token can belong to multiple sequences simultaneously. For example, marking the tokens of “Hello world” with seq_id=[{0,1,2}] means all three sequences share the same KV cache data. During the decode phase, each sequence generates different tokens and writes to its own independent KV slots.

Prefill vs Decode Phase Comparison

Prefill (shared prompt): Add tokens “Hello world”, pos=[0,1], seq_id=[{0,1,2}]. All 3 sequences share the same KV entries.

Decode (independent generation):

  • Sequence 0 generates “foo”, pos=2, seq_id=[{0}]
  • Sequence 1 generates “bar”, pos=2, seq_id=[{1}]
  • Sequence 2 generates “baz”, pos=2, seq_id=[{2}]

Each sequence writes to its own independent KV slot.

batched.cpp Code Example

The concrete code flow (referencing examples/batched/batched.cpp):

// Prefill phase: all sequences share the same prompt
std::vector<llama_seq_id> all_seqs = {0, 1, 2};
for (size_t i = 0; i < prompt_tokens.size(); ++i) {
    common_batch_add(batch, prompt_tokens[i], i, all_seqs, false);
}
// The last token needs logits output (for the first sampling step)
batch.logits[batch.n_tokens - 1] = true;
llama_decode(ctx, batch);

// Decode phase: each sequence generates independently
while (n_cur <= n_predict) {
    common_batch_clear(batch);
    for (int i = 0; i < n_parallel; ++i) {
        // sample from the logits at output index i (simplified; the real
        // example tracks each sequence's logits index via an i_batch array)
        llama_token new_token = llama_sampler_sample(smpl, ctx, i);
        common_batch_add(batch, new_token, n_cur, { i }, true);
    }
    llama_decode(ctx, batch);
    n_cur++;
}

Key point: during prefill, a single token belongs to sequences {0, 1, 2} — the KV cache stores only one copy, but all three sequences can access it. During decoding, each sequence generates different tokens and writes to its own independent KV slot.


Part F: The llama_decode Main Loop

With the batch/ubatch structure understood, let’s see how llama_decode() drives the entire flow:

// src/llama-context.cpp (simplified)
int llama_context::decode(const llama_batch & batch_inp) {
    // 1. Validate and auto-fill missing fields (pos, seq_id, logits)
    balloc->init(batch_inp, vocab, memory.get(), ...);

    // 2. Have the memory module split the batch into ubatches and reserve KV cache slots
    mctx = memory->init_batch(*balloc, cparams.n_ubatch, output_all);
    //    If KV cache is full, attempt optimization (defragmentation) and retry
    //    If still fails, return error code 1 (prompting the caller to do context shift)

    // 3. Pre-allocate output buffer
    output_reserve(n_outputs_all);

    // 4. Process each ubatch
    do {
        const auto & ubatch = mctx->get_ubatch();

        // Build compute graph -> allocate intermediate tensors -> set inputs -> execute computation
        const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx, status);

        // Extract logits and embeddings to the output buffer
        ggml_backend_tensor_get_async(..., res->get_logits(), ...);

        n_outputs_prev += n_outputs;
    } while (mctx->next_ubatch());  // next ubatch

    return 0;
}

The process_ubatch() here is the complete execution flow for a single micro-batch — it encompasses build graph, alloc, set inputs, and compute steps. We will cover these in detail in subsequent chapters.


Summary

The design goal of the two-level batching mechanism is clear:

  • Batch (--batch-size): user-facing logical unit; determines how many tokens can be submitted at once
  • Ubatch (--ubatch-size): hardware-facing physical unit; determines how many tokens the GPU processes at once

Through the combination of seq_id and splitting strategies, llama.cpp achieves flexible batch processing: single-sequence prefill uses split_simple for sequential chunking, multi-sequence parallel processing uses split_equal for balanced allocation, and independent decoding uses split_seq for per-sequence isolation. This mechanism keeps the interface simple while giving the internal scheduler sufficient flexibility.

Next up: #5 Compute Graph Construction & Architecture Dispatch will dive into how llama.cpp builds compute graphs for 125 different architectures, and how the graph reuse mechanism avoids redundant construction.