
Warmup, Tokenization & Chat Template


Updated 2026-04-15

Series context: This is article #3 of the llama.cpp Source Code Deep Dive series, covering the three preparatory stages before model inference: Warmup, Tokenization, and Chat Template rendering. If you haven’t read the Series Overview and #2 Model Loading: From File to Device, we recommend starting there to build the big picture before diving into this chapter.

Part A: Warmup Mechanism

After backend initialization is complete, llama.cpp performs a warmup — a single full forward pass with a minimal number of tokens. The goal is not to generate meaningful text, but to let the entire pipeline complete its first resource allocation.

What Warmup Does

// common/common.cpp — common_init_from_params()
if (params.warmup) {
    LOG_WRN("warming up the model with an empty run - please wait ...\n");

    llama_set_warmup(lctx, true);

    // Construct a minimal batch: just BOS + EOS tokens
    std::vector<llama_token> tmp;
    llama_token bos = llama_vocab_bos(vocab);
    llama_token eos = llama_vocab_eos(vocab);
    if (bos != LLAMA_TOKEN_NULL) tmp.push_back(bos);
    if (eos != LLAMA_TOKEN_NULL) tmp.push_back(eos);
    if (tmp.empty()) tmp.push_back(0);

    // Encoder-decoder models (e.g., T5) need to encode first, then decode
    if (llama_model_has_encoder(model)) {
        llama_encode(lctx, llama_batch_get_one(tmp.data(), tmp.size()));
        // ...
    }
    if (llama_model_has_decoder(model)) {
        llama_decode(lctx, llama_batch_get_one(tmp.data(), tmp.size()));
    }

    // Clear KV cache produced during warmup
    llama_memory_clear(llama_get_memory(lctx), true);
    llama_synchronize(lctx);
    llama_perf_context_reset(lctx);

    llama_set_warmup(lctx, false);
    // Reset sampler RNG state to ensure reproducible seeds during actual inference
    res->reset_samplers();
}

The logic is straightforward: construct a minimal batch containing only BOS and EOS, run through the full encode/decode pipeline, then clear all state produced during warmup to ensure actual inference starts from a clean slate.

Why Warmup Is Needed

  1. GPU memory pre-allocation: The first call to llama_decode() triggers the backend scheduler to allocate GPU buffers for intermediate tensors. Without warmup, the user’s first prompt would incur the additional latency of these allocations.

  2. Kernel compilation: Some backends (e.g., Vulkan, Metal) need to compile shaders/kernels on the first execution of a given op. Warmup moves this latency to the initialization phase.

  3. KV cache initialization: Ensures KV cache memory is allocated and mapped to the correct device.

Encoder-Decoder Model Handling

Note the branching in the code: if the model has both an encoder and a decoder (e.g., T5), warmup calls llama_encode() first and then llama_decode(), ensuring both sides of the pipeline are warmed up. For pure decoder-only models (GPT, LLaMA), only a single llama_decode() call is needed.

--no-warmup Use Cases

Warmup can be disabled with the --no-warmup flag. Benchmarking scenarios typically disable it so that the measured time-to-first-token reflects a true cold start, including the one-time allocation and compilation costs described above.
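
Programmatically, the flag just toggles a field on common_params. A minimal sketch (assuming the rest of params has been filled in, e.g. by common_params_parse()):

// Programmatic equivalent of passing --no-warmup on the command line
common_params params;
// ... fill in model path and other settings ...
params.warmup = false;  // skip the warmup branch shown in Part A

common_init_result res = common_init_from_params(params);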


Part B: Tokenization

After model loading and warmup, the system is ready to process user input. The first step is converting text into token IDs.

Tokenization Pipeline

User Text "Hello, world!"
  → llama_tokenize()
  → vocab->tokenize()
  → BPE / SentencePiece
  → Token IDs [15496, 11, 1917, 0]

llama_tokenize() is a simple proxy function:

// llama-vocab.cpp
int32_t llama_tokenize(
    const llama_vocab * vocab,
    const char * text, int32_t text_len,
    llama_token * tokens, int32_t n_tokens_max,
    bool add_special, bool parse_special)
{
    return vocab->tokenize(text, text_len, tokens, n_tokens_max, add_special, parse_special);
}

Two key parameters, both exercised in the sketch below:

  • add_special: Whether to automatically add special tokens such as BOS (Beginning of Sequence)
  • parse_special: Whether to parse special token markers embedded in the text (e.g., <|im_start|>)
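
A common calling pattern is the two-call idiom: when the output buffer is too small, llama_tokenize() returns the negative of the required token count. A minimal sketch (assuming vocab comes from llama_model_get_vocab() on a loaded model):

// First call with an empty buffer: the negative return value tells us
// how many tokens the text needs
const std::string text = "Hello, world!";
int32_t n = llama_tokenize(vocab, text.c_str(), text.size(),
                           nullptr, 0,
                           /*add_special=*/true, /*parse_special=*/false);

// Second call with a correctly sized buffer
std::vector<llama_token> tokens(-n);
llama_tokenize(vocab, text.c_str(), text.size(),
               tokens.data(), tokens.size(),
               /*add_special=*/true, /*parse_special=*/false);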

Tokenizer Implementation Is Model-Determined

The specific tokenizer implementation (BPE, SentencePiece, etc.) is determined by the model. This information is stored in the GGUF KV metadata. llama.cpp reads the tokenizer.ggml.model field during model loading to determine which tokenizer to use.
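
Some representative values of tokenizer.ggml.model and the tokenizer each selects (the model-family attributions are illustrative, not exhaustive):

tokenizer.ggml.model = "llama"   // SentencePiece (e.g., LLaMA, Mistral)
tokenizer.ggml.model = "gpt2"    // byte-level BPE (e.g., GPT-2, Qwen)
tokenizer.ggml.model = "bert"    // WordPiece (e.g., BERT-family embedding models)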

Reverse Detokenization

The reverse process (token to text) uses llama_token_to_piece() (single token) or llama_detokenize() (batch). In streaming output scenarios, each newly generated token must be detokenized back into a text fragment.
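
A minimal streaming sketch (assuming vocab as above and a freshly sampled new_token). Note that a token does not always end on a UTF-8 character boundary, so robust streaming code buffers partial multi-byte sequences before printing:

// Convert one token back into its UTF-8 text fragment
char buf[128];
int32_t len = llama_token_to_piece(vocab, new_token, buf, sizeof(buf),
                                   /*lstrip=*/0, /*special=*/false);
if (len > 0) {
    fwrite(buf, 1, len, stdout);  // emit the fragment immediately
    fflush(stdout);
}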


Part C: Chat Template

Why You Can’t Just Concatenate Prompts

Different models use different special tokens during training to mark role transitions. If you simply concatenate the system prompt and user prompt:

You are an assistantHello

The model has no way to tell which part is the system instruction, which part is the user input, or where it should start generating its response.

Taking the same conversation as an example, different models expect completely different formats:

ChatML format (Qwen, etc.):

<|im_start|>system
You are an assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant

Llama-3 format:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

These <|im_start|>, <|start_header_id|>, etc. are special tokens — they have dedicated token IDs in the vocabulary, and the model learned during training to “switch roles when these tokens appear.”

For example, the same message list rendered through the ChatML template looks like this, with special tokens marking every role boundary:

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello, please introduce yourself<|im_end|>
<|im_start|>assistant

What Is a Chat Template

A chat template is a Jinja2 template string embedded in the GGUF file’s KV metadata (key: tokenizer.chat_template). Taking the ChatML template as an example:

{%- for message in messages -%}
  {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>\n' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
  {{- '<|im_start|>assistant\n' -}}
{%- endif -%}

This template receives a messages list (each message has a role and content) and outputs the formatted text the model expects.

Loading and Rendering Pipeline

GGUF Model File (tokenizer.chat_template)
  → common_chat_templates_init() reads the template string
  → fallback logic: no template → ChatML
  → common_chat_templates_apply() renders the user messages [{role, content}, ...] with Jinja2
  → formatted prompt text, complete with special tokens

Key Code Path

// chat.cpp — common_chat_templates_init()
const auto * str = llama_model_chat_template(model, nullptr);
if (str) {
    default_template_src = str;  // Read from GGUF
}

// If no template, prefer tool_use template, otherwise fall back to ChatML
if (default_template_src.empty() || default_template_src == "chatml") {
    if (!template_tool_use_src.empty()) {
        default_template_src = template_tool_use_src;
    } else {
        default_template_src = CHATML_TEMPLATE_SRC;
    }
}

common_chat_templates_init() first attempts to read tokenizer.chat_template from the GGUF metadata. If the model doesn’t provide a template (or the template is simply "chatml"), it follows the fallback logic: prefer the tool_use template (if available), otherwise use the built-in ChatML template. This fallback strategy ensures that even if a model doesn’t embed a chat template, llama.cpp can still process conversations in a reasonable format.
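
End to end, rendering a conversation from application code looks roughly like this. A minimal sketch: the types and signatures follow common/chat.h but vary across llama.cpp versions, so treat the exact field names as illustrative:

// Initialize templates from a loaded model (with the fallback logic above)
common_chat_templates_ptr tmpls = common_chat_templates_init(model, /*chat_template_override=*/"");

common_chat_msg sys_msg;
sys_msg.role    = "system";
sys_msg.content = "You are an assistant";

common_chat_msg usr_msg;
usr_msg.role    = "user";
usr_msg.content = "Hello";

common_chat_templates_inputs inputs;
inputs.messages              = { sys_msg, usr_msg };
inputs.add_generation_prompt = true;  // append the trailing assistant header

common_chat_params rendered = common_chat_templates_apply(tmpls.get(), inputs);
// rendered.prompt now holds the formatted text, special tokens included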


Part D: Multimodal Token Injection

For multimodal models (e.g., Gemma3, LLaVA), images and audio need to be converted into special token sequences and injected into the prompt. This is handled by the tools/mtmd/ module.

mtmd_tokenize() Pipeline

Text + Image Path
  → mtmd_tokenize()
      • text portion → regular tokens
      • image markers → <start_of_image> + vision tokens + <end_of_image>
  → merged into mtmd_input_chunks
  → mtmd_encode_chunk(): the vision encoder processes the image
  → final token sequence fed to the LLM

Vision Encoder and mmproj

The vision encoder (based on CLIP) is loaded as a separate model file (mmproj). It encodes images into a set of embedding vectors that occupy specific positions in the token sequence. The LLM treats them as special “vision tokens.”

The benefit of this design is decoupling: the LLM itself doesn’t need to understand raw pixel data from images. It only needs to process the embeddings produced by the vision encoder, just like processing regular token embeddings. The mmproj file can be updated independently of the LLM model, and different vision encoders can be paired with the same LLM.


Summary

This article covered the four preparatory steps before user input reaches the LLM:

  1. Warmup: Run a full inference pass with a minimal batch to pre-allocate GPU memory, compile kernels, and initialize the KV cache
  2. Tokenization: Convert text into token ID sequences via llama_tokenize(), with the tokenizer type determined by GGUF metadata
  3. Chat Template: Use a Jinja2 template to render a messages list into the formatted prompt the model expects, complete with the correct special tokens
  4. Multimodal injection: For vision models, images are converted to embeddings by a vision encoder and injected into the token sequence

Next up: #4 Batch and Ubatch dives into llama.cpp’s batching mechanism — how token sequences are organized into efficient compute batches for the model.