
Tool Landscape and GGUF Binary Parsing

Updated 2026-04-15

Series context: This is article #1 in the llama.cpp source code deep-dive series, focusing on the llama.cpp tool ecosystem and the C++ implementation details of the GGUF format. If you haven’t read the series overview yet, we recommend reading it first to build the big picture before diving into this chapter.

Part A: Tool Landscape

llama.cpp provides several executables under its tools/ directory. The three most important are llama-completion, llama-cli, and llama-bench, each targeting different use cases with distinct architectures.

Architecture Differences

Architecture comparison of llama.cpp's three core tools:

  • llama-completion (legacy tool): completion.cpp is the main entry point; it calls the sampling API (sampling.h) and the low-level C API (llama.h) directly
  • llama-cli (next-gen tool): cli.cpp is the main entry point; it embeds a server (server-context.h) and uses the task/response async pattern (server-task.h) on top of llama.h
  • llama-bench (benchmark): llama-bench.cpp is the main entry point; it uses llama.h directly and only measures pp/tg speed

llama-completion is the original text generation tool (formerly named main). It directly calls the low-level llama.h C API together with the common sampling helpers in sampling.h, manually managing the context, KV cache, and sampling loop. The code structure is straightforward, making it ideal for understanding llama.cpp’s low-level mechanics.

llama-cli is the next-generation interactive chat client. It internally embeds a server instance (pulling in server-context.h and server-task.h) and uses an asynchronous task/response pattern to process requests. It features an ASCII logo, a spinner loading animation, and supports multimodal inputs (images/audio) and speculative decoding.

llama-bench is a pure performance benchmarking tool. It doesn’t generate meaningful text; it only measures prompt processing (pp) and token generation (tg) throughput and outputs structured performance data.

Feature Comparison

| Feature | llama-completion | llama-cli | llama-bench |
|---|---|---|---|
| Text generation / completion | ✅ | ✅ | ❌ |
| Interactive chat | ✅ Basic | ✅ Full | ❌ |
| Prompt Cache | ✅ --prompt-cache | ❌ | ❌ |
| Self-Extend | ✅ --grp-attn-n/w | ❌ | ❌ |
| --in-prefix/suffix | ✅ | ❌ | ❌ |
| Speculative Decoding | ❌ | ✅ --draft | ❌ |
| Multimodal (image/audio) | ❌ | ✅ --image/audio | ❌ |
| Grammar / JSON Schema | ✅ | ✅ | ❌ |
| All sampling parameters | ✅ | ✅ | ❌ |
| Performance measurement (pp/tg t/s) | ❌ | ❌ | ✅ |

Note: --in-prefix/suffix is only registered under LLAMA_EXAMPLE_COMPLETION; --grp-attn-n is registered under LLAMA_EXAMPLE_COMPLETION and LLAMA_EXAMPLE_PASSKEY, while --grp-attn-w is only registered under LLAMA_EXAMPLE_COMPLETION (see common/arg.cpp). Neither llama-cli nor llama-bench supports these flags.

Recommendations

  • Everyday chat -> llama-cli: best UX, latest features, supports multimodal and speculative decoding
  • Fine-grained control / scripted batch processing -> llama-completion: prompt cache, Self-Extend, custom prefix/suffix
  • Hardware / backend performance evaluation -> llama-bench: pure pp and tg throughput testing

Part B: GGUF Field-by-Field Parsing

llama.cpp uses GGUF (GGML Universal File Format) as its model file format. Understanding the GGUF structure is the foundation for understanding all subsequent loading workflows.

Physical File Layout

A GGUF file is laid out sequentially from start to end as follows:

GGUF physical file layout:

  1. Header (fixed fields): magic 'GGUF' (4 bytes) | version: uint32 (v3) | n_tensors: int64 | n_kv: int64
  2. KV Metadata (variable length): key-value pairs such as architecture, chat_template, ...
  3. Tensor Info Array (variable length): per tensor: name + n_dims + ne[] + type + offset
  4. Alignment Padding: padded to the alignment boundary (default 32 bytes)
  5. Tensor Data Blob (multiple GB): binary weight data for all tensors, laid out sequentially by offset

Key takeaway: reading everything before the data blob (header, KV metadata, and tensor info) is enough to obtain every tensor's metadata (name, shape, type); the data blob itself never has to be read. For a 7B model, this metadata section is typically just a few MB, while the data blob is several GB.

Parsing Flow

The GGUF parsing entry point is gguf_init_from_file_ptr() (ggml/src/gguf.cpp), which reads in the following order:

// 1. Validate magic
char magic[4];
read(magic);  // Must be "GGUF"

// 2. Read version number
uint32_t version;
read(version);  // Currently supports v2, v3

// 3. Read tensor count and KV count
int64_t n_tensors, n_kv;
read(n_tensors);
read(n_kv);

// 4. Parse all KV metadata
for (int64_t i = 0; i < n_kv; i++) {
    // Read key (string) + value type + value
}

// 5. Parse all tensor info (does NOT read data!)
for (int64_t i = 0; i < n_tensors; i++) {
    read(name);          // Tensor name
    read(n_dims);        // Number of dimensions
    read(ne[0..3]);      // Element count per dimension (shape)
    read(type);          // Quantization type: Q4_0, Q4_K, F16, etc.
    // Compute stride from type
    nb[0] = ggml_type_size(type);
    nb[1] = nb[0] * (ne[0] / ggml_blck_size(type));
    read(offset);        // Offset within the data blob
}

// 6. Seek to data section start (after alignment padding)
fseek(file, GGML_PAD(current_pos, alignment), SEEK_SET);
ctx->offset = ftell(file);  // Record data section start offset

Note that in step 5, each tensor info entry contains only metadata, not the actual weight data. After parsing all tensor info entries, the entire header is complete. Whether the data blob is subsequently read depends on the no_alloc parameter passed by the caller.
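
To make the no_alloc distinction concrete, here is a minimal sketch (not taken from llama.cpp) that lists tensor metadata without ever touching the data blob. It assumes the public gguf accessors (gguf_init_from_file, gguf_get_n_tensors, gguf_get_tensor_name/type/offset), that their declarations live in gguf.h (older ggml versions exposed them from ggml.h), and uses "model.gguf" as a placeholder path.

// Minimal metadata-only read via the public gguf API (a sketch, not llama.cpp code).
#include <stdio.h>
#include "ggml.h"
#include "gguf.h"   // older ggml versions declare these in ggml.h instead

int main(void) {
    struct gguf_init_params params = {
        /*.no_alloc =*/ true,   // stop after header + KV + tensor info, skip the data blob
        /*.ctx      =*/ NULL,   // no ggml_context needed, we only want the metadata
    };
    struct gguf_context * gguf = gguf_init_from_file("model.gguf", params); // placeholder path
    if (!gguf) {
        return 1;
    }

    for (int64_t i = 0; i < gguf_get_n_tensors(gguf); i++) {
        printf("%-48s type=%-6s offset=%zu\n",
               gguf_get_tensor_name(gguf, i),
               ggml_type_name(gguf_get_tensor_type(gguf, i)),
               gguf_get_tensor_offset(gguf, i));
    }

    gguf_free(gguf);
    return 0;
}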

Tensor Info Structure

Each tensor’s metadata corresponds to the following C structures:

// gguf.cpp
struct gguf_tensor_info {
    struct ggml_tensor t;   // Contains type, ne[4], nb[4], name
    uint64_t offset;        // Byte offset within the data blob
};

// ggml.h (simplified)
struct ggml_tensor {
    enum ggml_type type;            // Quantization type: GGML_TYPE_Q4_K, GGML_TYPE_F16, etc.
    int64_t ne[GGML_MAX_DIMS];      // Shape: element count per dimension
    size_t  nb[GGML_MAX_DIMS];      // Byte stride: bytes per step in each dimension
    char name[GGML_MAX_NAME];       // Tensor name
    void * data;                    // Data pointer (NULL during header parsing)
};
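
To illustrate how type and ne[] together determine storage size, here is a hedged sketch built on the public helper ggml_row_size(); the real ggml_nbytes() additionally accounts for non-contiguous strides via nb[].

// Sketch: deriving a tensor's byte size from its shape and quantization type.
#include "ggml.h"

static size_t tensor_nbytes(enum ggml_type type, const int64_t ne[GGML_MAX_DIMS]) {
    // bytes for one row of ne[0] elements, rounded to whole quantization blocks
    size_t nbytes = ggml_row_size(type, ne[0]);
    // the higher dimensions simply repeat that row
    for (int d = 1; d < GGML_MAX_DIMS; d++) {
        nbytes *= ne[d];
    }
    return nbytes;
}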

KV Metadata

KV metadata stores various meta-information about the model. Common keys include:

| Key | Type | Description |
|---|---|---|
| general.architecture | string | Model architecture name, e.g. "llama", "qwen3" |
| general.name | string | Model display name |
| tokenizer.chat_template | string | Jinja2 chat template (covered in detail in later chapters) |
| {arch}.context_length | uint32 | Maximum context length used during training |
| {arch}.embedding_length | uint32 | Embedding dimension |
| {arch}.block_count | uint32 | Number of transformer layers |
| {arch}.attention.head_count | uint32 | Number of attention heads |

When loading a model, llama.cpp first reads general.architecture to determine the model architecture, then reads the corresponding hyperparameters based on that.
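
A hedged sketch of that two-step lookup using the gguf KV accessors (gguf_find_key, gguf_get_val_str, gguf_get_val_u32); error handling for missing keys is omitted, and llama.cpp's actual loader goes through llama_model_loader rather than calling these directly like this.

// Sketch: architecture-dependent hyperparameter lookup from KV metadata.
#include <stdint.h>
#include <stdio.h>
#include "gguf.h"

static void print_block_count(struct gguf_context * gguf) {
    // 1. read the architecture string, e.g. "llama" or "qwen3"
    const char * arch = gguf_get_val_str(gguf, gguf_find_key(gguf, "general.architecture"));

    // 2. build the architecture-prefixed key, e.g. "llama.block_count"
    char key[128];
    snprintf(key, sizeof(key), "%s.block_count", arch);

    // 3. read the per-architecture hyperparameter
    const uint32_t n_layers = gguf_get_val_u32(gguf, gguf_find_key(gguf, key));
    printf("arch=%s, %u transformer layers\n", arch, n_layers);
}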

Part C: Quantization Block Structure

The quantization type in GGUF determines how weights are stored. Each quantization type defines a fixed-size block containing a group of quantized weight values along with the scale/min parameters needed for dequantization.

Q4_0: The Simplest 4-bit Quantization

#define QK4_0 32  // 32 elements per block
typedef struct {
    ggml_half d;           // delta (scale), f16 format
    uint8_t qs[QK4_0 / 2]; // 16 bytes: each byte stores 2 4-bit quantized values
} block_q4_0;
// sizeof = 2 + 16 = 18 bytes -> 32 weights -> 4.5 bits/weight

Dequantization formula: $w = d \cdot (q - 8)$, where $q$ is a 4-bit unsigned integer (0..15); subtracting 8 yields a signed value.
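
As a concrete illustration, here is a sketch of dequantizing one Q4_0 block. It assumes the block_q4_0 definition above, the GGML_FP16_TO_FP32 conversion macro, and ggml's reference nibble layout (low nibbles hold elements 0..15, high nibbles elements 16..31).

// Sketch: dequantizing one Q4_0 block (32 weights) following w = d * (q - 8).
static void dequantize_block_q4_0(const block_q4_0 * x, float * y) {
    const float d = GGML_FP16_TO_FP32(x->d);     // per-block scale
    for (int j = 0; j < QK4_0/2; j++) {
        const int q0 = (x->qs[j] & 0x0F) - 8;    // low nibble  -> signed value in [-8, 7]
        const int q1 = (x->qs[j] >>   4) - 8;    // high nibble -> signed value in [-8, 7]
        y[j          ] = d * q0;
        y[j + QK4_0/2] = d * q1;
    }
}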

Q4_K: K-quant Series 4-bit Quantization

#define QK_K 256         // super-block: 256 elements
#define K_SCALE_SIZE 12  // bytes for scales

typedef struct {
    union {
        struct {
            ggml_half d;    // super-block scale (for quantizing scales)
            ggml_half dmin; // super-block scale (for quantizing mins)
        };
        ggml_half2 dm;
    };
    uint8_t scales[K_SCALE_SIZE]; // 12 bytes: scale and min for 8 sub-blocks, 6-bit quantized
    uint8_t qs[QK_K/2];           // 128 bytes: 256 4-bit quantized values
} block_q4_K;
// sizeof = 4 + 12 + 128 = 144 bytes -> 256 weights -> 4.5 bits/weight

Q4_K uses a two-level quantization structure:

Q4_K two-level quantization structure:

  • Super-block (256 weights):
      • d, dmin: 2 f16 global scales
      • scales[12]: scale/min for 8 sub-blocks (6-bit quantized)
      • qs[128]: 256 4-bit quantized values
  • Dequantizing a single weight:
      1. Extract the sub-block's scale and min from scales[]
      2. scale = d * raw_scale, min = dmin * raw_min
      3. weight = scale * q - min

Dequantization formula: $w = d \cdot \text{raw\_scale} \cdot q - \text{dmin} \cdot \text{raw\_min}$, where raw_scale and raw_min are decoded from the 6-bit encoding in scales[]. The term $\text{dmin} \cdot \text{raw\_min}$ is subtracted as a zero-point offset.
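
The sketch below shows the two-level structure in code form. decode_scale_min() and get_4bit() are hypothetical helpers standing in for the real 6-bit unpacking (get_scale_min_k4 in ggml) and the interleaved nibble layout, both of which are more involved than shown here.

// Simplified sketch of dequantizing one Q4_K super-block (8 sub-blocks x 32 weights).
static void dequantize_block_q4_K_sketch(const block_q4_K * x, float * y) {
    const float d    = GGML_FP16_TO_FP32(x->d);     // super-block scale for the scales
    const float dmin = GGML_FP16_TO_FP32(x->dmin);  // super-block scale for the mins

    for (int sb = 0; sb < 8; sb++) {
        uint8_t raw_scale, raw_min;
        decode_scale_min(x->scales, sb, &raw_scale, &raw_min); // hypothetical: 6-bit values for this sub-block

        const float scale = d    * raw_scale;  // second-level scale
        const float min   = dmin * raw_min;    // second-level zero-point offset

        for (int j = 0; j < 32; j++) {
            const int q = get_4bit(x->qs, sb*32 + j);  // hypothetical nibble accessor, 0..15
            y[sb*32 + j] = scale * q - min;
        }
    }
}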

Q8_0: 8-bit Quantization

#define QK8_0 32  // 32 elements per block
typedef struct {
    ggml_half d;       // delta (scale), f16 format
    int8_t qs[QK8_0];  // 32 bytes: 32 int8 quantized values
} block_q8_0;
// sizeof = 2 + 32 = 34 bytes -> 32 weights -> 8.5 bits/weight

Q8_0 is typically not used directly as a model storage format. Instead, it serves as a runtime intermediate format: during dot product execution, activations are quantized to Q8_0/Q8_K to compute dot products with the weights.
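
A sketch of that runtime quantization for one block of 32 activations, mirroring the approach of ggml's reference implementation (scale chosen from the block's absolute maximum, values rounded to int8); exact function names and the vectorized variants in ggml differ.

// Sketch: quantizing 32 float activations into one Q8_0 block.
#include <math.h>

static void quantize_block_q8_0(const float * x, block_q8_0 * y) {
    float amax = 0.0f;                       // absolute maximum over the block
    for (int j = 0; j < QK8_0; j++) {
        const float v = fabsf(x[j]);
        if (v > amax) amax = v;
    }
    const float d  = amax / 127.0f;          // scale so that amax maps to 127
    const float id = d ? 1.0f/d : 0.0f;      // inverse scale (guard against all-zero blocks)

    y->d = GGML_FP32_TO_FP16(d);
    for (int j = 0; j < QK8_0; j++) {
        y->qs[j] = (int8_t) roundf(x[j] * id);
    }
}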

Dot Product Pairing

When performing quantized matrix multiplication, the quantization types of the two operands need to be paired:

| Weight Type | Paired Dot Product Type | Description |
|---|---|---|
| Q4_0 | Q8_0 | 4-bit weights x 8-bit activations |
| Q4_K | Q8_K | K-quant 4-bit x K-quant 8-bit |
| Q5_K | Q8_K | K-quant 5-bit x K-quant 8-bit |
| Q6_K | Q8_K | K-quant 6-bit x K-quant 8-bit |
| F16 | F16 | Half-precision weights x half-precision activations |

When executing matrix multiplication, the backend looks up the weight tensor’s type in a table (type_traits_cpu[]) to find the corresponding vec_dot function and vec_dot_type. If the activation type doesn’t match vec_dot_type, a temporary buffer is allocated to perform type conversion first.
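
Conceptually, the lookup and conversion step looks like the sketch below; the names are modeled on the CPU backend's type traits table but heavily simplified, so treat it as pseudo-structure rather than the actual definition (the real struct carries more members such as from_float, and the real vec_dot signature also takes strides).

// Conceptual sketch of the weight/activation pairing lookup.
typedef void (*vec_dot_fn)(int n, float * s, const void * x, const void * y);

struct type_traits_cpu_sketch {
    enum ggml_type vec_dot_type;  // what the second operand must be converted to
    vec_dot_fn     vec_dot;       // kernel that multiplies the paired formats
};

// Pseudo-code for the mat-mul setup:
//   traits = type_traits_cpu[weights->type];
//   if (activations->type != traits.vec_dot_type) {
//       quantize activations into a temporary buffer of type traits.vec_dot_type;
//   }
//   call traits.vec_dot(...) row by row;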

Summary

This chapter covered two foundational aspects of llama.cpp:

  1. Tool ecosystem: llama-completion (legacy, fine-grained control), llama-cli (next-gen, most feature-rich), and llama-bench (pure performance measurement) play distinct roles; choose based on use case
  2. GGUF format: the five-section layout of Header -> KV Metadata -> Tensor Info -> Padding -> Data Blob, parsed in a single sequential read via gguf_init_from_file_ptr()
  3. Quantization blocks: struct definitions and dequantization formulas for Q4_0 (simple scale + 4-bit), Q4_K (two-level super-block/sub-block quantization), and Q8_0 (runtime intermediate format), plus the table-lookup mechanism for dot product pairing

The next article, Model Loading, traces how weights are loaded into GPU/CPU memory via mmap or buffer upload after GGUF file parsing is complete.