
Tool Landscape and GGUF Binary Parsing

Updated 2026-04-15

Series context: This is article #1 in the llama.cpp source code deep-dive series, focusing on the llama.cpp tool ecosystem and the C++ implementation details of the GGUF format. If you haven’t read the series overview yet, we recommend reading it first to build the big picture before diving into this chapter.

Part A: Tool Landscape

llama.cpp provides several executables under its tools/ directory. The three most important are llama-completion, llama-cli, and llama-bench, each targeting different use cases with distinct architectures.

Architecture Differences

Architecture comparison of llama.cpp's three core tools:

  • llama-completion (legacy tool): completion.cpp is the main entry point; it calls the sampling API (sampling.h) and the low-level C API (llama.h) directly
  • llama-cli (next-gen tool): cli.cpp is the main entry point; it embeds a server (server-context.h) and uses the task/response async pattern (server-task.h) on top of llama.h
  • llama-bench (benchmark): llama-bench.cpp is the main entry point; it uses llama.h directly and only measures pp/tg speed

llama-completion is the original text generation tool (formerly named main). It directly calls the low-level llama.h C API together with the common sampling helpers in sampling.h, manually managing the context, KV cache, and sampling loop. The code structure is straightforward, making it ideal for understanding llama.cpp’s low-level mechanics.

llama-cli is the next-generation interactive chat client. It internally embeds a server instance (pulling in server-context.h and server-task.h) and uses an asynchronous task/response pattern to process requests. It features an ASCII logo, a spinner loading animation, and supports multimodal inputs (images/audio) and speculative decoding.

llama-bench is a pure performance benchmarking tool. It doesn’t generate meaningful text; it only measures prompt processing (pp) and token generation (tg) throughput and outputs structured performance data.

Feature Comparison

| Feature | llama-completion | llama-cli | llama-bench |
|---|---|---|---|
| Text generation / completion | ✅ | ✅ | ❌ |
| Interactive chat | ✅ Basic | ✅ Full | ❌ |
| Prompt Cache | ✅ --prompt-cache | ❌ | ❌ |
| Self-Extend | ✅ --grp-attn-n/w | ❌ | ❌ |
| --in-prefix/suffix | ✅ | ❌ | ❌ |
| Speculative Decoding | ❌ | ✅ --draft | ❌ |
| Multimodal (image/audio) | ❌ | ✅ --image/audio | ❌ |
| Grammar / JSON Schema | ✅ | ✅ | ❌ |
| All sampling parameters | ✅ | ✅ | ❌ |
| Performance measurement (pp/tg t/s) | ❌ | ❌ | ✅ |

Note: --in-prefix/suffix is only registered under LLAMA_EXAMPLE_COMPLETION; --grp-attn-n is registered under LLAMA_EXAMPLE_COMPLETION and LLAMA_EXAMPLE_PASSKEY, while --grp-attn-w is only registered under LLAMA_EXAMPLE_COMPLETION (see common/arg.cpp). Neither llama-cli nor llama-bench supports these flags.

Recommendations

  • Everyday chat -> llama-cli: best UX, latest features, supports multimodal and speculative decoding
  • Fine-grained control / scripted batch processing -> llama-completion: prompt cache, Self-Extend, custom prefix/suffix
  • Hardware / backend performance evaluation -> llama-bench: pure pp and tg throughput testing

Part B: GGUF Field-by-Field Parsing

llama.cpp uses GGUF (GGML Universal File Format) as its model file format. Understanding the GGUF structure is the foundation for understanding all subsequent loading workflows.

Physical File Layout

A GGUF file is laid out sequentially from start to end as follows:

GGUF physical file layout:

  1. Header (fixed fields): magic 'GGUF' (4 bytes) | version: uint32 (v3) | n_tensors: int64 | n_kv: int64
  2. KV Metadata (variable length): key-value pairs such as architecture, chat_template, ...
  3. Tensor Info Array (variable length): per tensor: name + n_dims + ne[] + type + offset
  4. Alignment Padding: padded to the alignment boundary (default 32 bytes)
  5. Tensor Data Blob (multiple GB): binary weight data for all tensors, laid out sequentially by offset

Key takeaway: reading everything before the data blob (header, KV metadata, and tensor info) is enough to obtain every tensor's metadata (name, shape, type); the data blob itself never has to be read. For a 7B model, this metadata section is typically just a few MB, while the data blob is several GB.

Parsing Flow

The GGUF parsing entry point is gguf_init_from_file_ptr() (ggml/src/gguf.cpp), which reads in the following order:

// 1. Validate magic
char magic[4];
read(magic);  // Must be "GGUF"

// 2. Read version number
uint32_t version;
read(version);  // Currently supports v2, v3

// 3. Read tensor count and KV count
int64_t n_tensors, n_kv;
read(n_tensors);
read(n_kv);

// 4. Parse all KV metadata
for (int64_t i = 0; i < n_kv; i++) {
    // Read key (string) + value type + value
}

// 5. Parse all tensor info (does NOT read data!)
for (int64_t i = 0; i < n_tensors; i++) {
    read(name);          // Tensor name
    read(n_dims);        // Number of dimensions
    read(ne[0..3]);      // Element count per dimension (shape)
    read(type);          // Quantization type: Q4_0, Q4_K, F16, etc.
    // Compute stride from type
    nb[0] = ggml_type_size(type);
    nb[1] = nb[0] * (ne[0] / ggml_blck_size(type));
    read(offset);        // Offset within the data blob
}

// 6. Seek to data section start (after alignment padding)
fseek(file, GGML_PAD(current_pos, alignment), SEEK_SET);
ctx->offset = ftell(file);  // Record data section start offset

Note that in step 5, each tensor info entry contains only metadata, not the actual weight data. After parsing all tensor info entries, the entire header is complete. Whether the data blob is subsequently read depends on the no_alloc parameter passed by the caller.
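
To make the no_alloc distinction concrete, here is a minimal sketch (not taken from llama.cpp) that lists tensor metadata without ever touching the data blob. It assumes the public gguf accessors (gguf_init_from_file, gguf_get_n_tensors, gguf_get_tensor_name/type/offset), that their declarations live in gguf.h (older ggml versions exposed them from ggml.h), and uses "model.gguf" as a placeholder path.

// Minimal metadata-only read via the public gguf API (a sketch, not llama.cpp code).
#include <stdio.h>
#include "ggml.h"
#include "gguf.h"   // older ggml versions declare these in ggml.h instead

int main(void) {
    struct gguf_init_params params = {
        /*.no_alloc =*/ true,   // stop after header + KV + tensor info, skip the data blob
        /*.ctx      =*/ NULL,   // no ggml_context needed, we only want the metadata
    };
    struct gguf_context * gguf = gguf_init_from_file("model.gguf", params); // placeholder path
    if (!gguf) {
        return 1;
    }

    for (int64_t i = 0; i < gguf_get_n_tensors(gguf); i++) {
        printf("%-48s type=%-6s offset=%zu\n",
               gguf_get_tensor_name(gguf, i),
               ggml_type_name(gguf_get_tensor_type(gguf, i)),
               gguf_get_tensor_offset(gguf, i));
    }

    gguf_free(gguf);
    return 0;
}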

Tensor Info Structure

Each tensor’s metadata corresponds to the following C structures:

// gguf.cpp
struct gguf_tensor_info {
    struct ggml_tensor t;   // Contains type, ne[4], nb[4], name
    uint64_t offset;        // Byte offset within the data blob
};

// ggml.h (simplified)
struct ggml_tensor {
    enum ggml_type type;            // Quantization type: GGML_TYPE_Q4_K, GGML_TYPE_F16, etc.
    int64_t ne[GGML_MAX_DIMS];      // Shape: element count per dimension
    size_t  nb[GGML_MAX_DIMS];      // Byte stride: bytes per step in each dimension
    char name[GGML_MAX_NAME];       // Tensor name
    void * data;                    // Data pointer (NULL during header parsing)
};
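
To illustrate how type and ne[] together determine storage size, here is a hedged sketch built on the public helper ggml_row_size(); the real ggml_nbytes() additionally accounts for non-contiguous strides via nb[].

// Sketch: deriving a tensor's byte size from its shape and quantization type.
#include "ggml.h"

static size_t tensor_nbytes(enum ggml_type type, const int64_t ne[GGML_MAX_DIMS]) {
    // bytes for one row of ne[0] elements, rounded to whole quantization blocks
    size_t nbytes = ggml_row_size(type, ne[0]);
    // the higher dimensions simply repeat that row
    for (int d = 1; d < GGML_MAX_DIMS; d++) {
        nbytes *= ne[d];
    }
    return nbytes;
}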

KV Metadata

KV metadata stores various meta-information about the model. Common keys include:

| Key | Type | Description |
|---|---|---|
| general.architecture | string | Model architecture name, e.g. "llama", "qwen3" |
| general.name | string | Model display name |
| tokenizer.chat_template | string | Jinja2 chat template (covered in detail in later chapters) |
| {arch}.context_length | uint32 | Maximum context length used during training |
| {arch}.embedding_length | uint32 | Embedding dimension |
| {arch}.block_count | uint32 | Number of transformer layers |
| {arch}.attention.head_count | uint32 | Number of attention heads |

When loading a model, llama.cpp first reads general.architecture to determine the model architecture, then reads the corresponding hyperparameters based on that.
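
A hedged sketch of that two-step lookup using the gguf KV accessors (gguf_find_key, gguf_get_val_str, gguf_get_val_u32); error handling for missing keys is omitted, and llama.cpp's actual loader goes through llama_model_loader rather than calling these directly like this.

// Sketch: architecture-dependent hyperparameter lookup from KV metadata.
#include <stdint.h>
#include <stdio.h>
#include "gguf.h"

static void print_block_count(struct gguf_context * gguf) {
    // 1. read the architecture string, e.g. "llama" or "qwen3"
    const char * arch = gguf_get_val_str(gguf, gguf_find_key(gguf, "general.architecture"));

    // 2. build the architecture-prefixed key, e.g. "llama.block_count"
    char key[128];
    snprintf(key, sizeof(key), "%s.block_count", arch);

    // 3. read the per-architecture hyperparameter
    const uint32_t n_layers = gguf_get_val_u32(gguf, gguf_find_key(gguf, key));
    printf("arch=%s, %u transformer layers\n", arch, n_layers);
}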

Part C: Quantization Block Structure

The quantization type in GGUF determines how weights are stored. Each quantization type defines a fixed-size block containing a group of quantized weight values along with the scale/min parameters needed for dequantization.

Q4_0: The Simplest 4-bit Quantization

#define QK4_0 32  // 32 elements per block
typedef struct {
    ggml_half d;           // delta (scale), f16 format
    uint8_t qs[QK4_0 / 2]; // 16 bytes: each byte stores 2 4-bit quantized values
} block_q4_0;
// sizeof = 2 + 16 = 18 bytes -> 32 weights -> 4.5 bits/weight

Dequantization formula: $w = d \cdot (q - 8)$, where $q$ is a 4-bit unsigned integer (0..15); subtracting 8 yields a signed value.
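
As a concrete illustration, here is a sketch of dequantizing one Q4_0 block. It assumes the block_q4_0 definition above, the GGML_FP16_TO_FP32 conversion macro, and ggml's reference nibble layout (low nibbles hold elements 0..15, high nibbles elements 16..31).

// Sketch: dequantizing one Q4_0 block (32 weights) following w = d * (q - 8).
static void dequantize_block_q4_0(const block_q4_0 * x, float * y) {
    const float d = GGML_FP16_TO_FP32(x->d);     // per-block scale
    for (int j = 0; j < QK4_0/2; j++) {
        const int q0 = (x->qs[j] & 0x0F) - 8;    // low nibble  -> signed value in [-8, 7]
        const int q1 = (x->qs[j] >>   4) - 8;    // high nibble -> signed value in [-8, 7]
        y[j          ] = d * q0;
        y[j + QK4_0/2] = d * q1;
    }
}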

Q4_K: K-quant Series 4-bit Quantization

#define QK_K 256         // super-block: 256 elements
#define K_SCALE_SIZE 12  // bytes for scales

typedef struct {
    union {
        struct {
            ggml_half d;    // super-block scale (for quantizing scales)
            ggml_half dmin; // super-block scale (for quantizing mins)
        };
        ggml_half2 dm;
    };
    uint8_t scales[K_SCALE_SIZE]; // 12 bytes: scale and min for 8 sub-blocks, 6-bit quantized
    uint8_t qs[QK_K/2];           // 128 bytes: 256 4-bit quantized values
} block_q4_K;
// sizeof = 4 + 12 + 128 = 144 bytes -> 256 weights -> 4.5 bits/weight

Q4_K uses a two-level quantization structure:

Q4_K two-level quantization structure:

  • Super-block (256 weights):
      • d, dmin: 2 f16 global scales
      • scales[12]: scale/min for 8 sub-blocks (6-bit quantized)
      • qs[128]: 256 4-bit quantized values
  • Dequantizing a single weight:
      1. Extract the sub-block's scale and min from scales[]
      2. scale = d * raw_scale, min = dmin * raw_min
      3. weight = scale * q - min

Dequantization formula: $w = d \cdot \text{raw\_scale} \cdot q - \text{dmin} \cdot \text{raw\_min}$, where raw_scale and raw_min are decoded from the 6-bit encoding in scales[]. The term $\text{dmin} \cdot \text{raw\_min}$ is subtracted as a zero-point offset.
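
The sketch below shows the two-level structure in code form. decode_scale_min() and get_4bit() are hypothetical helpers standing in for the real 6-bit unpacking (get_scale_min_k4 in ggml) and the interleaved nibble layout, both of which are more involved than shown here.

// Simplified sketch of dequantizing one Q4_K super-block (8 sub-blocks x 32 weights).
static void dequantize_block_q4_K_sketch(const block_q4_K * x, float * y) {
    const float d    = GGML_FP16_TO_FP32(x->d);     // super-block scale for the scales
    const float dmin = GGML_FP16_TO_FP32(x->dmin);  // super-block scale for the mins

    for (int sb = 0; sb < 8; sb++) {
        uint8_t raw_scale, raw_min;
        decode_scale_min(x->scales, sb, &raw_scale, &raw_min); // hypothetical: 6-bit values for this sub-block

        const float scale = d    * raw_scale;  // second-level scale
        const float min   = dmin * raw_min;    // second-level zero-point offset

        for (int j = 0; j < 32; j++) {
            const int q = get_4bit(x->qs, sb*32 + j);  // hypothetical nibble accessor, 0..15
            y[sb*32 + j] = scale * q - min;
        }
    }
}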

Q8_0: 8-bit Quantization

#define QK8_0 32  // 32 elements per block
typedef struct {
    ggml_half d;       // delta (scale), f16 format
    int8_t qs[QK8_0];  // 32 bytes: 32 int8 quantized values
} block_q8_0;
// sizeof = 2 + 32 = 34 bytes -> 32 weights -> 8.5 bits/weight

Q8_0 is typically not used directly as a model storage format. Instead, it serves as a runtime intermediate format: during dot product execution, activations are quantized to Q8_0/Q8_K to compute dot products with the weights.
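
A sketch of that runtime quantization for one block of 32 activations, mirroring the approach of ggml's reference implementation (scale chosen from the block's absolute maximum, values rounded to int8); exact function names and the vectorized variants in ggml differ.

// Sketch: quantizing 32 float activations into one Q8_0 block.
#include <math.h>

static void quantize_block_q8_0(const float * x, block_q8_0 * y) {
    float amax = 0.0f;                       // absolute maximum over the block
    for (int j = 0; j < QK8_0; j++) {
        const float v = fabsf(x[j]);
        if (v > amax) amax = v;
    }
    const float d  = amax / 127.0f;          // scale so that amax maps to 127
    const float id = d ? 1.0f/d : 0.0f;      // inverse scale (guard against all-zero blocks)

    y->d = GGML_FP32_TO_FP16(d);
    for (int j = 0; j < QK8_0; j++) {
        y->qs[j] = (int8_t) roundf(x[j] * id);
    }
}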

Dot Product Pairing

When performing quantized matrix multiplication, the quantization types of the two operands need to be paired:

| Weight Type | Paired Dot Product Type | Description |
|---|---|---|
| Q4_0 | Q8_0 | 4-bit weights x 8-bit activations |
| Q4_K | Q8_K | K-quant 4-bit x K-quant 8-bit |
| Q5_K | Q8_K | K-quant 5-bit x K-quant 8-bit |
| Q6_K | Q8_K | K-quant 6-bit x K-quant 8-bit |
| F16 | F16 | Half-precision weights x half-precision activations |

When executing matrix multiplication, the backend looks up the weight tensor’s type in a table (type_traits_cpu[]) to find the corresponding vec_dot function and vec_dot_type. If the activation type doesn’t match vec_dot_type, a temporary buffer is allocated to perform type conversion first.
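
Conceptually, the lookup and conversion step looks like the sketch below; the names are modeled on the CPU backend's type traits table but heavily simplified, so treat it as pseudo-structure rather than the actual definition (the real struct carries more members such as from_float, and the real vec_dot signature also takes strides).

// Conceptual sketch of the weight/activation pairing lookup.
typedef void (*vec_dot_fn)(int n, float * s, const void * x, const void * y);

struct type_traits_cpu_sketch {
    enum ggml_type vec_dot_type;  // what the second operand must be converted to
    vec_dot_fn     vec_dot;       // kernel that multiplies the paired formats
};

// Pseudo-code for the mat-mul setup:
//   traits = type_traits_cpu[weights->type];
//   if (activations->type != traits.vec_dot_type) {
//       quantize activations into a temporary buffer of type traits.vec_dot_type;
//   }
//   call traits.vec_dot(...) row by row;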

Summary

This chapter covered two foundational aspects of llama.cpp:

  1. Tool ecosystem: llama-completion (legacy, fine-grained control), llama-cli (next-gen, most feature-rich), and llama-bench (pure performance measurement) play distinct roles; choose based on use case
  2. GGUF format: the five-section layout of Header -> KV Metadata -> Tensor Info -> Padding -> Data Blob, parsed in a single sequential read via gguf_init_from_file_ptr()
  3. Quantization blocks: struct definitions and dequantization formulas for Q4_0 (simple scale + 4-bit), Q4_K (two-level super-block/sub-block quantization), and Q8_0 (runtime intermediate format), plus the table-lookup mechanism for dot product pairing

The next article, Model Loading, traces how weights are loaded into GPU/CPU memory via mmap or buffer upload after GGUF file parsing is complete.