Tool Landscape and GGUF Binary Parsing
Updated 2026-04-15
Series context: This is article #1 in the llama.cpp source code deep-dive series, focusing on the llama.cpp tool ecosystem and the C++ implementation details of the GGUF format. If you haven't read the series overview yet, we recommend building up the big picture there first before diving into this chapter.
Part A: Tool Landscape
llama.cpp provides several executables under its tools/ directory. The three most important are llama-completion, llama-cli, and llama-bench, each targeting different use cases with distinct architectures.
Architecture Differences
llama-completion is the original text generation tool (formerly named main). It directly calls the low-level C APIs from llama.h and sampling.h, manually managing context, KV cache, and sampling. The code structure is straightforward, making it ideal for understanding llama.cpp's low-level mechanics.
llama-cli is the next-generation interactive chat client. It internally embeds a server instance (pulling in server-context.h and server-task.h), using a task/response reader pattern to process requests asynchronously. It features an ASCII logo, spinner loading animation, and supports multimodal inputs (images/audio) and speculative decoding.
llama-bench is a pure performance benchmarking tool. It doesn't generate meaningful text; it only measures prompt processing (pp) and token generation (tg) throughput, outputting structured performance data.
Feature Comparison
| Feature | llama-completion | llama-cli | llama-bench |
|---|---|---|---|
| Text generation / completion | ✅ | ✅ | ❌ |
| Interactive chat | ✅ Basic | ✅ Full | ❌ |
| Prompt Cache | ✅ `--prompt-cache` | ❌ | ❌ |
| Self-Extend | ✅ `--grp-attn-n/w` | ❌ | ❌ |
| Input prefix/suffix | ✅ `--in-prefix/suffix` | ❌ | ❌ |
| Speculative Decoding | ❌ | ✅ `--draft` | ❌ |
| Multimodal (image/audio) | ❌ | ✅ `--image`/`--audio` | ❌ |
| Grammar / JSON Schema | ✅ | ✅ | ❌ |
| All sampling parameters | ✅ | ✅ | ❌ |
| Performance measurement (pp/tg t/s) | ❌ | ❌ | ✅ |
Note: `--in-prefix/suffix` is only registered under `LLAMA_EXAMPLE_COMPLETION`; `--grp-attn-n` is registered under `LLAMA_EXAMPLE_COMPLETION` and `LLAMA_EXAMPLE_PASSKEY`, while `--grp-attn-w` is only registered under `LLAMA_EXAMPLE_COMPLETION` (see `common/arg.cpp`). Neither llama-cli nor llama-bench supports these flags.
Recommendations
- Everyday chat -> `llama-cli`: best UX, latest features, supports multimodal input and speculative decoding
- Fine-grained control / scripted batch processing -> `llama-completion`: prompt cache, Self-Extend, custom prefix/suffix
- Hardware / backend performance evaluation -> `llama-bench`: pure pp and tg throughput testing
Part B: GGUF Field-by-Field Parsing
llama.cpp uses GGUF (GGML Universal File Format) as its model file format. Understanding the GGUF structure is the foundation for understanding all subsequent loading workflows.
Physical File Layout
A GGUF file is laid out sequentially from start to end as follows:
1. Header: magic `"GGUF"`, version, tensor count (`n_tensors`), KV count (`n_kv`)
2. KV Metadata: model-level key-value pairs (architecture, hyperparameters, tokenizer, ...)
3. Tensor Info: name, shape, type, and data offset for each tensor
4. Padding: alignment padding up to the data section boundary
5. Data Blob: the raw weight data for all tensors
Key takeaway: you only need to read the header sections (everything before the data blob) to get all tensor metadata (name, shape, type); there is no need to read the data blob. For a 7B model, the header is typically just a few MB, while the data blob is several GB.
Parsing Flow
The GGUF parsing entry point is gguf_init_from_file_ptr() (ggml/src/gguf.cpp), which reads in the following order:
```cpp
// 1. Validate magic
char magic[4];
read(magic); // Must be "GGUF"

// 2. Read version number
uint32_t version;
read(version); // Currently supports v2, v3

// 3. Read tensor count and KV count
int64_t n_tensors, n_kv;
read(n_tensors);
read(n_kv);

// 4. Parse all KV metadata
for (int64_t i = 0; i < n_kv; i++) {
    // Read key (string) + value type + value
}

// 5. Parse all tensor info (does NOT read data!)
for (int64_t i = 0; i < n_tensors; i++) {
    read(name);     // Tensor name
    read(n_dims);   // Number of dimensions
    read(ne[0..3]); // Element count per dimension (shape)
    read(type);     // Quantization type: Q4_0, Q4_K, F16, etc.
    // Compute stride from type
    nb[0] = ggml_type_size(type);
    nb[1] = nb[0] * (ne[0] / ggml_blck_size(type));
    read(offset);   // Offset within the data blob
}

// 6. Seek to data section start (after alignment padding)
fseek(file, GGML_PAD(current_pos, alignment), SEEK_SET);
ctx->offset = ftell(file); // Record data section start offset
```
Note that in step 5, each tensor info entry contains only metadata, not the actual weight data. After parsing all tensor info entries, the entire header is complete. Whether the data blob is subsequently read depends on the no_alloc parameter passed by the caller.
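The fixed-size prelude of this flow (magic, version, counts) can be sketched as a minimal, self-contained parser. This is illustrative code, not the real `gguf_init_from_file_ptr()`: it reads from a byte buffer instead of a `FILE *`, and it assumes a little-endian host, matching the on-disk format. `parse_prelude` and `make_prelude` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

struct GgufPrelude {
    char     magic[4];
    uint32_t version;
    int64_t  n_tensors;
    int64_t  n_kv;
};

// Parse the fixed-size GGUF prelude from a little-endian byte buffer.
// Returns false if the buffer is too small or the magic does not match.
bool parse_prelude(const uint8_t * buf, size_t len, GgufPrelude & out) {
    const size_t need = 4 + 4 + 8 + 8; // magic + version + n_tensors + n_kv
    if (len < need) return false;
    memcpy(out.magic, buf, 4);
    if (memcmp(out.magic, "GGUF", 4) != 0) return false;
    memcpy(&out.version,   buf +  4, 4);
    memcpy(&out.n_tensors, buf +  8, 8);
    memcpy(&out.n_kv,      buf + 16, 8);
    return true;
}

// Helper to build a synthetic 24-byte prelude for testing.
std::vector<uint8_t> make_prelude(uint32_t version, int64_t n_tensors, int64_t n_kv) {
    std::vector<uint8_t> buf(24);
    memcpy(buf.data(),      "GGUF",     4);
    memcpy(buf.data() +  4, &version,   4);
    memcpy(buf.data() +  8, &n_tensors, 8);
    memcpy(buf.data() + 16, &n_kv,      8);
    return buf;
}
```

Note that `n_tensors` and `n_kv` are 64-bit here, matching GGUF v2/v3; this is one of the differences from the long-unsupported v1.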
Tensor Info Structure
Each tensor's metadata corresponds to the following C structures:
```cpp
// gguf.cpp
struct gguf_tensor_info {
    struct ggml_tensor t; // Contains type, ne[4], nb[4], name
    uint64_t offset;      // Byte offset within the data blob
};

// ggml.h (simplified)
struct ggml_tensor {
    enum ggml_type type;        // Quantization type: GGML_TYPE_Q4_K, GGML_TYPE_F16, etc.
    int64_t ne[GGML_MAX_DIMS];  // Shape: element count per dimension
    size_t  nb[GGML_MAX_DIMS];  // Byte stride: bytes per step in each dimension
    char name[GGML_MAX_NAME];   // Tensor name
    void * data;                // Data pointer (NULL during header parsing)
};
```
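As a worked example of how `ne[]` and the per-type block geometry determine storage size, the sketch below recomputes a 2-D tensor's byte count the same way the `nb[]` strides are derived above. `TypeTraits` and `tensor_nbytes` are illustrative names, not ggml API; the block geometry (18 bytes per 32-element Q4_0 block) matches the struct definitions later in this article.

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

struct TypeTraits {
    int64_t blck_size; // elements per block
    size_t  type_size; // bytes per block
};

constexpr TypeTraits Q4_0 = { 32, 18 }; // 32 weights per 18-byte block
constexpr TypeTraits F16  = {  1,  2 }; // 1 element per 2-byte "block"

// Bytes for a 2-D tensor of shape ne0 x ne1
// (ne0 is assumed to be a multiple of blck_size).
size_t tensor_nbytes(const TypeTraits & t, int64_t ne0, int64_t ne1) {
    size_t nb0 = t.type_size;               // bytes to step one block
    size_t nb1 = nb0 * (ne0 / t.blck_size); // bytes per row
    return nb1 * ne1;
}
```

For a 4096 x 4096 Q4_0 tensor this gives 2304 bytes per row (4096 / 32 blocks x 18 bytes), exactly the 4.5 bits/weight figure derived later.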
KV Metadata
KV metadata stores various meta-information about the model. Common keys include:
| Key | Type | Description |
|---|---|---|
| `general.architecture` | string | Model architecture name, e.g. "llama", "qwen3" |
| `general.name` | string | Model display name |
| `tokenizer.chat_template` | string | Jinja2 chat template (covered in detail in later chapters) |
| `{arch}.context_length` | uint32 | Maximum context length used during training |
| `{arch}.embedding_length` | uint32 | Embedding dimension |
| `{arch}.block_count` | uint32 | Number of transformer layers |
| `{arch}.attention.head_count` | uint32 | Number of attention heads |
When loading a model, llama.cpp first reads general.architecture to determine the model architecture, then reads the corresponding hyperparameters based on that.
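This two-step lookup can be sketched as follows, with a `std::map` standing in for the parsed KV store; `lookup_hparam` is a hypothetical helper, not a llama.cpp function.

```cpp
#include <cassert>
#include <map>
#include <string>

// Parsed KV metadata, simplified to string values.
using KvStore = std::map<std::string, std::string>;

// Resolve a "{arch}.<suffix>" hyperparameter key: first read
// general.architecture, then build the per-architecture key from it.
std::string lookup_hparam(const KvStore & kv, const std::string & suffix) {
    const std::string & arch = kv.at("general.architecture");
    return kv.at(arch + "." + suffix);
}
```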
Part C: Quantization Block Structure
The quantization type in GGUF determines how weights are stored. Each quantization type defines a fixed-size block containing a group of quantized weight values along with the scale/min parameters needed for dequantization.
Q4_0: The Simplest 4-bit Quantization
```cpp
#define QK4_0 32 // 32 elements per block

typedef struct {
    ggml_half d;           // delta (scale), f16 format
    uint8_t qs[QK4_0 / 2]; // 16 bytes: each byte stores 2 4-bit quantized values
} block_q4_0;
// sizeof = 2 + 16 = 18 bytes -> 32 weights -> 4.5 bits/weight
```
Dequantization formula: `w = d * (q - 8)`, where `q` is a 4-bit unsigned integer (0~15); subtracting 8 yields a signed value in [-8, 7].
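The formula can be exercised with a small sketch that dequantizes one block. A `float` stands in for the f16 scale, and the nibble layout (low nibbles hold elements 0..15, high nibbles hold elements 16..31) follows ggml's CPU reference dequantization; treat that layout as an assumption to verify against the source.

```cpp
#include <cassert>
#include <cstdint>

constexpr int QK4_0 = 32;

struct BlockQ4_0 {
    float   d;             // scale (f16 in the real struct)
    uint8_t qs[QK4_0 / 2]; // 16 bytes, two 4-bit values per byte
};

// w = d * (q - 8), applied to both nibbles of each byte.
void dequantize_q4_0(const BlockQ4_0 & b, float out[QK4_0]) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        int q_lo = (b.qs[j] & 0x0F) - 8; // signed range [-8, 7]
        int q_hi = (b.qs[j] >> 4)   - 8;
        out[j]             = b.d * q_lo;
        out[j + QK4_0 / 2] = b.d * q_hi;
    }
}
```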
Q4_K: K-quant Series 4-bit Quantization
```cpp
#define QK_K 256        // super-block: 256 elements
#define K_SCALE_SIZE 12 // bytes for scales

typedef struct {
    union {
        struct {
            ggml_half d;    // super-block scale (for quantizing scales)
            ggml_half dmin; // super-block scale (for quantizing mins)
        };
        ggml_half2 dm;
    };
    uint8_t scales[K_SCALE_SIZE]; // 12 bytes: scale and min for 8 sub-blocks, 6-bit quantized
    uint8_t qs[QK_K/2];           // 128 bytes: 256 4-bit quantized values
} block_q4_K;
// sizeof = 4 + 12 + 128 = 144 bytes -> 256 weights -> 4.5 bits/weight
```
Q4_K uses a two-level quantization structure: the super-block stores two f16 scales (`d` and `dmin`), and each of the 8 sub-blocks (32 elements each) stores its own 6-bit quantized scale and min packed into `scales[]`.
Dequantization formula: `w = d * raw_scale * q - dmin * raw_min`, where `raw_scale` and `raw_min` are decoded from the 6-bit encoding in `scales[]`. The term `dmin * raw_min` is subtracted as a zero-point offset.
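As a toy check of the formula, with `raw_scale`/`raw_min` passed in directly (the 6-bit unpacking from `scales[]` is omitted, and floats stand in for the f16 fields):

```cpp
#include <cassert>

// w = d * raw_scale * q - dmin * raw_min, for one element of one sub-block.
// raw_scale / raw_min are assumed to be already decoded from scales[].
float dequant_q4_k_elem(float d, float dmin, int raw_scale, int raw_min, int q) {
    return d * raw_scale * q - dmin * raw_min;
}
```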
Q8_0: 8-bit Quantization
```cpp
#define QK8_0 32 // 32 elements per block

typedef struct {
    ggml_half d;      // delta (scale), f16 format
    int8_t qs[QK8_0]; // 32 bytes: 32 int8 quantized values
} block_q8_0;
// sizeof = 2 + 32 = 34 bytes -> 32 weights -> 8.5 bits/weight
```
Q8_0 is typically not used directly as a model storage format. Instead, it serves as a runtime intermediate format: during dot product execution, activations are quantized to Q8_0/Q8_K to compute dot products with the weights.
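That runtime quantization step can be sketched for a single block as below: pick `d = max|x| / 127`, then round each `x_i / d` to int8. A `float` scale stands in for f16, and `quantize_q8_0` is an illustrative name mirroring ggml's per-row quantization, not its exact implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cmath>

constexpr int QK8_0 = 32;

struct BlockQ8_0 {
    float  d;         // scale (f16 in the real struct)
    int8_t qs[QK8_0]; // 32 int8 quantized values
};

// Quantize 32 float activations into one Q8_0 block.
BlockQ8_0 quantize_q8_0(const float x[QK8_0]) {
    float amax = 0.0f; // absolute maximum over the block
    for (int i = 0; i < QK8_0; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
    }
    BlockQ8_0 b{};
    b.d = amax / 127.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < QK8_0; ++i) {
        b.qs[i] = (int8_t) std::lround(x[i] * id);
    }
    return b;
}
```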
Dot Product Pairing
When performing quantized matrix multiplication, the quantization types of the two operands need to be paired:
| Weight Type | Paired Dot Product Type | Description |
|---|---|---|
| Q4_0 | Q8_0 | 4-bit weights x 8-bit activations |
| Q4_K | Q8_K | K-quant 4-bit x K-quant 8-bit |
| Q5_K | Q8_K | K-quant 5-bit x K-quant 8-bit |
| Q6_K | Q8_K | K-quant 6-bit x K-quant 8-bit |
| F16 | F16 | Half-precision weights x half-precision activations |
When executing matrix multiplication, the backend looks up the weight tensorβs type in a table (type_traits_cpu[]) to find the corresponding vec_dot function and vec_dot_type. If the activation type doesnβt match vec_dot_type, a temporary buffer is allocated to perform type conversion first.
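The dispatch described above can be sketched as a small traits table indexed by weight type. The names here (`TypeTraits`, `needs_conversion`, the stub `vec_dot` functions) are illustrative, not the real `type_traits_cpu[]` entries from ggml's CPU backend.

```cpp
#include <cassert>
#include <cstddef>

enum class Type { Q4_0, Q8_0, F16 };

using VecDotFn = float (*)(size_t n, const void * weights, const void * activations);

struct TypeTraits {
    VecDotFn vec_dot;      // kernel for this weight type
    Type     vec_dot_type; // activation type the kernel expects
};

// Stub kernels; the real ones do SIMD dot products over quantized blocks.
static float vec_dot_q4_0_q8_0(size_t, const void *, const void *) { return 0.0f; }
static float vec_dot_f16_f16 (size_t, const void *, const void *) { return 0.0f; }

// Indexed by weight type, in enum order, like type_traits_cpu[].
static const TypeTraits traits[] = {
    /* Q4_0 */ { vec_dot_q4_0_q8_0, Type::Q8_0 },
    /* Q8_0 */ { vec_dot_q4_0_q8_0, Type::Q8_0 },
    /* F16  */ { vec_dot_f16_f16,   Type::F16  },
};

// True if the activations must be converted into vec_dot_type first.
bool needs_conversion(Type weight_type, Type act_type) {
    return traits[(int) weight_type].vec_dot_type != act_type;
}
```

In this sketch, F32 activations paired with Q4_0 weights would trigger a conversion to Q8_0 before the kernel runs, which is the temporary-buffer path described above.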
Summary
This chapter covered two foundational aspects of llama.cpp:
- Tool ecosystem: `llama-completion` (legacy, fine-grained control), `llama-cli` (next-gen, most feature-rich), and `llama-bench` (pure performance measurement); three tools with distinct roles, chosen based on use case
- GGUF format: the five-section layout of Header -> KV Metadata -> Tensor Info -> Padding -> Data Blob, parsed in a single sequential read via `gguf_init_from_file_ptr()`
- Quantization blocks: struct definitions and dequantization formulas for Q4_0 (simple scale + 4-bit), Q4_K (two-level super-block/sub-block quantization), and Q8_0 (runtime intermediate format), plus the table-lookup mechanism for dot product pairing
The next article, Model Loading, traces how weights are loaded into GPU/CPU memory via mmap or buffer upload after GGUF file parsing is complete.