llama.cpp Execution Pipeline Overview
Updated 2026-04-15
Setting the Stage
If you have already read Ollama Inference Journey, you understand the conceptual flow from ollama run to model output. That article stopped at the llama.cpp boundary — this series picks up from there, tracing the C/C++ implementation details function by function.
This is article #0 (overview) of the llama.cpp Source Code Deep Dive series. The goal is to build a global map: starting from a GGUF file on disk, what core stages does a token pass through before it becomes a character on your terminal? What is the bottleneck at each stage? And which of the seven subsequent articles covers which segment?
llama.cpp Project Overview
llama.cpp is a pure C/C++ LLM inference engine, positioned as a high-performance, cross-platform, easy-to-deploy local inference solution. Its key characteristics include:
- Broad architecture support: supports 125 known model architectures (LLaMA, Qwen, Mistral, Gemma, DeepSeek, etc.), dispatched through a unified build_<arch>() interface (see the sketch after this list)
- Multiple quantization formats: Q4_0, Q4_K, Q8_0, IQ2, and more, offering flexible trade-offs among model size, inference speed, and accuracy
- Multi-backend heterogeneous computing: CUDA, Metal, Vulkan, OpenCL, CPU, and other backends, supporting layer-wise execution of a single model across multiple devices
- Core design philosophy: striking a balance between generality and performance — covering as many models and hardware as possible while achieving near-optimal inference speed on each combination
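To make the unified dispatch from the first bullet concrete, here is a compilable toy sketch of the pattern: a single switch maps the architecture tag read from GGUF metadata to a per-architecture graph builder. All type and function names below (cgraph, build_llama, and so on) are illustrative stand-ins, not llama.cpp's actual internal symbols.

```cpp
// Hypothetical, simplified sketch of the per-architecture dispatch pattern.
// Real llama.cpp switches over an LLM_ARCH_* enum inside its graph-build code;
// the names here are illustrative only.
#include <cstdio>
#include <stdexcept>

struct cgraph { const char * desc; };  // stand-in for ggml's compute graph

enum class llm_arch { LLAMA, QWEN2, GEMMA /* ... ~125 entries in the real enum */ };

// Each builder assembles that family's Transformer blocks from shared
// building-block helpers (attention, FFN, norm, RoPE, ...).
static cgraph build_llama() { return { "llama-style graph" }; }
static cgraph build_qwen2() { return { "qwen2-style graph" }; }
static cgraph build_gemma() { return { "gemma-style graph" }; }

// Unified entry point: one switch maps the model's architecture tag
// (read from GGUF metadata) to the matching graph builder.
static cgraph build_graph(llm_arch arch) {
    switch (arch) {
        case llm_arch::LLAMA: return build_llama();
        case llm_arch::QWEN2: return build_qwen2();
        case llm_arch::GEMMA: return build_gemma();
        default: throw std::runtime_error("unsupported architecture");
    }
}

int main() {
    std::printf("%s\n", build_graph(llm_arch::QWEN2).desc);
}
```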
End-to-End Execution Pipeline
The interactive diagram below shows the complete execution pipeline of llama.cpp. Click on any stage to see the specific steps it contains and the corresponding source code sections.
Complete Data Flow
From GGUF file to text output, a token’s full journey goes through the following 14 steps:
- GGUF file is parsed to extract tensor metadata and quantized weights
- Model loading sends weights to compute devices via mmap or buffer upload
- Backend initialization detects available hardware and allocates Transformer layers
- Warmup primes the GPU and probes Flash Attention support
- Tokenization converts human text into a token sequence
- Batch/Ubatch organizes the token sequence into compute units
- Build Graph constructs the model's compute graph, dispatching among the 125 supported architectures
- Backend Scheduling assigns the compute graph to multiple devices and splits it
- Op Fusion applies hardware-specific graph optimizations within each backend
- Tensor Allocation minimizes VRAM usage through reference counting
- Execution iterates over splits, copies data across devices, and computes asynchronously
- Sampling selects the next token from the probability distribution
- Speculative Decoding uses a small model to accelerate decoding by the large model
- Context Management maintains the lifecycle of the KV cache
Steps 1-4 are a one-time startup phase; step 5 runs on each new request; steps 6-11 loop during the Prefill and Decode phases (Prefill processes the entire prompt, Decode processes only 1 new token at a time); steps 12-14 run after each Decode step.
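The loop structure described above can be summarized in a short sketch. Everything here is a hypothetical stand-in rather than the llama.h API; it only mirrors the control flow: tokenize once per request, run Prefill over the whole prompt, then alternate sampling and single-token Decode steps.

```cpp
// Illustrative sketch of the per-request loop (steps 5-14). All helpers are
// hypothetical stubs so the control flow compiles and can be read end to end.
#include <cstdio>
#include <string>
#include <vector>

using token = int;

// Stand-ins for the real subsystems (tokenizer, batched decode, sampler).
static std::vector<token> tokenize(const std::string & text) { return std::vector<token>(text.size(), 1); }
static void decode(const std::vector<token> &) { /* steps 6-11: batch, graph, schedule, execute */ }
static token sample() { return 2; }                 // step 12: pick the next token on the CPU
static bool is_eog(token t) { return t == 0; }      // end-of-generation check
static std::string detokenize(token) { return "x"; }

static void generate(const std::string & prompt, int n_predict) {
    // Step 5: tokenization runs once per request.
    std::vector<token> tokens = tokenize(prompt);

    // Prefill: the whole prompt goes through steps 6-11 as one large batch,
    // which is why this phase is compute-bound.
    decode(tokens);

    // Decode: one new token per iteration, attending to the entire KV cache,
    // which is why this phase is memory-bound.
    for (int i = 0; i < n_predict; ++i) {
        token next = sample();            // step 12 (13-14 follow in the real engine)
        if (is_eog(next)) break;
        std::printf("%s", detokenize(next).c_str());
        decode({ next });                 // steps 6-11 again, with a batch of size 1
    }
}

int main() { generate("hello", 4); }
```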
Performance Characteristics by Stage
| Stage | Bottleneck | Typical Latency |
|---|---|---|
| Model Loading | Disk I/O (mmap) or PCIe bandwidth (GPU upload) | Seconds |
| Warmup | GPU kernel compilation + VRAM allocation | Seconds |
| Prefill | Compute-bound (matrix multiplication), benefits from GPU parallelism | Proportional to prompt length |
| Decode | Memory-bound (KV cache reads), processes only 1 token at a time | Tens of milliseconds per token |
| Sampling | Runs on CPU, usually not a bottleneck | Microseconds |
| Context Shift | KV cache metadata operations + K-shift | Milliseconds |
Prefill and Decode have fundamentally different performance characteristics: Prefill is compute-bound (large matrix multiplications that can be highly parallelized), while Decode is memory-bound (only a single token’s vector computes attention against the entire KV cache, reducing matrix multiplication to matrix-vector multiplication). This is why llama.cpp uses different threading strategies for the two (n_threads_batch vs n_threads).
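As a concrete illustration, the two thread counts are exposed as separate fields on the context parameters. A minimal sketch, assuming a recent llama.h where llama_context_params carries n_threads and n_threads_batch (check your header version; defaults and field types have shifted over time):

```cpp
// Minimal sketch: configure separate thread counts for Decode and Prefill.
// Field names come from llama.h, but verify against your installed version.
#include "llama.h"

llama_context_params make_ctx_params() {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_threads       = 8;   // Decode: one token per step, memory-bound,
                                   // extra threads quickly stop helping
    cparams.n_threads_batch = 16;  // Prefill / large batches: compute-bound,
                                   // scales with more cores
    return cparams;
}
```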
Series Navigation
This series consists of 8 articles, progressively diving from the overview into each subsystem:
- #0 Execution Pipeline Overview (this article): end-to-end panorama, performance characteristics by stage, series roadmap
- #1 Tool Landscape & GGUF Binary Parsing: llama.cpp’s tool ecosystem and field-by-field dissection of the GGUF format
- #2 Model Loading: From File to Device: the no_alloc trick, mmap/read dual loading paths, GPU layer allocation algorithm
- #3 Warmup, Tokenization & Chat Template: why warmup is needed, tokenizer implementation, Jinja2 template rendering
- #4 Batch, Ubatch & the Decode Main Loop: two-level batch splitting algorithm, parallel sequence decoding
- #5 Compute Graph Construction & Architecture Dispatch: the switch dispatch for 125 architectures, building-block interfaces, graph reuse
- #6 Backend Scheduling, Op Fusion & Memory Allocation: five-pass scanning algorithm, fusion patterns, three tensor lifetime categories
- #7 Execution, Sampling & Context Management: MoE selective copy, sampler chain, speculative decoding, KV cache management
Summary
llama.cpp is not just an inference engine: it is a carefully designed system that balances generality (125 known architectures, multiple quantization formats, cross-platform backends) with performance (graph reuse, op fusion, pipeline parallelism, MoE selective copy, speculative decoding).
The next seven articles will dissect the source code implementation of each stage one by one. Reading in series order is recommended, as later articles reference concepts established in earlier ones. If you are particularly interested in a specific subsystem (such as sampling or KV cache), you can also jump directly to it — each article is self-contained and will recap prerequisite knowledge where necessary.