llama.cpp Execution Pipeline Overview
Updated 2026-04-15
Setting the Stage
If you have already read Ollama Inference Journey, you understand the conceptual flow from ollama run to model output. That article stopped at the llama.cpp boundary — this series picks up from there, tracing the C/C++ implementation details function by function.
This is article #0 (overview) of the llama.cpp Source Code Deep Dive series. The goal is to build a global map: starting from a GGUF file on disk, what core stages does a token pass through before it becomes a character on your terminal? What is the bottleneck at each stage? And which of the seven subsequent articles covers which segment?
llama.cpp Project Overview
llama.cpp is a pure C/C++ LLM inference engine, positioned as a high-performance, cross-platform, easy-to-deploy local inference solution. Its key characteristics include:
- Broad architecture support: supports 125 known model architectures (LLaMA, Qwen, Mistral, Gemma, DeepSeek, etc.), dispatched through a unified build_<arch>() interface (see the sketch after this list)
- Multiple quantization formats: Q4_0, Q4_K, Q8_0, IQ2, and more, offering flexible trade-offs among model size, inference speed, and accuracy
- Multi-backend heterogeneous computing: CUDA, Metal, Vulkan, OpenCL, CPU, and other backends, supporting layer-wise execution of a single model across multiple devices
- Core design philosophy: striking a balance between generality and performance — covering as many models and hardware as possible while achieving near-optimal inference speed on each combination
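To make the unified dispatch from the first bullet concrete, here is a compilable toy sketch of the pattern: a single switch maps the architecture tag read from GGUF metadata to a per-architecture graph builder. All type and function names below (cgraph, build_llama, and so on) are illustrative stand-ins, not llama.cpp's actual internal symbols.

```cpp
// Hypothetical, simplified sketch of the per-architecture dispatch pattern.
// Real llama.cpp switches over an LLM_ARCH_* enum inside its graph-build code;
// the names here are illustrative only.
#include <cstdio>
#include <stdexcept>

struct cgraph { const char * desc; };  // stand-in for ggml's compute graph

enum class llm_arch { LLAMA, QWEN2, GEMMA /* ... ~125 entries in the real enum */ };

// Each builder assembles that family's Transformer blocks from shared
// building-block helpers (attention, FFN, norm, RoPE, ...).
static cgraph build_llama() { return { "llama-style graph" }; }
static cgraph build_qwen2() { return { "qwen2-style graph" }; }
static cgraph build_gemma() { return { "gemma-style graph" }; }

// Unified entry point: one switch maps the model's architecture tag
// (read from GGUF metadata) to the matching graph builder.
static cgraph build_graph(llm_arch arch) {
    switch (arch) {
        case llm_arch::LLAMA: return build_llama();
        case llm_arch::QWEN2: return build_qwen2();
        case llm_arch::GEMMA: return build_gemma();
        default: throw std::runtime_error("unsupported architecture");
    }
}

int main() {
    std::printf("%s\n", build_graph(llm_arch::QWEN2).desc);
}
```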
End-to-End Execution Pipeline
The interactive diagram below shows the complete execution pipeline of llama.cpp. Click on any stage to see the specific steps it contains and the corresponding source code sections.
Complete Data Flow
From GGUF file to text output, a token’s full journey goes through the following 14 steps:
- GGUF file is parsed to extract tensor metadata and quantized weights
- Model loading sends weights to compute devices via mmap or buffer upload
- Backend initialization detects available hardware and allocates Transformer layers
- Warmup primes the GPU and probes Flash Attention support
- Tokenization converts human text into a token sequence
- Batch/Ubatch organizes the token sequence into compute units
- Build Graph constructs the model's compute graph, dispatching among the 125 supported architectures
- Backend Scheduling assigns the compute graph to multiple devices and splits it
- Op Fusion applies hardware-specific graph optimizations within each backend
- Tensor Allocation minimizes VRAM usage through reference counting
- Execution iterates over splits, copies data across devices, and computes asynchronously
- Sampling selects the next token from the probability distribution
- Speculative Decoding uses a small model to accelerate decoding by the large model
- Context Management maintains the lifecycle of the KV cache
Steps 1-4 are a one-time startup phase; step 5 runs on each new request; steps 6-11 loop during the Prefill and Decode phases (Prefill processes the entire prompt, Decode processes only 1 new token at a time); steps 12-14 run after each Decode step.
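The loop structure described above can be summarized in a short sketch. Everything here is a hypothetical stand-in rather than the llama.h API; it only mirrors the control flow: tokenize once per request, run Prefill over the whole prompt, then alternate sampling and single-token Decode steps.

```cpp
// Illustrative sketch of the per-request loop (steps 5-14). All helpers are
// hypothetical stubs so the control flow compiles and can be read end to end.
#include <cstdio>
#include <string>
#include <vector>

using token = int;

// Stand-ins for the real subsystems (tokenizer, batched decode, sampler).
static std::vector<token> tokenize(const std::string & text) { return std::vector<token>(text.size(), 1); }
static void decode(const std::vector<token> &) { /* steps 6-11: batch, graph, schedule, execute */ }
static token sample() { return 2; }                 // step 12: pick the next token on the CPU
static bool is_eog(token t) { return t == 0; }      // end-of-generation check
static std::string detokenize(token) { return "x"; }

static void generate(const std::string & prompt, int n_predict) {
    // Step 5: tokenization runs once per request.
    std::vector<token> tokens = tokenize(prompt);

    // Prefill: the whole prompt goes through steps 6-11 as one large batch,
    // which is why this phase is compute-bound.
    decode(tokens);

    // Decode: one new token per iteration, attending to the entire KV cache,
    // which is why this phase is memory-bound.
    for (int i = 0; i < n_predict; ++i) {
        token next = sample();            // step 12 (13-14 follow in the real engine)
        if (is_eog(next)) break;
        std::printf("%s", detokenize(next).c_str());
        decode({ next });                 // steps 6-11 again, with a batch of size 1
    }
}

int main() { generate("hello", 4); }
```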
Performance Characteristics by Stage
| Stage | Bottleneck | Typical Latency |
|---|---|---|
| Model Loading | Disk I/O (mmap) or PCIe bandwidth (GPU upload) | Seconds |
| Warmup | GPU kernel compilation + VRAM allocation | Seconds |
| Prefill | Compute-bound (matrix multiplication), benefits from GPU parallelism | Proportional to prompt length |
| Decode | Memory-bound (KV cache reads), processes only 1 token at a time | Tens of milliseconds per token |
| Sampling | Runs on CPU, usually not a bottleneck | Microseconds |
| Context Shift | KV cache metadata operations + K-shift | Milliseconds |
Prefill and Decode have fundamentally different performance characteristics: Prefill is compute-bound (large matrix multiplications that can be highly parallelized), while Decode is memory-bound (only a single token’s vector computes attention against the entire KV cache, reducing matrix multiplication to matrix-vector multiplication). This is why llama.cpp uses different threading strategies for the two (n_threads_batch vs n_threads).
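As a concrete illustration, the two thread counts are exposed as separate fields on the context parameters. A minimal sketch, assuming a recent llama.h where llama_context_params carries n_threads and n_threads_batch (check your header version; defaults and field types have shifted over time):

```cpp
// Minimal sketch: configure separate thread counts for Decode and Prefill.
// Field names come from llama.h, but verify against your installed version.
#include "llama.h"

llama_context_params make_ctx_params() {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_threads       = 8;   // Decode: one token per step, memory-bound,
                                   // extra threads quickly stop helping
    cparams.n_threads_batch = 16;  // Prefill / large batches: compute-bound,
                                   // scales with more cores
    return cparams;
}
```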
Series Navigation
This series consists of 8 articles, progressively diving from the overview into each subsystem:
- #0 Execution Pipeline Overview (this article): end-to-end panorama, performance characteristics by stage, series roadmap
- #1 Tool Landscape & GGUF Binary Parsing: llama.cpp’s tool ecosystem and field-by-field dissection of the GGUF format
- #2 Model Loading: From File to Device: the no_alloc trick, mmap/read dual loading paths, GPU layer allocation algorithm
- #3 Warmup, Tokenization & Chat Template: why warmup is needed, tokenizer implementation, Jinja2 template rendering
- #4 Batch, Ubatch & the Decode Main Loop: two-level batch splitting algorithm, parallel sequence decoding
- #5 Compute Graph Construction & Architecture Dispatch: the switch dispatch for 125 architectures, building-block interfaces, graph reuse
- #6 Backend Scheduling, Op Fusion & Memory Allocation: five-pass scanning algorithm, fusion patterns, three tensor lifetime categories
- #7 Execution, Sampling & Context Management: MoE selective copy, sampler chain, speculative decoding, KV cache management
Summary
llama.cpp is not just an inference engine: it is a carefully designed system that balances generality (125 known architectures, multiple quantization formats, cross-platform backends) with performance (graph reuse, op fusion, pipeline parallelism, MoE selective copy, speculative decoding).
The next seven articles will dissect the source code implementation of each stage one by one. Reading in series order is recommended, as later articles reference concepts established in earlier ones. If you are particularly interested in a specific subsystem (such as sampling or KV cache), you can also jump directly to it — each article is self-contained and will recap prerequisite knowledge where necessary.