
llama.cpp Execution Pipeline Overview

Updated 2026-04-15

Setting the Stage

If you have already read Ollama Inference Journey, you understand the conceptual flow from ollama run to model output. That article stopped at the llama.cpp boundary — this series picks up from there, tracing the C/C++ implementation details function by function.

This is article #0 (overview) of the llama.cpp Source Code Deep Dive series. The goal is to build a global map: starting from a GGUF file on disk, what core stages does a token pass through before it becomes a character on your terminal? What is the bottleneck at each stage? And which of the seven subsequent articles covers which segment?

llama.cpp Project Overview

llama.cpp is a pure C/C++ LLM inference engine, positioned as a high-performance, cross-platform, easy-to-deploy local inference solution. Its key characteristics include:

  • Broad architecture support: 125 known model architectures (LLaMA, Qwen, Mistral, Gemma, DeepSeek, etc.), dispatched through a unified build_<arch>() interface
  • Multiple quantization formats: Q4_0, Q4_K, Q8_0, IQ2, and more, offering flexible trade-offs among model size, inference speed, and accuracy
  • Multi-backend heterogeneous computing: CUDA, Metal, Vulkan, OpenCL, CPU, and other backends, supporting layer-wise execution of a single model across multiple devices
  • Core design philosophy: striking a balance between generality and performance — covering as many models and hardware as possible while achieving near-optimal inference speed on each combination
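As a concrete example of the quantization trade-off mentioned above, Q4_0 is the simplest format: 32 weights share one scale, and each weight is stored as a 4-bit value with an implicit bias of 8. The sketch below is a simplified, self-contained illustration, not the library's actual code (the real llama.cpp struct stores the scale as fp16 and dequantizes with SIMD kernels):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified sketch of the Q4_0 block layout: 32 weights share one scale;
// each weight is a 4-bit unsigned value with a bias of 8. (The real llama.cpp
// struct stores the scale as fp16; float keeps this sketch self-contained.)
struct BlockQ4_0 {
    float   d;       // per-block scale
    uint8_t qs[16];  // 32 x 4-bit quants, packed two per byte
};

// Dequantize one block into 32 floats: x = (q - 8) * d.
// Low nibbles map to elements 0..15, high nibbles to elements 16..31.
std::vector<float> dequantize_q4_0(const BlockQ4_0& b) {
    std::vector<float> out(32);
    for (int i = 0; i < 16; ++i) {
        out[i]      = ((b.qs[i] & 0x0F) - 8) * b.d;  // low nibble
        out[i + 16] = ((b.qs[i] >> 4)   - 8) * b.d;  // high nibble
    }
    return out;
}
```

At 4.5 bits per weight (4 bits plus the amortized scale), a 7B model shrinks from ~14 GB in fp16 to ~4 GB, which is the size/speed/accuracy trade-off the bullet above refers to.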

End-to-End Execution Pipeline

The execution pipeline of llama.cpp falls into four phases:

  • 🔵 Startup (one-time): GGUF File → Model Loading → Backend Init → Context Init + Warmup
  • 🟢 Request Processing: User Input → Chat Template → Tokenization
  • 🟡 Prefill/Decode Loop: Batch Construction → Split into Ubatch → Build Graph → Backend Scheduling → Op Fusion → Tensor Allocation → Execute Splits → Write KV Cache
  • 🔴 Sampling & Output (runs per token): Output Logits → Sampling Chain → Grammar Constraint → Token-to-Text Output → Speculative Decoding (optional) → Context Shift

Complete Data Flow

From GGUF file to text output, a token’s full journey goes through the following 14 steps:

  1. GGUF file is parsed to extract tensor metadata and quantized weights
  2. Model loading sends weights to compute devices via mmap or buffer upload
  3. Backend initialization detects available hardware and allocates Transformer layers
  4. Warmup warms up the GPU and probes Flash Attention support
  5. Tokenization converts human text into a token sequence
  6. Batch/Ubatch organizes the token sequence into compute units
  7. Build Graph constructs the architecture-specific compute graph through the build_<arch>() dispatch
  8. Backend Scheduling splits the compute graph and assigns the resulting pieces to the available devices
  9. Op Fusion applies hardware-specific graph optimizations within each backend
  10. Tensor Allocation minimizes VRAM usage through reference counting
  11. Execution iterates over splits, copies data across devices, and computes asynchronously
  12. Sampling selects the next token from the probability distribution
  13. Speculative Decoding uses a small draft model to propose tokens that the large model verifies in a single pass
  14. Context Management maintains the lifecycle of the KV cache

Steps 1-4 are a one-time startup phase; step 5 runs on each new request; steps 6-11 loop during the Prefill and Decode phases (Prefill processes the entire prompt, Decode processes only 1 new token at a time); steps 12-14 run after each Decode step.
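The loop structure of steps 6-12 can be sketched as follows. All names here are illustrative stand-ins, not the real llama.cpp API: the point is that prefill pushes the whole prompt through the graph in one batch, while decode feeds back exactly one sampled token per iteration, reusing the KV cache.

```cpp
#include <vector>

using Token = int;

// Stub standing in for the whole graph-build/execute/sample path
// (steps 7-12): here it just predicts "last token + 1".
Token forward_and_sample(const std::vector<Token>& batch) {
    return batch.back() + 1;
}

std::vector<Token> generate(std::vector<Token> prompt, int max_new, Token eos) {
    std::vector<Token> out;
    // Prefill: the entire prompt goes through the graph as one batch,
    // populating the KV cache; only the last position's logits are sampled.
    Token next = forward_and_sample(prompt);
    while ((int)out.size() < max_new && next != eos) {
        out.push_back(next);
        // Decode: each step processes exactly one new token, attending
        // over the KV cache written by all earlier steps.
        next = forward_and_sample({next});
    }
    return out;
}
```

With the stub model, `generate({1, 2}, 10, 6)` prefills on `{1, 2}`, then decodes `3, 4, 5` and stops when the end-of-sequence token `6` is produced.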

Performance Characteristics by Stage

| Stage | Bottleneck | Typical Latency |
| --- | --- | --- |
| Model Loading | Disk I/O (mmap) or PCIe bandwidth (GPU upload) | Seconds |
| Warmup | GPU kernel compilation + VRAM allocation | Seconds |
| Prefill | Compute-bound (matrix multiplication), benefits from GPU parallelism | Proportional to prompt length |
| Decode | Memory-bound (KV cache reads), processes only 1 token at a time | Tens of milliseconds per token |
| Sampling | Runs on CPU, usually not a bottleneck | Microseconds |
| Context Shift | KV cache metadata operations + K-shift | Milliseconds |
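Sampling is cheap because it operates on a single logits vector on the CPU. The sketch below mirrors the *shape* of a sampler chain (top-k filtering, then temperature softmax), not llama.cpp's actual sampler code; the function names are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

// Keep only the k largest logits; mask the rest with -inf so softmax
// assigns them zero probability.
void top_k(std::vector<float>& logits, int k) {
    std::vector<float> sorted = logits;
    std::nth_element(sorted.begin(), sorted.begin() + (k - 1), sorted.end(),
                     std::greater<float>());
    const float cutoff = sorted[k - 1];  // k-th largest logit
    for (float& l : logits)
        if (l < cutoff) l = -INFINITY;
}

// Numerically stable softmax with temperature: lower temperature
// sharpens the distribution, higher flattens it.
std::vector<float> softmax(const std::vector<float>& logits, float temp) {
    std::vector<float> p(logits.size());
    const float maxl = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp((logits[i] - maxl) / temp);  // exp(-inf) == 0
        sum += p[i];
    }
    for (float& x : p) x /= sum;
    return p;
}
```

A full chain would append further stages (top-p, repetition penalties, grammar constraints) in the same filter-then-normalize style before drawing the final token.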

Prefill and Decode have fundamentally different performance characteristics: Prefill is compute-bound (large matrix multiplications that can be highly parallelized), while Decode is memory-bound (only a single token’s vector computes attention against the entire KV cache, reducing matrix multiplication to matrix-vector multiplication). This is why llama.cpp uses different threading strategies for the two (n_threads_batch vs n_threads).
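A back-of-the-envelope calculation makes this distinction concrete. For one d × d weight matrix applied to n tokens, the FLOP count grows with n while the weight bytes read stay fixed, so arithmetic intensity collapses from n (prefill) to 1 (decode). These are illustrative numbers only, assuming fp16 weights and ignoring activation traffic:

```cpp
// Arithmetic intensity (FLOPs per byte of weights read) for one d x d
// weight matrix applied to n tokens. Each weight contributes one
// multiply-add (2 FLOPs) per token; each fp16 weight is 2 bytes.
double flops_per_byte(long long n_tokens, long long d) {
    const double flops = 2.0 * n_tokens * d * d;
    const double bytes = 2.0 * d * d;
    return flops / bytes;  // simplifies exactly to n_tokens
}

// flops_per_byte(512, 4096) -> 512 : prefill reuses each weight byte
//                                    512 times, so compute dominates
// flops_per_byte(1,   4096) -> 1   : decode streams every weight for a
//                                    single token, so bandwidth dominates
```

Modern GPUs sustain on the order of 100+ fp16 FLOPs per byte of memory bandwidth, so decode at ~1 FLOP/byte is firmly bandwidth-limited, which is exactly why it is served by a lighter threading configuration (n_threads) than prefill (n_threads_batch).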

Series Navigation

This series consists of 8 articles, progressively diving from the overview into each subsystem.

Summary

llama.cpp is not just an inference engine: it is a carefully designed system that balances generality (125 known architectures, multiple quantization formats, cross-platform backends) with performance (graph reuse, op fusion, pipeline parallelism, MoE selective copy, speculative decoding).

The next seven articles will dissect the source code implementation of each stage one by one. Reading in series order is recommended, as later articles reference concepts established in earlier ones. If you are particularly interested in a specific subsystem (such as sampling or KV cache), you can also jump directly to it — each article is self-contained and will recap prerequisite knowledge where necessary.