Compute Graphs and Inference Engines
Updated 2026-04-06
GGML Compute Graph
GGML (Georgi Gerganov Machine Learning) is a tensor library designed specifically for efficient LLM inference on CPUs and GPUs, serving as the core of llama.cpp. Unlike training frameworks such as PyTorch, GGML is fully optimized for inference scenarios, with the following key features:
Core Concepts:
- Tensor: Multi-dimensional arrays supporting multiple quantization types including F32, F16, Q4_0, Q8_0, etc.
- Graph: A directed acyclic graph (DAG) composed of operation nodes (Op Nodes) and data dependency edges
- Context: A memory pool from which all tensors are allocated, with unified lifecycle management
- Backend: An abstraction layer supporting multiple hardware targets including CPU, CUDA, Metal, Vulkan, etc.
Lazy Evaluation:
GGML employs a “build first, execute later” two-phase model. When you call ggml_mul_mat(A, B), GGML does not compute immediately. Instead, it:
- Creates a MatMul node
- Records the dependency relationships of inputs A and B
- Returns a “not yet computed” output tensor
The entire graph is only scheduled for execution when ggml_graph_compute() is called. This design has two major benefits:
- Global Optimization: The entire graph can be analyzed before execution for operator fusion, memory reuse, and other optimizations
- Cross-Device Scheduling: Different nodes can be assigned to CPU/GPU/NPU based on hardware capabilities
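A minimal end-to-end sketch of the two-phase model (based on GGML's public C API; exact function signatures shift between GGML versions, so treat this as illustrative):
#include "ggml.h"
int main(void) {
    // Phase 1: build. Nothing is computed here, only graph nodes are recorded.
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,  // memory pool for tensors and graph metadata
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context* ctx = ggml_init(params);
    struct ggml_tensor* A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor* B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor* C = ggml_mul_mat(ctx, A, B);   // records a MatMul node; C is "not yet computed"
    struct ggml_cgraph* gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, C);                   // collect C and all of its dependencies
    // ... fill A and B with data here ...
    // Phase 2: execute. The whole graph is scheduled at once.
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/ 4);
    ggml_free(ctx);
    return 0;
}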
Static vs Dynamic Graph:
GGML’s compute graph is semi-static:
- Static Structure: Once the model architecture (number of layers, dimensions) is fixed, the graph topology does not change
- Dynamic Input: The number of input tokens (seq_len) can vary between inference runs, and the KV Cache also expands dynamically
This differs from both PyTorch (fully dynamic graph) and TensorRT (fully static graph). GGML maintains flexibility while allowing optimization of fixed patterns.
Graph Construction Process
Using a Qwen3-8B Transformer Block as an example, let’s see how GGML builds the compute graph step by step. Model configuration:
- hidden_size: 4096
- num_attention_heads: 32 (Q)
- num_key_value_heads: 8 (K/V, GQA)
- intermediate_size: 11008 (FFN)
- Activation function: SwiGLU
Pseudocode:
// 1. Create input tensor (assuming seq_len=10)
struct ggml_tensor* input = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 10);
// 2. RMSNorm before Attention
struct ggml_tensor* attn_norm = ggml_rms_norm(ctx, input);
// 3. QKV Projection (GQA: Q = 32 heads × 128 = 4096, K/V = 8 heads × 128 = 1024)
struct ggml_tensor* Q = ggml_mul_mat(ctx, wq, attn_norm); // (10, 4096)
struct ggml_tensor* K = ggml_mul_mat(ctx, wk, attn_norm); // (10, 1024)
struct ggml_tensor* V = ggml_mul_mat(ctx, wv, attn_norm); // (10, 1024)
// 4. RoPE Positional Encoding
Q = ggml_rope_inplace(ctx, Q, freq_base, freq_scale);
K = ggml_rope_inplace(ctx, K, freq_base, freq_scale);
// 5. FlashAttention (fused QK^T, softmax, ×V)
struct ggml_tensor* attn_out = ggml_flash_attn(ctx, Q, K, V, true);
// 6. Attention Output Projection
attn_out = ggml_mul_mat(ctx, wo, attn_out);
// 7. Residual Connection
struct ggml_tensor* h = ggml_add(ctx, input, attn_out);
// 8. RMSNorm before FFN
struct ggml_tensor* ffn_norm = ggml_rms_norm(ctx, h);
// 9. SwiGLU: (Gate × Up) fusion
struct ggml_tensor* gate = ggml_mul_mat(ctx, w_gate, ffn_norm);
struct ggml_tensor* up = ggml_mul_mat(ctx, w_up, ffn_norm);
struct ggml_tensor* ffn_hidden = ggml_mul(ctx, ggml_silu(ctx, gate), up);
// 10. FFN Down Projection + Residual
struct ggml_tensor* ffn_out = ggml_mul_mat(ctx, w_down, ffn_hidden);
struct ggml_tensor* output = ggml_add(ctx, h, ffn_out);
Each of the calls above becomes an operation node in the compute graph, linked by data-dependency edges. Note:
- RMSNorm is the normalization method used by Qwen3 (replacing LayerNorm)
- GQA (Grouped Query Attention): Q has 32 heads, K/V have only 8 heads, saving KV Cache memory
- Residual connections: Add the input directly to the output to prevent gradient vanishing
Operator Fusion
After building the graph, GGML performs Operator Fusion optimization. In the traditional approach, each operation requires:
- Reading input from HBM (High Bandwidth Memory)
- Computing on GPU/CPU
- Writing results back to HBM
When multiple operations execute consecutively, intermediate results are repeatedly read from and written to HBM, creating a massive memory bandwidth bottleneck. Operator fusion merges multiple operations into a single kernel, keeping intermediate results in registers or SRAM, dramatically reducing memory accesses.
Common fusion patterns:
1. FlashAttention Fusion:
The original Attention requires 5 independent steps:
scores = Q @ K.T # (seq, seq) matrix multiplication
scores = scores / sqrt(d) # Scale
scores = scores + mask # Causal Mask
attn = softmax(scores) # Softmax
output = attn @ V # Final matrix multiplication
FlashAttention fuses all steps into a single CUDA kernel, and additionally:
- Tiled computation: Processes one small tile at a time so the working set fits in on-chip SRAM
- Online softmax: No need to store the complete scores matrix
- IO optimization: Intermediate results are not written back to HBM
Result: memory footprint for the score matrix drops from O(N²) to O(N), with a 2-4x speed improvement.
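The “online softmax” part is the key trick and is easy to show in isolation: a running maximum and a running sum are updated as the scores stream past, so the full (seq × seq) score matrix never has to exist. A standalone scalar sketch in plain C (not the actual fused kernel):
#include <math.h>
// One streaming pass computes the softmax normalizer without storing all scores:
// m is the running maximum, s is the running sum of exp(score - m).
void online_softmax_stats(const float* scores, int n, float* out_max, float* out_sum) {
    float m = -INFINITY;
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        float m_new = scores[i] > m ? scores[i] : m;
        s = s * expf(m - m_new) + expf(scores[i] - m_new);  // rescale the partial sum if the max changed
        m = m_new;
    }
    *out_max = m;   // final softmax(x_i) = exp(x_i - m) / s
    *out_sum = s;
}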
2. RMSNorm + MatMul Fusion:
Before Attention/FFN, normalization is always followed by a linear transformation:
x = rms_norm(x, weight_norm);
x = matmul(weight_proj, x);
After fusion, normalization can be completed during MatMul, eliminating one round of global read/write.
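A toy single-threaded version of such a fused kernel shows the idea (illustrative only; real GGML kernels are vectorized and quantization-aware):
#include <math.h>
// Unfused: y = W @ rms_norm(x) writes the normalized x to memory and reads it back.
// Fused: fold the 1/rms scalar into the dot products, so x is read only once per output row.
void fused_rmsnorm_matvec(const float* W,      // [rows x n] projection weights, row-major
                          const float* x,      // [n] input activations
                          const float* g,      // [n] RMSNorm gain (weight)
                          float* y, int rows, int n, float eps) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    const float inv_rms = 1.0f / sqrtf(ss / n + eps);
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int i = 0; i < n; i++) {
            acc += W[r * n + i] * (x[i] * g[i]);  // the normalized vector never hits memory
        }
        y[r] = acc * inv_rms;                     // scale folded in once per output element
    }
}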
3. SwiGLU Fusion:
The SwiGLU activation function involves 4 operations:
gate = linear_gate(x)
up = linear_up(x)
activated = silu(gate) * up
A single fused_swiglu(x, w_gate, w_up) kernel completes all of these computations, avoiding 3 intermediate result stores.
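A scalar sketch of what such a fused_swiglu kernel computes (the function name comes from the text above; real implementations tile the loops and vectorize):
#include <math.h>
// out[j] = silu(w_gate[j,:] @ x) * (w_up[j,:] @ x)
// Both projections and the activation are evaluated per output element,
// so gate and up never exist as full intermediate tensors in memory.
void fused_swiglu(const float* w_gate,  // [d_ff x d_model], row-major
                  const float* w_up,    // [d_ff x d_model], row-major
                  const float* x,       // [d_model]
                  float* out,           // [d_ff]
                  int d_ff, int d_model) {
    for (int j = 0; j < d_ff; j++) {
        float gate = 0.0f, up = 0.0f;
        for (int i = 0; i < d_model; i++) {
            gate += w_gate[j * d_model + i] * x[i];
            up   += w_up  [j * d_model + i] * x[i];
        }
        float silu = gate / (1.0f + expf(-gate));  // SiLU(g) = g * sigmoid(g)
        out[j] = silu * up;
    }
}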
Pattern Matching:
GGML automatically identifies fusible patterns through its Graph Optimizer. For example:
- Detecting an rms_norm → mul_mat sequence → replace it with fused_rmsnorm_matmul
- Detecting rope → rope → flash_attn → inline RoPE into FlashAttention
This process is completed before ggml_graph_compute() and is transparent to the user.
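Conceptually, the pass is a linear scan over the graph's node array. The sketch below is illustrative rather than GGML's actual optimizer code; fuse_rmsnorm_matmul is a hypothetical helper, while the op/src fields are GGML's real tensor layout:
// Scan for an rms_norm node feeding a mul_mat node and rewrite the pair.
for (int i = 0; i < graph->n_nodes; i++) {
    struct ggml_tensor* node = graph->nodes[i];
    if (node->op != GGML_OP_MUL_MAT) continue;
    struct ggml_tensor* act = node->src[1];        // src[0] = weights, src[1] = activations
    if (act != NULL && act->op == GGML_OP_RMS_NORM) {
        fuse_rmsnorm_matmul(graph, i, node, act);  // replace the pair with one fused node
    }
}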
ollamarunner vs llamarunner
Ollama supports two inference engines that build compute graphs in completely different ways:
ollamarunner (Go native, new direction):
- Language: Go, minimizing CGo calls
- Graph construction: Hand-written Forward() methods for each model architecture in Go, directly calling the GGML C API
- Execution mode: Pipelined async execution; prefill and decode can run in parallel
- Supported architectures: ~21 (Llama 3, Qwen3, Gemma 2, Mistral, …)
- Advantages: Greater performance optimization potential, reduced CGo overhead, more aggressive batching possible
- Disadvantages: Each new architecture requires hand-written Go code
llamarunner (C++ binding, fallback):
- Language: C++, bound to llama.cpp’s C API through CGo
- Graph construction: Relies entirely on llama.cpp's llama_decode(), invoked as a black box
- Execution mode: Synchronous execution; waits for the current batch to complete before submitting the next
- Supported architectures: ~120+ (covers all models supported by llama.cpp)
- Advantages: Strongest compatibility, no need for the Ollama team to maintain model code
- Disadvantages: High CGo call overhead, difficult to implement custom optimizations
Why two runners?
Ollama’s strategy is:
- ollamarunner first: For mainstream models (Llama, Qwen, Mistral), hand-written Go implementations pursuing peak performance
- llamarunner fallback: For long-tail models (Phi, CodeLlama, various fine-tuned models), relying on llama.cpp’s broad support
Users don’t need to worry about which runner is used — Ollama automatically selects based on the model file’s architecture field. Both runners ultimately submit compute graphs to the same GGML Backend, so hardware acceleration (CUDA, Metal, Vulkan) implementations are shared.
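The “architecture field” lives in the GGUF metadata under the key general.architecture. Reading it with the gguf API that ships with GGML looks roughly like this (header location and signatures vary between versions; this is not Ollama's actual selection code):
#include "ggml.h"     // gguf_* API ships with GGML
#include <stdint.h>
#include <string.h>
// Returns the architecture string, e.g. "llama", "qwen3", "gemma2"; caller frees.
char* read_architecture(const char* path) {
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ NULL };
    struct gguf_context* gguf = gguf_init_from_file(path, params);
    if (!gguf) return NULL;
    char* arch = NULL;
    int64_t key = gguf_find_key(gguf, "general.architecture");
    if (key >= 0) arch = strdup(gguf_get_val_str(gguf, key));
    gguf_free(gguf);
    return arch;   // NULL if the file has no architecture key
}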
Architecture evolution:
- v0.1.x - v0.2.x: Only llamarunner, fully dependent on llama.cpp
- v0.3.x: Introduced ollamarunner, supporting Llama 3/3.1
- v0.4.x: ollamarunner supports Qwen3, Gemma 2, Phi, etc.
- Future: Plans to migrate more models to ollamarunner, and potentially implement their own backend (decoupled from GGML)
Backend Scheduling
After building the compute graph and completing operator fusion, GGML’s Scheduler is responsible for assigning operations to different hardware backends for execution.
Backend Abstraction:
GGML defines a unified backend interface, shown here in simplified form:
struct ggml_backend {
const char* name; // "CPU", "CUDA", "Metal"
ggml_backend_buffer_t (*alloc_buffer)(size_t);
void (*compute_graph)(ggml_backend_t, struct ggml_cgraph*);
bool (*supports_op)(enum ggml_op);
};
Each backend implements three core capabilities:
- Memory allocation: Allocating tensor storage in CPU RAM / GPU VRAM
- Graph execution: Traversing graph nodes and calling corresponding kernel implementations (cuBLAS, cuDNN, Metal Performance Shaders, …)
- Capability query: Reporting which operations are supported (e.g., whether FlashAttention is implemented for that backend)
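In the actual ggml-backend API these three capabilities show up as concrete calls. A rough usage sketch, using function names from recent GGML releases (which do change over time):
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
// 1. pick a backend, 2. allocate tensor storage in its buffer, 3. run the graph on it
ggml_backend_t backend = ggml_backend_cpu_init();   // or ggml_backend_cuda_init(0) / ggml_backend_metal_init()
struct ggml_init_params params = {
    /*.mem_size   =*/ ggml_tensor_overhead() * 128 + ggml_graph_overhead(),
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,   // tensor data lives in the backend buffer, not in the context
};
struct ggml_context* ctx = ggml_init(params);
struct ggml_tensor* a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
struct ggml_tensor* b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
struct ggml_tensor* c = ggml_add(ctx, a, b);
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);  // memory allocation
struct ggml_cgraph* gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, c);
// upload inputs with ggml_backend_tensor_set(a, data, 0, nbytes), then:
ggml_backend_graph_compute(backend, gf);                                   // graph execution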
Device Assignment:
Ollama supports three modes:
- CPU Only: All operations on CPU, using AVX2/AVX-512 SIMD instructions
- GPU Only: All operations on GPU (CUDA/Metal/Vulkan)
- Hybrid: Model layers distributed across multiple devices
In Hybrid mode, GGML automatically assigns based on memory capacity:
Layer 0 - 10: GPU 0
Layer 11 - 20: GPU 1
Layer 21 - 32: CPU
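The split itself is essentially a greedy fill by free memory. A hypothetical helper (not Ollama's actual code) makes the logic concrete:
#include <stddef.h>
// Greedily place layers on GPUs until each runs out of free memory, then spill to CPU.
// layer_bytes should include the layer's weights plus its share of the KV cache.
void assign_layers(size_t layer_bytes, int n_layers,
                   const size_t* gpu_free_bytes, int n_gpus,
                   int* device_of_layer /* out: GPU index, or -1 for CPU */) {
    int gpu = 0;
    size_t used = 0;
    for (int l = 0; l < n_layers; l++) {
        while (gpu < n_gpus && used + layer_bytes > gpu_free_bytes[gpu]) {
            gpu++;       // current GPU is full, try the next one
            used = 0;
        }
        if (gpu < n_gpus) {
            device_of_layer[l] = gpu;
            used += layer_bytes;
        } else {
            device_of_layer[l] = -1;   // no GPU memory left: fall back to CPU
        }
    }
}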
Cross-Device Data Transfer:
When a node’s inputs come from different devices, GGML automatically inserts transfer nodes:
// Layer 11 runs on the GPU
tensor_gpu = ggml_mul_mat(gpu_backend, w11, x11);
// Layer 12 runs on the CPU, so the activation is copied across the device boundary first
tensor_cpu = ggml_backend_copy(gpu_backend, cpu_backend, tensor_gpu);
x12 = ggml_mul_mat(cpu_backend, w12, tensor_cpu);
Transfers use:
- CUDA → CPU: cudaMemcpyAsync + pinned memory
- CPU → CUDA: cudaMemcpyAsync + page-locked memory
- GPU → GPU: NVLink (if available) or PCIe
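In recent GGML versions, the boundary copy from the snippet above corresponds to a single backend-aware call (the copy helper name used there is illustrative):
// tensor_gpu and tensor_cpu were allocated in a CUDA buffer and a CPU buffer respectively;
// the copy routine picks cudaMemcpy / plain memcpy / peer copy based on where each side lives.
ggml_backend_tensor_copy(tensor_gpu, tensor_cpu);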
Multi-GPU Scheduling:
In multi-GPU environments, GGML supports two parallelism strategies:
- Tensor Parallelism: Splitting a single MatMul across multiple GPUs, e.g. Q @ K^T → [Q1 @ K1^T] on GPU 0, [Q2 @ K2^T] on GPU 1
- Pipeline Parallelism: Assigning different layers to different GPUs, e.g. Layers 0-15 on GPU 0, Layers 16-31 on GPU 1
Ollama uses Pipeline Parallelism by default, as it is simpler to implement with lower communication overhead.
How It Differs
GGML’s design philosophy differs significantly from other mainstream frameworks. Let’s compare:
vs PyTorch (Dynamic Graph Training Framework):
| Feature | GGML | PyTorch |
|---|---|---|
| Graph Type | Semi-static (fixed structure, variable input) | Fully dynamic (Eager Execution) |
| Target Scenario | Inference | Training + Inference |
| Memory Management | Context pool allocation, deterministic lifecycle | Dynamic allocation + GC, requires reference counting |
| Operator Fusion | Automatic compile-time fusion | Requires manual TorchScript/torch.jit |
| Quantization Support | Native Q4/Q8 inference | Requires ONNX Runtime or third-party libraries |
| Hardware Support | Extreme CPU optimization (AVX-512, Neon) | Average CPU performance, primarily GPU-optimized |
PyTorch’s advantage lies in flexibility (dynamic control flow, immediate debugging), but inference performance is inferior to GGML. Typical comparison:
- Llama 3-8B inference (CPU, 1 thread): GGML ~15 tok/s, PyTorch ~3 tok/s
- Memory usage (Q4_K_M): GGML ~4.5GB, PyTorch (FP16) ~16GB
vs TensorRT (NVIDIA Inference Engine):
| Feature | GGML | TensorRT |
|---|---|---|
| Graph Type | Semi-static | Fully static (fixed after compilation) |
| Deployment Flow | Load GGUF directly → run | Export ONNX → compile → serialize → run |
| Hardware Dependency | Cross-platform (CPU/CUDA/Metal/…) | NVIDIA GPU only |
| Quantization Granularity | Per-tensor or per-channel | Per-channel + INT8 calibration |
| Optimization Level | Operator fusion + kernel selection | Operator fusion + kernel tuning + layer fusion + precision calibration |
TensorRT delivers the best performance on NVIDIA GPUs (specifically optimized for Tensor Core), but:
- Long compilation time: 5-30 minutes for the first run
- Limited dynamic shapes: batch_size and seq_len ranges must be declared in advance via optimization profiles
- Complex deployment: Requires ONNX export + calibration dataset
GGML sacrifices some performance in exchange for:
- Instant startup: No compilation needed, load model and start inference directly
- Cross-platform: Same GGUF file runs on Mac/Windows/Linux
- Flexible input: Arbitrary-length input sequences
vs ONNX Runtime (Cross-Framework Inference):
| Feature | GGML | ONNX Runtime |
|---|---|---|
| Model Format | GGUF (custom binary) | ONNX (Protobuf) |
| Quantization Method | GGML native quantization (Q4_K_M, IQ3_XS, …) | INT8/INT4 PTQ + QDQ nodes |
| Backend Selection | Fixed at compile time (CPU/CUDA/Metal) | Runtime selection (EP: CPU/CUDA/TensorRT/OpenVINO/…) |
| Memory Optimization | KV Cache reuse + prefix cache | Depends on backend implementation |
ONNX Runtime's advantage is generality (a wide range of execution providers), but it is less specialized for LLM inference than GGML. For example:
- ONNX Runtime’s INT4 quantization requires exporting per-channel scales, while GGML supports more complex block-wise quantization
- ONNX Runtime’s KV Cache management must be implemented manually by the user, whereas GGML has it built-in
Summary:
GGML’s core competitive advantages are:
- Extreme CPU performance: Usable even without a GPU
- Deep integration of quantization and inference: GGUF format natively supports low-bit inference
- Simple deployment workflow: One file + one command, no complex compilation/optimization needed
This makes GGML (and Ollama) the best choice for edge device and local deployment scenarios. For peak performance, TensorRT or vLLM may be more suitable; for unified training and inference, PyTorch is the better choice. But if the goal is “easily run LLMs on any device,” GGML is currently unmatched.