Compute Graph Construction & Architecture Dispatch
Updated 2026-04-15
Series context: This is article #5 in the llama.cpp source walkthrough series. llama.cpp’s compute graph is built on GGML’s lazy evaluation model. If you are not yet familiar with that model, consider reading Compute Graphs & Inference Engines first. This article dives straight into the C++ implementation of build_graph().
Before each ubatch enters computation, a computation graph must be constructed — it describes all tensor operations and their dependency relationships. llama.cpp supports 125 different model architectures, each with its own graph topology, but they all share a common set of reusable “building block” interfaces.
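As a refresher on that lazy model, here is a minimal standalone GGML sketch (toy sizes and memory budget; illustrative only, not llama.cpp code):
// Illustrative only: a toy GGML graph, not llama.cpp code.
struct ggml_init_params ip = {
    /*.mem_size   =*/ 16*1024*1024,
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ false,
};
struct ggml_context * ctx = ggml_init(ip);

struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);  // records the node, computes nothing

struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, c);                  // pulls in a and b as dependencies
// ... only a later ggml_graph_compute*() call actually runs the math ...
Nothing is computed while the graph is being described; llama.cpp exploits exactly this property to describe an entire transformer forward pass before any backend executes it.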
Part A: 125 Architectures, One Entry Point
The llm_arch Enum
src/llama-arch.h defines the llm_arch enum, listing all supported model architectures:
enum llm_arch {
LLM_ARCH_CLIP,
LLM_ARCH_LLAMA,
LLM_ARCH_LLAMA4,
LLM_ARCH_FALCON,
LLM_ARCH_GPT2,
LLM_ARCH_GPTJ,
LLM_ARCH_GPTNEOX,
LLM_ARCH_BERT,
LLM_ARCH_QWEN2,
LLM_ARCH_GEMMA3,
// ... 125 known architectures total ...
LLM_ARCH_UNKNOWN,
};
Each architecture corresponds to a separate implementation file under src/models/ (113 .cpp files in total), such as llama.cpp, gpt2.cpp, qwen2.cpp, etc.
build_graph(): The Giant Switch Dispatch
All architectures converge at a single entry point — llama_model::build_graph() (located in src/llama-model.cpp):
ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
std::unique_ptr<llm_graph_context> llm;
switch (arch) {
case LLM_ARCH_LLAMA:
llm = std::make_unique<llm_build_llama<false>>(*this, params);
break;
case LLM_ARCH_GPT2:
llm = std::make_unique<llm_build_gpt2>(*this, params);
break;
case LLM_ARCH_QWEN2:
llm = std::make_unique<llm_build_qwen2>(*this, params);
break;
// ... 125 cases ...
default:
GGML_ABORT("fatal error");
}
// Post-processing shared by all architectures
llm->build_pooling(cls, cls_b, cls_out, cls_out_b, cls_norm);
llm->build_sampling();
llm->build_dense_out(dense_2_out_layers, ...);
llm->res->set_outputs();
return llm->res->get_gf();
}
The key insight: each case creates an architecture-specific graph builder (e.g., llm_build_llama, llm_build_gpt2), and the compute graph is constructed inside its constructor. After construction, build_graph() appends shared post-processing nodes such as pooling, sampling, etc.
Part B: The llm_graph_context Building Block Interface
All architecture-specific graph builders inherit from the llm_graph_context base class (defined in src/llama-graph.h). This base class provides a set of reusable “building block” methods.
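Schematically, the inheritance hierarchy looks like this (a simplified sketch listing only the builders mentioned above; in reality there is one builder per architecture):
llm_graph_context              (src/llama-graph.h)
├── llm_build_llama            (src/models/llama.cpp)
├── llm_build_gpt2             (src/models/gpt2.cpp)
├── llm_build_qwen2            (src/models/qwen2.cpp)
└── ... one builder per architecture ...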
Core Building Block Methods
| Method | Function | Supported Variants |
|---|---|---|
| `build_norm()` | Normalization | `LLM_NORM` (standard LayerNorm), `LLM_NORM_RMS` (RMSNorm) |
| `build_attn()` | Attention | KV cache, no cache (encoder), iSWA (interleaved sliding-window attention), cross-attention |
| `build_ffn()` | Feed-forward network | SiLU, GELU, ReLU, and other activations; parallel/sequential gate |
| `build_moe_ffn()` | MoE FFN | softmax/top-k gating, on-demand expert loading |
Each architecture’s graph builder calls these building block methods in its constructor, assembling the complete compute graph according to that architecture’s transformer structure. The building block methods use enum parameters (e.g., LLM_NORM_RMS, LLM_FFN_SILU, LLM_FFN_PAR) to select variants, so different architectures simply pass different parameters to reuse the same underlying implementation.
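For intuition, the activation selection inside build_ffn() boils down to a switch over the enum. The sketch below is a hypothetical simplification of that dispatch pattern, not the exact code in src/llama-graph.cpp; the enum values are real, the surrounding code is paraphrased:
// Hypothetical simplification of the dispatch inside build_ffn().
switch (type_op) {
    case LLM_FFN_SILU: cur = ggml_silu(ctx0, cur); break;
    case LLM_FFN_GELU: cur = ggml_gelu(ctx0, cur); break;
    case LLM_FFN_RELU: cur = ggml_relu(ctx0, cur); break;
    // ... other activation variants ...
}
// LLM_FFN_PAR computes the gate branch in parallel with the up projection and
// multiplies the two; LLM_FFN_SEQ chains the projections one after another.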
Part C: Concrete Comparison — Llama vs GPT-2
Comparing two classic architectures side by side gives an intuitive understanding of how the building blocks are composed.
Llama Architecture
The Llama graph builder in src/models/llama.cpp:
llm_build_llama::llm_build_llama(const llama_model & model,
const llm_graph_params & params)
: llm_graph_context(params) {
// Input layer
inpL = build_inp_embd(model.tok_embd);
auto * inp_pos = build_inp_pos();
auto * inp_attn = build_attn_inp_kv(); // KV cache attention
for (int il = 0; il < n_layer; ++il) {
// Attention branch
cur = build_norm(inpL, model.layers[il].attn_norm, NULL,
LLM_NORM_RMS, il); // ← RMSNorm
Qcur = build_lora_mm(model.layers[il].wq, cur, ...); // Q = Wx + LoRA
Kcur = build_lora_mm(model.layers[il].wk, cur, ...);
Vcur = build_lora_mm(model.layers[il].wv, cur, ...);
Qcur = ggml_rope_ext(ctx0, Qcur, inp_pos, ...); // ← RoPE positional encoding
Kcur = ggml_rope_ext(ctx0, Kcur, inp_pos, ...);
cur = build_attn(inp_attn, model.layers[il].wo, NULL,
Qcur, Kcur, Vcur, ..., kq_scale, il);
// Residual connection + FFN
ffn_inp = ggml_add(ctx0, cur, inpL);
cur = build_norm(ffn_inp, model.layers[il].ffn_norm, NULL,
LLM_NORM_RMS, il);
if (model.layers[il].ffn_gate_inp == nullptr) {
cur = build_ffn(cur, ..., LLM_FFN_SILU, LLM_FFN_PAR, il); // ← SiLU + parallel gate
} else {
cur = build_moe_ffn(cur, ..., LLM_FFN_SILU, ...); // MoE variant
}
inpL = ggml_add(ctx0, cur, ffn_inp); // Residual
}
// Output
cur = build_norm(inpL, model.output_norm, NULL, LLM_NORM_RMS, -1);
cur = build_lora_mm(model.output, cur); // logits
res->t_logits = cur;
ggml_build_forward_expand(gf, cur);
}
Llama’s hallmarks: RMSNorm normalization, RoPE rotary positional encoding (applied to Q and K), separate Q/K/V projections (three matrix multiplications), SiLU + parallel gate FFN, and support for MoE variants.
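For reference, the elided ggml_rope_ext() arguments carry the RoPE hyperparameters. A hedged sketch of the call shape (parameter order follows ggml.h; the variable names here stand in for values read from the model's hparams):
Qcur = ggml_rope_ext(ctx0, Qcur, inp_pos, rope_factors,  // rope_factors may be NULL
        n_rot, rope_type, n_ctx_orig,
        freq_base, freq_scale,
        ext_factor, attn_factor, beta_fast, beta_slow);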
GPT-2 Architecture
The GPT-2 graph builder in src/models/gpt2.cpp — same building blocks, different composition:
llm_build_gpt2::llm_build_gpt2(const llama_model & model,
const llm_graph_params & params)
: llm_graph_context(params) {
inpL = build_inp_embd(model.tok_embd);
auto * inp_pos = build_inp_pos();
// GPT-2 uses learned positional embeddings, not RoPE
pos = ggml_get_rows(ctx0, model.pos_embd, inp_pos);
inpL = ggml_add(ctx0, inpL, pos);
auto * inp_attn = build_attn_inp_kv();
for (int il = 0; il < n_layer; ++il) {
cur = build_norm(inpL, model.layers[il].attn_norm,
model.layers[il].attn_norm_b,
LLM_NORM, il); // ← Standard LayerNorm (not RMSNorm)
// GPT-2 uses a fused QKV projection (single matrix multiplication)
cur = build_lora_mm(model.layers[il].wqkv, cur);
cur = ggml_add(ctx0, cur, model.layers[il].bqkv);
// Split QKV using views (zero-copy)
Qcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head, n_tokens, ...);
Kcur = ggml_view_3d(ctx0, cur, ...);
Vcur = ggml_view_3d(ctx0, cur, ...);
cur = build_attn(inp_attn, ..., 1.0f/sqrtf(float(n_embd_head)), il);
ffn_inp = ggml_add(ctx0, cur, inpL);
cur = build_norm(ffn_inp, ..., LLM_NORM, il); // ← Standard LayerNorm
cur = build_ffn(cur, ..., LLM_FFN_GELU, LLM_FFN_SEQ, il); // ← GELU + sequential gate
inpL = ggml_add(ctx0, cur, ffn_inp);
}
cur = build_norm(inpL, model.output_norm, model.output_norm_b, LLM_NORM, -1);
cur = build_lora_mm(model.output, cur);
res->t_logits = cur;
ggml_build_forward_expand(gf, cur);
}
GPT-2’s hallmarks: standard LayerNorm (with bias parameters), learned positional embeddings (added directly to input), fused QKV single projection (split via ggml_view_3d with zero copy), and GELU + sequential FFN.
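For reference, the elided view arguments encode strides and byte offsets into the fused QKV buffer. A hedged sketch assuming plain multi-head attention (n_head == n_head_kv) with a [Q | K | V] layout along the embedding dimension; the exact offsets in gpt2.cpp may differ:
Qcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head, n_tokens,
        n_embd_head*sizeof(float), cur->nb[1], 0*sizeof(float)*n_embd);
Kcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head, n_tokens,
        n_embd_head*sizeof(float), cur->nb[1], 1*sizeof(float)*n_embd);
Vcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head, n_tokens,
        n_embd_head*sizeof(float), cur->nb[1], 2*sizeof(float)*n_embd);
// All three are views into the same buffer; no data is copied.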
Key Differences
| Feature | Llama | GPT-2 |
|---|---|---|
| Normalization | RMSNorm | Standard LayerNorm |
| Positional encoding | RoPE (Rotary Positional Encoding) | Learned positional embeddings |
| QKV projection | Separate Q, K, V projections | Fused QKV single projection |
| FFN activation | SiLU + parallel gate | GELU + sequential |
| Bias | Typically no bias | Has bias |
But from a code structure perspective, the two share an identical framework: input -> layer loop (norm -> attn -> residual -> norm -> ffn -> residual) -> output. This is the power of the building block design.
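Distilled to pseudocode, the shared skeleton looks like this (schematic only; argument lists omitted):
inpL = build_inp_embd(model.tok_embd);
for (int il = 0; il < n_layer; ++il) {
    cur     = build_norm(inpL, ...);         // pre-attention norm
    cur     = build_attn(...);               // attention
    ffn_inp = ggml_add(ctx0, cur, inpL);     // residual
    cur     = build_norm(ffn_inp, ...);      // pre-FFN norm
    cur     = build_ffn(cur, ...);           // feed-forward
    inpL    = ggml_add(ctx0, cur, ffn_inp);  // residual
}
cur = build_norm(inpL, ...);                 // final norm
cur = build_lora_mm(model.output, cur);      // logits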
Part D: Graph Reuse Mechanism
Constructing a compute graph has non-trivial overhead (creating ggml tensor objects, establishing dependency relationships), especially during the autoregressive decoding phase where each ubatch may contain only a few tokens and the graph topology is exactly the same. llama.cpp avoids redundant construction through a graph reuse mechanism.
Reuse Logic in process_ubatch()
Inside process_ubatch():
// src/llama-context.cpp (simplified)
llm_graph_result * llama_context::process_ubatch(
const llama_ubatch & ubatch, ...) {
auto * res = gf_res_prev.get(); // Previous graph result
const auto gparams = graph_params(res, ubatch, mctx, gtype);
if (!graph_reuse_disable && res->can_reuse(gparams)) {
// Graph topology unchanged — reuse directly! Only update input data
n_reused++;
} else {
// Need to rebuild the graph
res->reset();
ggml_backend_sched_reset(sched.get());
auto * gf = model.build_graph(gparams); // Rebuild
ggml_backend_sched_alloc_graph(sched.get(), gf); // Reallocate
}
res->set_inputs(&ubatch); // Set new input data regardless of reuse
graph_compute(res->get_gf(), ubatch.n_tokens > 1); // Execute
return res;
}
Reuse Conditions
can_reuse() checks the following conditions:
- Same ubatch shape: `n_tokens`, `n_seq_tokens`, `n_seqs` are identical
- Same input mode: both are tokens or both are embeddings
- Unchanged sequence IDs: under `equal_seqs` mode
- Same number of outputs
- Unchanged model configuration: `causal_attn`, `embeddings`, LoRA, etc.
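To make the structure concrete, here is a hedged sketch of what the check plausibly looks like (a hypothetical simplification; member and helper names are assumed for illustration, not copied from the source):
// Hypothetical simplification: global parameters first, then every graph
// input re-checks its own shape.
bool llm_graph_result::can_reuse(const llm_graph_params & params) {
    if (!this->params.allow_reuse(params)) {
        return false; // ubatch shape / configuration mismatch
    }
    bool ok = true;
    for (auto & input : inputs) {
        ok &= input->can_reuse(params); // per-input shape checks
    }
    return ok;
}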
During typical autoregressive decoding (generating 1 token at a time), consecutive ubatches usually have exactly the same shape, so the graph reuse rate is very high.
Summary
This article walked through the complete path of compute graph construction in llama.cpp:
- One entry point: `build_graph()` dispatches 125 architectures to their respective graph builders via a giant switch statement
- Shared building blocks: all graph builders inherit from `llm_graph_context`, reusing methods like `build_norm`/`build_attn`/`build_ffn`
- Flexible composition: different architectures select normalization types, activation functions, projection strategies, and other variants by passing different enum parameters
- Graph reuse: during the autoregressive decoding phase, ubatches with unchanged shapes can skip graph construction and directly reuse the previous compute graph
The next article covers #6 Backend Scheduling & Memory Management, exploring how the constructed compute graph is assigned to CPU/GPU backends and executed.