
Compute Graph Construction & Architecture Dispatch

Updated 2026-04-15

Series context: This is article #5 in the llama.cpp source walkthrough series. llama.cpp’s compute graph is built on GGML’s lazy evaluation model. If you are not yet familiar with that model, consider reading Compute Graphs & Inference Engines first. This article dives straight into the C++ implementation of build_graph().

Before each ubatch is computed, a compute graph must be constructed that describes all tensor operations and their dependencies. llama.cpp supports 125 different model architectures, each with its own graph topology, yet they all share a common set of reusable “building block” interfaces.


Part A: 125 Architectures, One Entry Point

The llm_arch Enum

src/llama-arch.h defines the llm_arch enum, listing all supported model architectures:

enum llm_arch {
    LLM_ARCH_CLIP,
    LLM_ARCH_LLAMA,
    LLM_ARCH_LLAMA4,
    LLM_ARCH_FALCON,
    LLM_ARCH_GPT2,
    LLM_ARCH_GPTJ,
    LLM_ARCH_GPTNEOX,
    LLM_ARCH_BERT,
    LLM_ARCH_QWEN2,
    LLM_ARCH_GEMMA3,
    // ... 125 known architectures total ...
    LLM_ARCH_UNKNOWN,
};

Each architecture’s graph builder is implemented under src/models/ (113 .cpp files in total, with some files covering several related architectures), such as llama.cpp, gpt2.cpp, and qwen2.cpp.
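At load time, the enum value is resolved from the GGUF metadata key general.architecture, which stores the architecture name as a string. A minimal sketch of that lookup, with an illustrative map (the real table in src/llama-arch.cpp is keyed the other way around, enum → name, and searched by llm_arch_from_string()):

// Illustrative sketch: resolve the GGUF architecture string to the enum.
// The map below is hypothetical; the real table lives in src/llama-arch.cpp.
#include <map>
#include <string>

static const std::map<std::string, llm_arch> LLM_ARCH_BY_NAME = {
    { "llama", LLM_ARCH_LLAMA },
    { "gpt2",  LLM_ARCH_GPT2  },
    { "qwen2", LLM_ARCH_QWEN2 },
    // ... one entry per supported architecture ...
};

llm_arch llm_arch_from_string(const std::string & name) {
    const auto it = LLM_ARCH_BY_NAME.find(name);
    return it == LLM_ARCH_BY_NAME.end() ? LLM_ARCH_UNKNOWN : it->second;
}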

build_graph(): The Giant Switch Dispatch

All architectures converge at a single entry point — llama_model::build_graph() (located in src/llama-model.cpp):

ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
    std::unique_ptr<llm_graph_context> llm;

    switch (arch) {
        case LLM_ARCH_LLAMA:
            llm = std::make_unique<llm_build_llama<false>>(*this, params);
            break;
        case LLM_ARCH_GPT2:
            llm = std::make_unique<llm_build_gpt2>(*this, params);
            break;
        case LLM_ARCH_QWEN2:
            llm = std::make_unique<llm_build_qwen2>(*this, params);
            break;
        // ... 125 cases ...
        default:
            GGML_ABORT("fatal error");
    }

    // Post-processing shared by all architectures
    llm->build_pooling(cls, cls_b, cls_out, cls_out_b, cls_norm);
    llm->build_sampling();
    llm->build_dense_out(dense_2_out_layers, ...);

    llm->res->set_outputs();
    return llm->res->get_gf();
}

The key insight: each case creates an architecture-specific graph builder (e.g., llm_build_llama, llm_build_gpt2), and the compute graph is constructed inside its constructor. After construction, build_graph() appends post-processing nodes shared by all architectures, such as pooling and sampling.
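This makes adding an architecture mechanical. A hedged sketch of what a hypothetical new model would add (all names here are invented for illustration):

// Hypothetical new architecture following the same pattern:
// the entire graph is assembled in the constructor.
struct llm_build_mymodel : public llm_graph_context {
    llm_build_mymodel(const llama_model & model, const llm_graph_params & params)
        : llm_graph_context(params) {
        // build_inp_embd() → layer loop of build_norm()/build_attn()/build_ffn()
        // → final norm and output projection, as shown in Part C below
    }
};

// plus one more case in the build_graph() switch:
//     case LLM_ARCH_MYMODEL:
//         llm = std::make_unique<llm_build_mymodel>(*this, params);
//         break;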

build_graph() Dispatch Flow

build_graph() (llama_model method)
  → switch (arch): 125 case branches
  → llm_build_llama / llm_build_gpt2 / ... (graph built in each constructor)
  → post-processing: pooling / sampling / dense_out
  → set_outputs() → get_gf(): return the compute graph

Part B: The llm_graph_context Building Block Interface

All architecture-specific graph builders inherit from the llm_graph_context base class (defined in src/llama-graph.h). This base class provides a set of reusable “building block” methods. The inheritance hierarchy is as follows:

Graph Builder Inheritance Hierarchy

llm_graph_context (base class: provides all building block methods)
  ├── llm_build_llama (Llama family)
  ├── llm_build_gpt2 (GPT-2)
  ├── llm_build_qwen2 (Qwen2 family)
  └── ... (one builder per architecture)

Core Building Block Methods

| Method | Function | Supported variants |
| --- | --- | --- |
| build_norm() | Normalization | LLM_NORM (standard LayerNorm), LLM_NORM_RMS (RMSNorm) |
| build_attn() | Attention | KV cache, no cache (encoder), iSWA (sliding window), cross-attention |
| build_ffn() | Feed-forward network | SiLU, GELU, ReLU, and other activations; parallel/sequential gating |
| build_moe_ffn() | MoE FFN | softmax/top-k gating, on-demand expert loading |

Each architecture’s graph builder calls these building block methods in its constructor, assembling the complete compute graph according to that architecture’s transformer structure. The building block methods use enum parameters (e.g., LLM_NORM_RMS, LLM_FFN_SILU, LLM_FFN_PAR) to select variants, so different architectures simply pass different parameters to reuse the same underlying implementation.
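For example, build_norm() switches on the norm type and then optionally applies the scale and bias tensors. A simplified sketch, assuming the actual implementation in src/llama-graph.cpp handles additional cases (e.g., group norm) and callback hooks:

// Simplified sketch of build_norm(); the real implementation in
// src/llama-graph.cpp covers more norm types and invokes debug callbacks.
ggml_tensor * llm_graph_context::build_norm(
        ggml_tensor * cur,
        ggml_tensor * mw,        // scale weights (may be NULL)
        ggml_tensor * mb,        // bias weights (may be NULL)
        llm_norm_type type,
        int il) const {
    switch (type) {
        case LLM_NORM:     cur = ggml_norm    (ctx0, cur, hparams.f_norm_eps);     break;
        case LLM_NORM_RMS: cur = ggml_rms_norm(ctx0, cur, hparams.f_norm_rms_eps); break;
        default:           GGML_ABORT("unsupported norm type");
    }
    if (mw) { cur = ggml_mul(ctx0, cur, mw); }  // elementwise scale
    if (mb) { cur = ggml_add(ctx0, cur, mb); }  // bias: standard LayerNorm only
    return cur;
}

Llama passes NULL biases and LLM_NORM_RMS; GPT-2 passes attn_norm_b/output_norm_b and LLM_NORM. That is exactly the difference visible in Part C below.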


Part C: Concrete Comparison — Llama vs GPT-2

Comparing two classic architectures side by side gives an intuitive understanding of how the building blocks are composed.

Llama Architecture

The Llama graph builder in src/models/llama.cpp:

llm_build_llama::llm_build_llama(const llama_model & model,
                                  const llm_graph_params & params)
    : llm_graph_context(params) {

    // Input layer
    inpL = build_inp_embd(model.tok_embd);
    auto * inp_pos  = build_inp_pos();
    auto * inp_attn = build_attn_inp_kv();  // KV cache attention

    for (int il = 0; il < n_layer; ++il) {
        // Attention branch
        cur = build_norm(inpL, model.layers[il].attn_norm, NULL,
                         LLM_NORM_RMS, il);                    // ← RMSNorm

        Qcur = build_lora_mm(model.layers[il].wq, cur, ...);  // Q = Wx + LoRA
        Kcur = build_lora_mm(model.layers[il].wk, cur, ...);
        Vcur = build_lora_mm(model.layers[il].wv, cur, ...);

        Qcur = ggml_rope_ext(ctx0, Qcur, inp_pos, ...);       // ← RoPE positional encoding
        Kcur = ggml_rope_ext(ctx0, Kcur, inp_pos, ...);

        cur = build_attn(inp_attn, model.layers[il].wo, NULL,
                         Qcur, Kcur, Vcur, ..., kq_scale, il);

        // Residual connection + FFN
        ffn_inp = ggml_add(ctx0, cur, inpL);
        cur = build_norm(ffn_inp, model.layers[il].ffn_norm, NULL,
                         LLM_NORM_RMS, il);

        if (model.layers[il].ffn_gate_inp == nullptr) {
            cur = build_ffn(cur, ..., LLM_FFN_SILU, LLM_FFN_PAR, il);  // ← SiLU + parallel gate
        } else {
            cur = build_moe_ffn(cur, ..., LLM_FFN_SILU, ...);  // MoE variant
        }

        inpL = ggml_add(ctx0, cur, ffn_inp);  // Residual
    }

    // Output
    cur = build_norm(inpL, model.output_norm, NULL, LLM_NORM_RMS, -1);
    cur = build_lora_mm(model.output, cur);   // logits
    res->t_logits = cur;
    ggml_build_forward_expand(gf, cur);
}

Llama’s hallmarks: RMSNorm normalization, RoPE applied to Q and K, separate Q/K/V projections (three matrix multiplications), a SiLU FFN with parallel gating, and support for MoE variants.
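For reference, the standard RoPE formulation that ggml_rope_ext applies (shown here without its frequency-scaling extensions) rotates each dimension pair $(2i, 2i+1)$ of Q and K by an angle proportional to the token position $m$:

$$\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}, \qquad \theta_i = b^{-2i/d}$$

where $d$ is the head dimension and $b$ is the RoPE base (commonly 10000, read from the model's hyperparameters).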

GPT-2 Architecture

The GPT-2 graph builder in src/models/gpt2.cpp — same building blocks, different composition:

llm_build_gpt2::llm_build_gpt2(const llama_model & model,
                                const llm_graph_params & params)
    : llm_graph_context(params) {

    inpL = build_inp_embd(model.tok_embd);
    auto * inp_pos = build_inp_pos();

    // GPT-2 uses learned positional embeddings, not RoPE
    pos = ggml_get_rows(ctx0, model.pos_embd, inp_pos);
    inpL = ggml_add(ctx0, inpL, pos);

    auto * inp_attn = build_attn_inp_kv();

    for (int il = 0; il < n_layer; ++il) {
        cur = build_norm(inpL, model.layers[il].attn_norm,
                         model.layers[il].attn_norm_b,
                         LLM_NORM, il);            // ← Standard LayerNorm (not RMSNorm)

        // GPT-2 uses a fused QKV projection (single matrix multiplication)
        cur = build_lora_mm(model.layers[il].wqkv, cur);
        cur = ggml_add(ctx0, cur, model.layers[il].bqkv);
        // Split QKV using views (zero-copy)
        Qcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head, n_tokens, ...);
        Kcur = ggml_view_3d(ctx0, cur, ...);
        Vcur = ggml_view_3d(ctx0, cur, ...);

        cur = build_attn(inp_attn, ..., 1.0f/sqrtf(float(n_embd_head)), il);

        ffn_inp = ggml_add(ctx0, cur, inpL);
        cur = build_norm(ffn_inp, ..., LLM_NORM, il);  // ← Standard LayerNorm

        cur = build_ffn(cur, ..., LLM_FFN_GELU, LLM_FFN_SEQ, il);  // ← GELU + sequential gate

        inpL = ggml_add(ctx0, cur, ffn_inp);
    }

    cur = build_norm(inpL, model.output_norm, model.output_norm_b, LLM_NORM, -1);
    cur = build_lora_mm(model.output, cur);
    res->t_logits = cur;
    ggml_build_forward_expand(gf, cur);
}

GPT-2’s hallmarks: standard LayerNorm (with bias parameters), learned positional embeddings (added directly to input), fused QKV single projection (split via ggml_view_3d with zero copy), and GELU + sequential FFN.
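The zero-copy split works because ggml views share the underlying buffer: Q, K, and V are just three windows into the fused projection output at different byte offsets. A simplified 2D sketch, assuming f32 data (the real code uses ggml_view_3d to also lay out the head dimension):

// Simplified 2D sketch of the zero-copy QKV split (assumes f32 data);
// cur has shape [3*n_embd, n_tokens] after the fused wqkv matmul + bias.
ggml_tensor * Qcur = ggml_view_2d(ctx0, cur, n_embd, n_tokens,
                                  cur->nb[1], 0*sizeof(float)*n_embd);
ggml_tensor * Kcur = ggml_view_2d(ctx0, cur, n_embd, n_tokens,
                                  cur->nb[1], 1*sizeof(float)*n_embd);
ggml_tensor * Vcur = ggml_view_2d(ctx0, cur, n_embd, n_tokens,
                                  cur->nb[1], 2*sizeof(float)*n_embd);
// No data moves: each view only records an offset and strides into cur's buffer.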

Key Differences

| Feature | Llama | GPT-2 |
| --- | --- | --- |
| Normalization | RMSNorm | Standard LayerNorm |
| Positional encoding | RoPE (rotary positional encoding) | Learned positional embeddings |
| QKV projection | Separate Q, K, V projections | Fused single QKV projection |
| FFN activation | SiLU + parallel gate | GELU + sequential |
| Bias | Typically no bias | Has bias |
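For the first row, the standard definitions make the difference concrete (learned scale $\gamma$, bias $\beta$, feature dimension $n$):

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \mathrm{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \epsilon}}$$

RMSNorm drops the mean-centering and the bias term, which is why the Llama builder passes NULL biases to build_norm() while GPT-2 passes attn_norm_b and output_norm_b.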

But in code structure the two share an identical framework: input → layer loop (norm → attn → residual → norm → ffn → residual) → output. This is the power of the building block design.

Side-by-Side Comparison

Step by step, the two construction flows line up as follows; the last column marks where they diverge:

| Step | Llama | GPT-2 | Same? |
| --- | --- | --- | --- |
| Input embedding | build_inp_embd (token → embedding) | build_inp_embd (token → embedding) | same |
| Position input | build_inp_pos (feeds RoPE) | build_inp_pos + learned positional embedding added to input | differs |
| Pre-attention norm | build_norm (RMSNorm) | build_norm (LayerNorm) | differs |
| QKV projection | separate Q/K/V: build_lora_mm × 3 | fused QKV: build_lora_mm + view split | differs |
| Rotary encoding | ggml_rope_ext on Q and K | none (position already in embedding) | differs |
| Attention + output projection | build_attn | build_attn | same |
| Residual | cur + inpL | cur + inpL | same |
| Pre-FFN norm | build_norm (RMSNorm) | build_norm (LayerNorm) | differs |
| FFN | SiLU, parallel gate: build_ffn / build_moe_ffn | GELU, sequential: build_ffn (LLM_FFN_SEQ) | differs |
| Residual | cur + ffn_inp | cur + ffn_inp | same |
| Final norm | build_norm (RMSNorm) | build_norm (LayerNorm) | differs |
| Output projection | build_lora_mm → logits | build_lora_mm → logits | same |

Part D: Graph Reuse Mechanism

Constructing a compute graph has non-trivial overhead (creating ggml tensor objects, establishing dependencies), which matters most during autoregressive decoding: each ubatch may contain only a few tokens, and consecutive graphs have exactly the same topology. llama.cpp avoids this redundant construction with a graph reuse mechanism.

Reuse Logic in process_ubatch()

Inside process_ubatch():

// src/llama-context.cpp (simplified)
llm_graph_result * llama_context::process_ubatch(
        const llama_ubatch & ubatch, ...) {

    auto * res = gf_res_prev.get();   // Previous graph result
    const auto gparams = graph_params(res, ubatch, mctx, gtype);

    if (!graph_reuse_disable && res->can_reuse(gparams)) {
        // Graph topology unchanged — reuse directly! Only update input data
        n_reused++;
    } else {
        // Need to rebuild the graph
        res->reset();
        ggml_backend_sched_reset(sched.get());
        gf = model.build_graph(gparams);           // Rebuild
        ggml_backend_sched_alloc_graph(sched.get(), gf);  // Reallocate
    }

    res->set_inputs(&ubatch);  // Set new input data regardless of reuse
    graph_compute(res->get_gf(), ubatch.n_tokens > 1);  // Execute
    return res;
}

Reuse Conditions

can_reuse() checks the following conditions:

  • Same ubatch shape: n_tokens, n_seq_tokens, n_seqs are identical
  • Same input mode: both are tokens or both are embeddings
  • Unchanged sequence IDs: under equal_seqs mode
  • Same number of outputs
  • Unchanged model configuration: causal_attn, embeddings, LoRA, etc.

During typical autoregressive decoding (generating 1 token at a time), consecutive ubatches usually have exactly the same shape, so the graph reuse rate is very high.
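A condensed sketch of those checks, with a hypothetical field layout (the real can_reuse() in src/llama-graph.cpp also delegates per-input checks to the graph's input objects):

// Hypothetical condensed version of the conditions listed above; the field
// names follow llama_ubatch/llama_cparams, but the real code covers more state.
static bool can_reuse_sketch(const llm_graph_params & prev,
                             const llm_graph_params & cur) {
    return prev.ubatch.n_tokens     == cur.ubatch.n_tokens
        && prev.ubatch.n_seq_tokens == cur.ubatch.n_seq_tokens
        && prev.ubatch.n_seqs       == cur.ubatch.n_seqs
        // same input mode: both token IDs or both embeddings
        && (prev.ubatch.token != nullptr) == (cur.ubatch.token != nullptr)
        && prev.n_outputs           == cur.n_outputs
        && prev.cparams.causal_attn == cur.cparams.causal_attn
        && prev.cparams.embeddings  == cur.cparams.embeddings;
}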

Graph Reuse Decision Flow

New ubatch arrives → process_ubatch()
  → can_reuse()? (check shape / mode / config)
      yes → reuse the old graph (skip build_graph)
      no  → rebuild: build_graph() + ggml_backend_sched_alloc_graph()
  → set_inputs(): set new input data
  → graph_compute(): execute

Summary

This article walked through the complete path of compute graph construction in llama.cpp:

  1. One entry point: build_graph() dispatches 125 architectures to their respective graph builders via a giant switch statement
  2. Shared building blocks: All graph builders inherit from llm_graph_context, reusing methods like build_norm/build_attn/build_ffn
  3. Flexible composition: Different architectures select normalization types, activation functions, projection strategies, and other variants by passing different enum parameters
  4. Graph reuse: During the autoregressive decoding phase, ubatches with unchanged shapes can skip graph construction and directly reuse the previous compute graph

The next article covers #6 Backend Scheduling & Memory Management, exploring how the constructed compute graph is assigned to CPU/GPU backends and executed.