Compute Graph Construction & Architecture Dispatch
Updated 2026-04-15
Series context: This is article #5 in the llama.cpp source walkthrough series. llama.cpp’s compute graph is built on GGML’s lazy evaluation model. If you are not yet familiar with that model, consider reading Compute Graphs & Inference Engines first. This article dives straight into the C++ implementation of build_graph().
Before each ubatch enters computation, a computation graph must be constructed — it describes all tensor operations and their dependency relationships. llama.cpp supports 125 different model architectures, each with its own graph topology, but they all share a common set of reusable “building block” interfaces.
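As a refresher on that lazy model, here is a minimal standalone GGML sketch (toy sizes and memory budget; illustrative only, not llama.cpp code):
// Illustrative only: a toy GGML graph, not llama.cpp code.
struct ggml_init_params ip = {
    /*.mem_size   =*/ 16*1024*1024,
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ false,
};
struct ggml_context * ctx = ggml_init(ip);

struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);  // records the node, computes nothing

struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, c);                  // pulls in a and b as dependencies
// ... only a later ggml_graph_compute*() call actually runs the math ...
Nothing is computed while the graph is being described; llama.cpp exploits exactly this property to describe an entire transformer forward pass before any backend executes it.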
Part A: 125 Architectures, One Entry Point
The llm_arch Enum
src/llama-arch.h defines the llm_arch enum, listing all supported model architectures:
enum llm_arch {
LLM_ARCH_CLIP,
LLM_ARCH_LLAMA,
LLM_ARCH_LLAMA4,
LLM_ARCH_FALCON,
LLM_ARCH_GPT2,
LLM_ARCH_GPTJ,
LLM_ARCH_GPTNEOX,
LLM_ARCH_BERT,
LLM_ARCH_QWEN2,
LLM_ARCH_GEMMA3,
// ... 125 known architectures total ...
LLM_ARCH_UNKNOWN,
};
Each architecture corresponds to a separate implementation file under src/models/ (113 .cpp files in total), such as llama.cpp, gpt2.cpp, qwen2.cpp, etc.
build_graph(): The Giant Switch Dispatch
All architectures converge at a single entry point — llama_model::build_graph() (located in src/llama-model.cpp):
ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
std::unique_ptr<llm_graph_context> llm;
switch (arch) {
case LLM_ARCH_LLAMA:
llm = std::make_unique<llm_build_llama<false>>(*this, params);
break;
case LLM_ARCH_GPT2:
llm = std::make_unique<llm_build_gpt2>(*this, params);
break;
case LLM_ARCH_QWEN2:
llm = std::make_unique<llm_build_qwen2>(*this, params);
break;
// ... 125 cases ...
default:
GGML_ABORT("fatal error");
}
// Post-processing shared by all architectures
llm->build_pooling(cls, cls_b, cls_out, cls_out_b, cls_norm);
llm->build_sampling();
llm->build_dense_out(dense_2_out_layers, ...);
llm->res->set_outputs();
return llm->res->get_gf();
}
The key insight: each case creates an architecture-specific graph builder (e.g., llm_build_llama, llm_build_gpt2), and the compute graph is constructed inside its constructor. After construction, build_graph() appends shared post-processing nodes such as pooling, sampling, etc.
Part B: The llm_graph_context Building Block Interface
All architecture-specific graph builders inherit from the llm_graph_context base class (defined in src/llama-graph.h). This base class provides a set of reusable “building block” methods.
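Schematically, the inheritance hierarchy looks like this (a simplified sketch listing only the builders mentioned above; in reality there is one builder per architecture):
llm_graph_context              (src/llama-graph.h)
├── llm_build_llama            (src/models/llama.cpp)
├── llm_build_gpt2             (src/models/gpt2.cpp)
├── llm_build_qwen2            (src/models/qwen2.cpp)
└── ... one builder per architecture ...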
Core Building Block Methods
| Method | Function | Supported Variants |
|---|---|---|
| `build_norm()` | Normalization | `LLM_NORM` (standard LayerNorm), `LLM_NORM_RMS` (RMSNorm) |
| `build_attn()` | Attention | KV cache, no cache (encoder), iSWA (interleaved sliding-window attention), cross-attention |
| `build_ffn()` | Feed-forward network | SiLU, GELU, ReLU, and other activations; parallel/sequential gate |
| `build_moe_ffn()` | MoE FFN | softmax/top-k gating, on-demand expert loading |
Each architecture’s graph builder calls these building block methods in its constructor, assembling the complete compute graph according to that architecture’s transformer structure. The building block methods use enum parameters (e.g., LLM_NORM_RMS, LLM_FFN_SILU, LLM_FFN_PAR) to select variants, so different architectures simply pass different parameters to reuse the same underlying implementation.
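For intuition, the activation selection inside build_ffn() boils down to a switch over the enum. The sketch below is a hypothetical simplification of that dispatch pattern, not the exact code in src/llama-graph.cpp; the enum values are real, the surrounding code is paraphrased:
// Hypothetical simplification of the dispatch inside build_ffn().
switch (type_op) {
    case LLM_FFN_SILU: cur = ggml_silu(ctx0, cur); break;
    case LLM_FFN_GELU: cur = ggml_gelu(ctx0, cur); break;
    case LLM_FFN_RELU: cur = ggml_relu(ctx0, cur); break;
    // ... other activation variants ...
}
// LLM_FFN_PAR computes the gate branch in parallel with the up projection and
// multiplies the two; LLM_FFN_SEQ chains the projections one after another.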
Part C: Concrete Comparison — Llama vs GPT-2
Comparing two classic architectures side by side gives an intuitive understanding of how the building blocks are composed.
Llama Architecture
The Llama graph builder in src/models/llama.cpp:
llm_build_llama::llm_build_llama(const llama_model & model,
const llm_graph_params & params)
: llm_graph_context(params) {
// Input layer
inpL = build_inp_embd(model.tok_embd);
auto * inp_pos = build_inp_pos();
auto * inp_attn = build_attn_inp_kv(); // KV cache attention
for (int il = 0; il < n_layer; ++il) {
// Attention branch
cur = build_norm(inpL, model.layers[il].attn_norm, NULL,
LLM_NORM_RMS, il); // ← RMSNorm
Qcur = build_lora_mm(model.layers[il].wq, cur, ...); // Q = Wx + LoRA
Kcur = build_lora_mm(model.layers[il].wk, cur, ...);
Vcur = build_lora_mm(model.layers[il].wv, cur, ...);
Qcur = ggml_rope_ext(ctx0, Qcur, inp_pos, ...); // ← RoPE positional encoding
Kcur = ggml_rope_ext(ctx0, Kcur, inp_pos, ...);
cur = build_attn(inp_attn, model.layers[il].wo, NULL,
Qcur, Kcur, Vcur, ..., kq_scale, il);
// Residual connection + FFN
ffn_inp = ggml_add(ctx0, cur, inpL);
cur = build_norm(ffn_inp, model.layers[il].ffn_norm, NULL,
LLM_NORM_RMS, il);
if (model.layers[il].ffn_gate_inp == nullptr) {
cur = build_ffn(cur, ..., LLM_FFN_SILU, LLM_FFN_PAR, il); // ← SiLU + parallel gate
} else {
cur = build_moe_ffn(cur, ..., LLM_FFN_SILU, ...); // MoE variant
}
inpL = ggml_add(ctx0, cur, ffn_inp); // Residual
}
// Output
cur = build_norm(inpL, model.output_norm, NULL, LLM_NORM_RMS, -1);
cur = build_lora_mm(model.output, cur); // logits
res->t_logits = cur;
ggml_build_forward_expand(gf, cur);
}
Llama’s hallmarks: RMSNorm normalization, RoPE rotary positional encoding (applied to Q and K), separate Q/K/V projections (three matrix multiplications), SiLU + parallel gate FFN, and support for MoE variants.
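For reference, the elided ggml_rope_ext() arguments carry the RoPE hyperparameters. A hedged sketch of the call shape (parameter order follows ggml.h; the variable names here stand in for values read from the model's hparams):
Qcur = ggml_rope_ext(ctx0, Qcur, inp_pos, rope_factors,  // rope_factors may be NULL
        n_rot, rope_type, n_ctx_orig,
        freq_base, freq_scale,
        ext_factor, attn_factor, beta_fast, beta_slow);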
GPT-2 Architecture
The GPT-2 graph builder in src/models/gpt2.cpp — same building blocks, different composition:
llm_build_gpt2::llm_build_gpt2(const llama_model & model,
const llm_graph_params & params)
: llm_graph_context(params) {
inpL = build_inp_embd(model.tok_embd);
auto * inp_pos = build_inp_pos();
// GPT-2 uses learned positional embeddings, not RoPE
pos = ggml_get_rows(ctx0, model.pos_embd, inp_pos);
inpL = ggml_add(ctx0, inpL, pos);
auto * inp_attn = build_attn_inp_kv();
for (int il = 0; il < n_layer; ++il) {
cur = build_norm(inpL, model.layers[il].attn_norm,
model.layers[il].attn_norm_b,
LLM_NORM, il); // ← Standard LayerNorm (not RMSNorm)
// GPT-2 uses a fused QKV projection (single matrix multiplication)
cur = build_lora_mm(model.layers[il].wqkv, cur);
cur = ggml_add(ctx0, cur, model.layers[il].bqkv);
// Split QKV using views (zero-copy)
Qcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head, n_tokens, ...);
Kcur = ggml_view_3d(ctx0, cur, ...);
Vcur = ggml_view_3d(ctx0, cur, ...);
cur = build_attn(inp_attn, ..., 1.0f/sqrtf(float(n_embd_head)), il);
ffn_inp = ggml_add(ctx0, cur, inpL);
cur = build_norm(ffn_inp, ..., LLM_NORM, il); // ← Standard LayerNorm
cur = build_ffn(cur, ..., LLM_FFN_GELU, LLM_FFN_SEQ, il); // ← GELU + sequential gate
inpL = ggml_add(ctx0, cur, ffn_inp);
}
cur = build_norm(inpL, model.output_norm, model.output_norm_b, LLM_NORM, -1);
cur = build_lora_mm(model.output, cur);
res->t_logits = cur;
ggml_build_forward_expand(gf, cur);
}
GPT-2’s hallmarks: standard LayerNorm (with bias parameters), learned positional embeddings (added directly to input), fused QKV single projection (split via ggml_view_3d with zero copy), and GELU + sequential FFN.
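For reference, the elided view arguments encode strides and byte offsets into the fused QKV buffer. A hedged sketch assuming plain multi-head attention (n_head == n_head_kv) with a [Q | K | V] layout along the embedding dimension; the exact offsets in gpt2.cpp may differ:
Qcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head, n_tokens,
        n_embd_head*sizeof(float), cur->nb[1], 0*sizeof(float)*n_embd);
Kcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head, n_tokens,
        n_embd_head*sizeof(float), cur->nb[1], 1*sizeof(float)*n_embd);
Vcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head, n_tokens,
        n_embd_head*sizeof(float), cur->nb[1], 2*sizeof(float)*n_embd);
// All three are views into the same buffer; no data is copied.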
Key Differences
| Feature | Llama | GPT-2 |
|---|---|---|
| Normalization | RMSNorm | Standard LayerNorm |
| Positional encoding | RoPE (Rotary Positional Encoding) | Learned positional embeddings |
| QKV projection | Separate Q, K, V projections | Fused QKV single projection |
| FFN activation | SiLU + parallel gate | GELU + sequential |
| Bias | Typically no bias | Has bias |
But from a code structure perspective, the two share an identical framework: input -> layer loop (norm -> attn -> residual -> norm -> ffn -> residual) -> output. This is the power of the building block design.
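Distilled to pseudocode, the shared skeleton looks like this (schematic only; argument lists omitted):
inpL = build_inp_embd(model.tok_embd);
for (int il = 0; il < n_layer; ++il) {
    cur     = build_norm(inpL, ...);         // pre-attention norm
    cur     = build_attn(...);               // attention
    ffn_inp = ggml_add(ctx0, cur, inpL);     // residual
    cur     = build_norm(ffn_inp, ...);      // pre-FFN norm
    cur     = build_ffn(cur, ...);           // feed-forward
    inpL    = ggml_add(ctx0, cur, ffn_inp);  // residual
}
cur = build_norm(inpL, ...);                 // final norm
cur = build_lora_mm(model.output, cur);      // logits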
Part D: Graph Reuse Mechanism
Constructing a compute graph has non-trivial overhead (creating ggml tensor objects, establishing dependency relationships), especially during the autoregressive decoding phase where each ubatch may contain only a few tokens and the graph topology is exactly the same. llama.cpp avoids redundant construction through a graph reuse mechanism.
Reuse Logic in process_ubatch()
Inside process_ubatch():
// src/llama-context.cpp (simplified)
llm_graph_result * llama_context::process_ubatch(
const llama_ubatch & ubatch, ...) {
auto * res = gf_res_prev.get(); // Previous graph result
const auto gparams = graph_params(res, ubatch, mctx, gtype);
if (!graph_reuse_disable && res->can_reuse(gparams)) {
// Graph topology unchanged — reuse directly! Only update input data
n_reused++;
} else {
// Need to rebuild the graph
res->reset();
ggml_backend_sched_reset(sched.get());
auto * gf = model.build_graph(gparams); // Rebuild
ggml_backend_sched_alloc_graph(sched.get(), gf); // Reallocate
}
res->set_inputs(&ubatch); // Set new input data regardless of reuse
graph_compute(res->get_gf(), ubatch.n_tokens > 1); // Execute
return res;
}
Reuse Conditions
can_reuse() checks the following conditions:
- Same ubatch shape: `n_tokens`, `n_seq_tokens`, `n_seqs` are identical
- Same input mode: both are tokens or both are embeddings
- Unchanged sequence IDs: under `equal_seqs` mode
- Same number of outputs
- Unchanged model configuration: `causal_attn`, `embeddings`, LoRA, etc.
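To make the structure concrete, here is a hedged sketch of what the check plausibly looks like (a hypothetical simplification; member and helper names are assumed for illustration, not copied from the source):
// Hypothetical simplification: global parameters first, then every graph
// input re-checks its own shape.
bool llm_graph_result::can_reuse(const llm_graph_params & params) {
    if (!this->params.allow_reuse(params)) {
        return false; // ubatch shape / configuration mismatch
    }
    bool ok = true;
    for (auto & input : inputs) {
        ok &= input->can_reuse(params); // per-input shape checks
    }
    return ok;
}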
During typical autoregressive decoding (generating 1 token at a time), consecutive ubatches usually have exactly the same shape, so the graph reuse rate is very high.
Summary
This article walked through the complete path of compute graph construction in llama.cpp:
- One entry point: `build_graph()` dispatches 125 architectures to their respective graph builders via a giant switch statement
- Shared building blocks: all graph builders inherit from `llm_graph_context`, reusing methods like `build_norm`/`build_attn`/`build_ffn`
- Flexible composition: different architectures select normalization types, activation functions, projection strategies, and other variants by passing different enum parameters
- Graph reuse: during the autoregressive decoding phase, ubatches with unchanged shapes can skip graph construction and directly reuse the previous compute graph
The next article covers #6 Backend Scheduling & Memory Management, exploring how the constructed compute graph is assigned to CPU/GPU backends and executed.