
Backend Scheduling, Op Fusion & Memory Allocation

Updated 2026-04-15

Series context: This is article #6 in the llama.cpp source code deep-dive series, and the longest and most substantial installment. After the compute graph is built, the next question is: which backend (GPU / CPU) should execute each operation in the graph? When a model spans multiple devices (e.g., some layers on GPU, others on CPU), the scheduler must partition the compute graph into several splits, each executing on a single backend, with data copies inserted at split boundaries. This article covers three topics: Backend Scheduling (the five-pass algorithm), Op Fusion, and Tensor Memory Allocation.


Part A: Backend Scheduling

When you run a 32-layer model with --n-gpu-layers 24, the first few layers’ weights stay on the CPU while the rest are on the GPU. Operations in the compute graph cannot all be sent to a single device — a scheduler is needed to decide where each node executes and automatically insert data copies at device transitions. This is the responsibility of ggml_backend_sched.

The scheduler has three design goals:

  1. Minimize cross-device data copy count — each CPU-GPU copy incurs a fixed overhead (PCIe latency ~5-10 us) and bandwidth limitations (PCIe 4.0 x16 ~25 GB/s)
  2. Place operations on the highest-priority backend (typically GPU) whenever possible — GPU parallel compute capability far exceeds CPU
  3. Ensure each operation’s backend supports that operation type — not all backends support all operations (e.g., some quantization types are only implemented on CPU)

The five-pass algorithm is the core mechanism for achieving these three goals.

Scheduler Data Structure

ggml_backend_sched is the central scheduling structure (defined in ggml/src/ggml-backend.cpp):

struct ggml_backend_sched {
    int n_backends;
    ggml_backend_t backends[GGML_SCHED_MAX_BACKENDS];      // up to 16 backends
    ggml_backend_buffer_type_t bufts[GGML_SCHED_MAX_BACKENDS];

    int * node_backend_ids;    // assigned backend ID for each graph node
    int * leaf_backend_ids;    // assigned backend ID for each leaf

    struct ggml_backend_sched_split * splits;  // split results
    int n_splits;

    int n_copies;   // pipeline parallelism: 1 (single device) or 4 (multi-device)
    int cur_copy;
    // ...
};

Key fields explained:

  • The backends[] array is sorted by priority: index 0 is highest priority (typically GPU), and the last entry is CPU. The scheduler prefers placing operations on higher-priority backends
  • node_backend_ids and leaf_backend_ids are integer arrays sized to match the compute graph, recording which backend each node/leaf is assigned to (by index, -1 means unassigned)
  • bufts[] records the buffer type for each backend, used to check whether a tensor’s buffer is compatible with a target backend
  • n_copies controls pipeline parallelism depth — 1 means no pipelining, 4 means 4-stage pipelining

Five-Pass Algorithm

ggml_backend_sched_split_graph() is the core scheduling function. It assigns each compute graph node to a backend and partitions the graph into splits through five passes:

Five-Pass Algorithm Overview:

  • Pass 1: Initial Assignment - determine each tensor's backend from its buffer
  • Pass 2: Propagation - spread the GPU backend up and down the graph
  • Pass 3: Upgrade Optimization - try promoting CPU-assigned ops to the GPU
  • Pass 4: Source Tensor Assignment - assign backends to still-unassigned sources
  • Pass 5: Graph Splitting - split by backend and insert copies

Pass 1: Initial Assignment

Iterates over all leaves (pre-allocated input tensors such as weights) and nodes, determining their backend based on their buffer’s location:

  • Weights in a GPU buffer -> assign to GPU backend
  • Inputs in a CPU buffer -> assign to CPU backend (lowest priority)
  • If an operation has weight parameters, prefer the backend where the weights reside

After Pass 1, only tensors that “directly hold an allocated buffer” have a backend assignment. Many intermediate operations without weights (such as ADD, MUL, SOFTMAX, etc.) remain unassigned. In a typical Llama model, each layer has 20+ nodes, but only 4-5 nodes (Q/K/V/O weights and FFN weights) have direct buffer associations. Pass 1 can only assign these few nodes; the bulk of the work is left to Pass 2.
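
As a rough illustration of the Pass 1 decision (a sketch, not the actual ggml code; the helper name is made up, the fields come from the ggml_backend_sched struct shown above):

// Sketch only: assign a backend based on where the tensor's buffer already lives.
// Tensors without a pre-allocated buffer stay at -1 and are handled by Pass 2.
static int sketch_pass1_backend(struct ggml_backend_sched * sched, struct ggml_tensor * t) {
    if (t->buffer == NULL) {
        return -1; // intermediate result with no buffer yet -> unassigned
    }
    ggml_backend_buffer_type_t buft = ggml_backend_buffer_get_type(t->buffer);
    for (int b = 0; b < sched->n_backends; b++) {
        if (sched->bufts[b] == buft) {
            return b; // highest-priority backend whose default buffer type matches
        }
    }
    return -1;
}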

Pass 2: Propagation

Four sub-passes propagate backend assignments bidirectionally along the graph’s dependency edges:

  1. “Expand GPU down” (forward): if a GPU node’s successor is unassigned, propagate GPU
  2. “Expand GPU up” (reverse): if a GPU node’s predecessor is unassigned, propagate GPU
  3. “Expand rest down”: handle remaining unassigned nodes (forward), propagating non-GPU backends downward
  4. “Expand rest up”: final reverse pass to ensure no node is missed

Why propagate GPU first? Because GPU is the highest-priority backend (index 0). Propagating GPU first maximizes the number of operations placed on the GPU, reducing cross-device data copies. Only nodes that GPU cannot cover are assigned to other backends by the latter two sub-passes.

During propagation, ggml_backend_sched_set_if_supported() is called to confirm the target backend supports the operation. If the GPU backend does not support a particular op (e.g., certain specialized quantization types), that node is skipped and left for subsequent sub-passes. These skipped nodes are ultimately caught by a safety-net mechanism — see CPU Fallback Safety Net below.
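
A minimal sketch of one "expand down" sub-pass, assuming the struct fields shown earlier (the real code in ggml-backend.cpp handles more cases, such as views and offloading heuristics):

// Sketch: walk nodes in graph order and let backend_id flow into unassigned
// successors, but only while the target backend actually supports each op.
static void sketch_expand_down(struct ggml_backend_sched * sched,
                               struct ggml_cgraph * graph, int backend_id) {
    bool in_run = false;
    for (int i = 0; i < graph->n_nodes; i++) {
        struct ggml_tensor * node = graph->nodes[i];
        int * cur = &sched->node_backend_ids[i];
        if (*cur == backend_id) {
            in_run = true;                     // found a seed node on this backend
        } else if (*cur == -1 && in_run) {
            if (ggml_backend_supports_op(sched->backends[backend_id], node)) {
                *cur = backend_id;             // propagate
            }                                  // else: leave -1 for later passes
        } else if (*cur != -1) {
            in_run = false;                    // a node on another backend breaks the run
        }
    }
}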

Pass 3: Upgrade Optimization

For nodes already assigned to a lower-priority backend (e.g., CPU), attempts to upgrade them to a higher-priority backend — provided that backend supports the operation type and all input tensor buffer types. This step reduces split count: if a CPU node is upgraded to GPU, it no longer needs to be a split boundary.

Typical beneficiaries of upgrade optimization are simple element-wise operations like ADD, MUL, and SCALE. These operations are supported on both CPU and GPU, but Pass 2’s propagation may have assigned them to CPU due to path ordering. Pass 3 discovers that GPU can also handle these operations (and the input tensors can be read from GPU buffers), and upgrades them to GPU.
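
The upgrade check can be pictured roughly as follows (a sketch, not the actual implementation; ggml_backend_supports_op and ggml_backend_supports_buft are the real API calls used for this kind of test):

// Sketch: move a node to a higher-priority backend (lower index) only if that backend
// supports both the op and the buffer types of all already-allocated sources.
static void sketch_try_upgrade(struct ggml_backend_sched * sched,
                               struct ggml_tensor * node, int * backend_id) {
    for (int b = 0; b < *backend_id; b++) {        // lower index = higher priority
        if (!ggml_backend_supports_op(sched->backends[b], node)) {
            continue;
        }
        bool sources_ok = true;
        for (int s = 0; s < GGML_MAX_SRC && node->src[s] != NULL; s++) {
            ggml_backend_buffer_t buf = node->src[s]->buffer;
            if (buf != NULL &&
                !ggml_backend_supports_buft(sched->backends[b], ggml_backend_buffer_get_type(buf))) {
                sources_ok = false;                // backend cannot read this input
                break;
            }
        }
        if (sources_ok) {
            *backend_id = b;                       // upgrade; one less split boundary
            break;
        }
    }
}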

Pass 4: Source Tensor Assignment

Assigns backends to all source tensors (including view tensors) that remain unassigned. This step primarily handles “passive” tensors that have no operation of their own:

  • View tensors inherit the parent tensor’s backend (since a view is just a window into the parent tensor’s memory and does not own its own memory)
  • Other source tensors inherit the backend of their consumer node (following the “data locality principle” — prepare data on the device where the consumer resides)

After Pass 4, all tensors in the compute graph (including nodes, leaves, and sources) have a definite backend assignment.

Pass 5: Graph Splitting and Copy Insertion

Traverses the assignment results and creates split boundaries at backend transitions. Specifically, it scans nodes sequentially; when two adjacent nodes have different backend IDs, a new split begins at that point. For input tensors that cross split boundaries, the scheduler creates a duplicate (dup) of the tensor, allocates space on the target backend’s buffer, and replaces graph references with this duplicate — the original tensor is recorded as a split input and automatically copied by the scheduler at execution time.
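
The boundary scan itself is conceptually simple; a sketch (record_split is a made-up helper standing in for the split bookkeeping in ggml-backend.cpp):

// Sketch: a new split begins whenever the assigned backend changes between
// consecutive nodes; each split covers a contiguous node range [start, end).
int start = 0;
int cur_backend = sched->node_backend_ids[0];
for (int i = 1; i <= graph->n_nodes; i++) {
    if (i == graph->n_nodes || sched->node_backend_ids[i] != cur_backend) {
        record_split(sched, cur_backend, start, i);   // hypothetical helper: stores backend_id, i_start, i_end
        if (i < graph->n_nodes) {
            cur_backend = sched->node_backend_ids[i];
            start = i;
        }
    }
}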

CPU Fallback Safety Net

The five-pass algorithm has an important implicit guarantee: no operation is left without a home. If a GPU backend does not support a particular op, the op does not error out — it automatically falls back to CPU. This is because the CPU backend implements ggml’s complete operation set, serving as a universal fallback.

This guarantee spans multiple passes: during Pass 2 propagation, ggml_backend_sched_set_if_supported() detects that the GPU does not support an op and skips that node (keeping backend_id = -1); Pass 3’s upgrade optimization also will not force an assignment. After the first four passes, if any nodes remain unassigned, the scheduler iterates through all backends one by one until a supported one is found:

// ggml-backend.cpp -- fallback logic after Pass 2
for (int b = 0; b < sched->n_backends && *cur_backend_id == -1; b++) {
    ggml_backend_sched_set_if_supported(sched, node, b, cur_backend_id);
}
GGML_ASSERT(*cur_backend_id != -1);

Since the last element of the backends[] array is always CPU, and the CPU backend supports all ops, this assert should never trigger under normal circumstances.

The flow diagram below illustrates a specific fallback scenario — suppose a WebGPU backend encounters a CONV_2D operation it does not support (e.g., when running a vision model):

Unsupported Op Automatically Falls Back to CPU:

  1. A CONV_2D node reaches the Pass 2 propagation phase
  2. WebGPU supports_op? Returns false
  3. CPU supports_op? Returns true (CPU supports all ops)
  4. The node is assigned to the CPU backend (backend_id = cpu_id)
  5. Pass 5 inserts copy nodes: WebGPU->CPU and CPU->WebGPU
  6. Normal execution: the fallback op runs on CPU, the rest stays on the GPU

The cost of fallback is performance, not correctness. The op that falls back to CPU will itself run slower, and more importantly, Pass 5 inserts one tensor copy each for GPU->CPU and CPU->GPU, introducing additional data transfer overhead. If many ops fall back, frequent GPU-CPU copying becomes the dominant bottleneck. However, for current mainstream backends (CUDA, Metal, Vulkan) that already cover all core LLM inference ops, standard Transformer text models typically do not trigger fallback.

Interactive Demo

The interactive component below demonstrates how the five-pass algorithm assigns each node in a small compute graph (simulating a 2-layer model with some weights on GPU and some on CPU) to a backend, and ultimately partitions it into splits. Click “Next” to step through each pass and observe the color changes for each node — blue represents GPU, gray represents CPU, and white represents unassigned. The final step shows the split boundaries and automatically inserted copy nodes:

[Interactive demo: Backend Scheduler 5-Pass Visualization. Step 0/5 shows "Pass 0: Initial State": the computation graph is built with all nodes unassigned; Embedding (CPU weight) and Output (GPU weight) have known locations, while the Layer 0 / Layer 1 Attn and FFN nodes start unassigned. Legend: GPU / CPU / Unassigned.]

ggml_backend_sched_split Structure

After the five-pass algorithm completes, the compute graph is partitioned into several splits. Each split has the following data structure:

struct ggml_backend_sched_split {
    int backend_id;           // which backend this split runs on
    int i_start, i_end;       // node range in the graph
    struct ggml_tensor * inputs[GGML_SCHED_MAX_SPLIT_INPUTS];  // inputs that need copying
    int n_inputs;             // up to 30
    struct ggml_cgraph graph; // subgraph view
};

The inputs array records tensors that need to be copied from other backends into this split. GGML_SCHED_MAX_SPLIT_INPUTS is limited to 30 — more than sufficient for typical Transformer models, since split boundaries usually involve only a single hidden state tensor that needs cross-device transfer. The graph field is a view of the original large graph; it does not own its own node array but simply delineates a range using i_start and i_end.

Splitting Example

Consider a concrete splitting scenario: a 32-layer model with n_gpu_layers=24, where the output layer counts toward the offloaded layers, so i_gpu_start = 32 - 24 + 1 = 9. Layers 0-8 (the first 9 layers) run on the CPU, while layers 9-31 plus the output run on the GPU. This is a typical configuration when the user's VRAM cannot hold the entire model. The five-pass algorithm produces 2 splits:

32-Layer Model Splitting Example (n_gpu_layers=24):

  Split 0 (CPU):  Embedding -> Layer 0 -> ... -> Layer 8
                        |
                        |  tensor copy: CPU -> GPU
                        v
  Split 1 (GPU):  Layer 9 -> ... -> Layer 31 -> Output

At the split boundary, Layer 8’s output (the hidden state tensor) needs to be copied from CPU to GPU. This copy operation is automatically inserted by the scheduler. From the user’s perspective, all of this is transparent — just set n_gpu_layers, and the scheduler automatically handles all splitting and copying.

Note that this 32-layer model produces only 2 splits and 1 copy. This is the scheduler’s goal — minimize split count (i.e., minimize cross-device copy count). If all layers were alternately assigned to CPU and GPU, every layer would theoretically be a split boundary, and performance would be completely dominated by data copies. In practice, Pass 3’s upgrade optimization exists precisely to eliminate such unnecessary split boundaries.

Cross-Backend Data Copying

During the execution phase (ggml_backend_sched_compute_splits()), before each split begins, the scheduler handles that split’s input copies:

// Pseudocode (simplified from ggml_backend_sched_compute_splits())
for (int i = 0; i < split->n_inputs; i++) {
    struct ggml_tensor * input     = split->inputs[i];                                // source tensor (on another backend)
    struct ggml_tensor * input_cpy = tensor_copy(input, split->backend_id, cur_copy); // pre-created duplicate on this split's backend

    // Try async copy first
    if (split_backend->iface.cpy_tensor_async &&
        split_backend->iface.cpy_tensor_async(input_backend, split_backend, input, input_cpy)) {
        // Async copy enqueued, non-blocking
    } else {
        // Fall back to sync copy: wait until the source backend's data is ready, then copy
        ggml_backend_synchronize(input_backend);
        ggml_backend_tensor_copy(input, input_cpy);
    }
}

The cur_copy parameter in tensor_copy() is related to pipeline parallelism — it specifies which set of tensor copies to use. On a single device, cur_copy is always 0.

Copy operations prefer async copy (cpy_tensor_async) — enqueueing the data transfer into the backend’s command queue without blocking the CPU. Only when the backend does not support async copy does it fall back to sync copy (ggml_backend_tensor_copy), which requires synchronizing the source backend first to ensure data is ready.

MoE Optimization: For MoE model expert weights, the scheduler does not copy all experts — it reads the routing result and copies only the weights of actually activated experts. For example, in a top-2-of-8 MoE layer, only 2 experts’ weights need to be copied, saving approximately 75% of copy bandwidth. This is critical for the performance of large MoE models (e.g., Mixtral 8x7B) — if all experts were copied, cross-device bandwidth could become a severe bottleneck.

The scheduler determines which experts are activated by reading the MoE routing node’s output in ggml_backend_sched_compute_splits(). This means the MoE routing computation must complete before expert weight copying — the scheduler ensures this execution order.

Pipeline Parallelism

When a model is distributed across multiple GPUs (--split-mode layer), purely sequential execution leads to significant GPU idle time: after GPU 0 finishes split 0, it must wait for data to be copied to GPU 1, and GPU 1 can only start then — GPU 0 sits completely idle. llama.cpp uses pipeline parallelism to overlap computation and copying, eliminating this wait.

The implementation maintains 4 sets of tensor copies (GGML_SCHED_MAX_COPIES = 4), used in rotation. For a 2-GPU example, the execution flow is:

  1. GPU 0 computes Split 0 (first half of layers)
  2. Split 0’s results begin async copy to GPU 1 (CPU->GPU1 or GPU0->GPU1)
  3. GPU 1 receives the data and computes Split 1 (second half of layers)
  4. When the next ubatch arrives, GPU 0 can start computing immediately without waiting for GPU 1 to finish

The 4-copy design lets the scheduler use copies in round-robin fashion — while GPU 0 is computing split N, split N-1’s results are being copied to GPU 1, and GPU 1 can start computing right away — avoiding GPU idle time waiting for data. This is similar in concept to CPU pipelining: by adding a small amount of storage resources (4 copy sets), computation and transfer can overlap.

Activation conditions (src/llama-context.cpp):

  • Multiple devices (model.n_devices() > 1)
  • Layer split mode (LLAMA_SPLIT_MODE_LAYER)
  • KV cache also offloaded to the GPU (the offload_kqv parameter)
  • Backend supports async computation and event synchronization

If these conditions are not met (e.g., only one GPU, or using --split-mode none), n_copies is 1 and the scheduler degenerates to sequential execution — each split waits for the previous one to complete. This is the actual situation for most users. Multi-GPU pipeline parallelism mainly targets large model (70B+) deployment scenarios where the model cannot fit in a single GPU’s VRAM.
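
Under those assumptions, the gate in src/llama-context.cpp looks roughly like the following sketch (field names paraphrased; backends_support_async_and_events is a placeholder summarizing the last condition, not a real variable):

// Sketch of the pipeline-parallelism gate: the 4-copy scheme is only enabled for a
// multi-device, layer-split setup with the KV cache offloaded as well.
const bool pipeline_parallel =
    model.n_devices() > 1 &&                                  // multiple devices
    model.params.split_mode == LLAMA_SPLIT_MODE_LAYER &&      // layer split mode
    cparams.offload_kqv &&                                    // KV cache on the GPU too
    backends_support_async_and_events;                        // placeholder for the async/event check
// the scheduler is then created with n_copies = pipeline_parallel ? 4 : 1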


Part B: Op Fusion and Graph Optimization

After the compute graph is split and before memory is actually allocated, each split undergoes a round of graph optimization. The core technique is Op Fusion — merging multiple small operations into a single large operation to reduce kernel launch overhead and intermediate tensor memory reads/writes. This is one of the key factors in GPU inference performance optimization.

Optimization Timing and Location

Graph optimization occurs at a specific point in the scheduling pipeline — after splitting, before graph copy. This timing is deliberate: optimization needs to know which backend each split runs on (because different backends support different fusions), but must complete before actual memory allocation (because fusion changes tensor counts and dependency relationships).

Graph Optimization's Position in the Scheduling Pipeline:

  Five-Pass Scan (backend assignment)
    -> Graph Splitting (create splits)
    -> Graph Optimization (per split)
    -> Graph Copy (add copy nodes)
    -> Memory Allocation

Each split is optimized independently, executed by its corresponding backend’s graph_optimize callback:

// ggml/src/ggml-backend.cpp
// Inside ggml_backend_sched_split_graph()
for (int i = 0; i < sched->n_splits; i++) {
    struct ggml_backend_sched_split * split = &sched->splits[i];
    split->graph = ggml_graph_view(graph, split->i_start, split->i_end);

    // Each backend executes its own optimization
    ggml_backend_graph_optimize(sched->backends[split->backend_id], &split->graph);
}

// graph_optimize implementation -- simple callback dispatch
static void ggml_backend_graph_optimize(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
    if (backend->iface.graph_optimize != NULL) {
        backend->iface.graph_optimize(backend, cgraph);
    }
}

Different backends can register different optimization strategies — the CPU backend does no optimization (callback is NULL), while CUDA, Vulkan, and Metal each have their own fusion rules.

Typical Fusion Patterns

1. RMSNorm + Scale Fusion

In Transformers, RMSNorm is typically followed immediately by an element-wise multiplication (multiplying by weight). Fusing these two steps into a single kernel saves one full tensor read/write cycle.

Before fusion:                      After fusion:
x -> RMS_NORM -> tmp -> MUL -> y    x -> RMS_NORM_FUSED -> y
                       ^                                ^
                     weight                           weight

Before fusion requires:             After fusion only requires:
  1. Read x from VRAM                 1. Read x + weight from VRAM
  2. Write tmp to VRAM                2. Write y to VRAM
  3. Read tmp + weight from VRAM      (saves one full VRAM read/write)
  4. Write y to VRAM
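
To put a rough number on the savings: for an illustrative ubatch of 512 tokens with n_embd = 4096 at f16, the tmp tensor is 4096 x 512 x 2 bytes, about 4 MB, so eliminating its write and re-read avoids roughly 8 MB of VRAM traffic per layer for this one fusion.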

Vulkan, Metal, OpenCL, Hexagon, and other backends all implement this fusion. Taking Vulkan as an example, it can further extend to longer fusion chains:

  • RMS_NORM + MUL: basic fusion — saves 1 global memory read/write
  • RMS_NORM + MUL + ROPE: three-stage fusion — normalization, scaling, and positional encoding completed in one pass, saving 2 read/writes
  • MUL_MAT + ADD: matrix multiply + bias fusion — adds bias directly at the end of the GEMM kernel, avoiding an extra element-wise kernel

2. Flash Attention

Flash Attention is not a traditional op fusion but an algorithm-level optimization — replacing the entire attention computation Q*K^T -> softmax -> *V with a single efficient operation GGML_OP_FLASH_ATTN_EXT. Traditional multi-step attention requires 3 kernel launches and 2 intermediate tensor global memory read/writes; Flash Attention compresses this to 1 kernel, completing all computation in SRAM using tiling.

The decision to enable Flash Attention happens at graph construction time (src/llama-graph.cpp):

const bool use_flash_attn = cparams.flash_attn && kq_b == nullptr;
if (use_flash_attn) {
    // Replace QK^T -> softmax -> V sequence with a single ggml_flash_attn_ext() op
    cur = ggml_flash_attn_ext(ctx0, q, k, v, kq_mask, kq_scale, ...);
} else {
    // Traditional multi-step attention
    kq = ggml_mul_mat(ctx0, k, q);           // QK^T
    kq = ggml_soft_max_ext(ctx0, kq, ...);   // softmax
    cur = ggml_mul_mat(ctx0, v, kq);         // *V
}

Flash Attention availability is automatically probed during context initialization (in the llama_context constructor in src/llama-context.cpp): a test graph is built and checked to see whether the GGML_OP_FLASH_ATTN_EXT node is assigned to the device hosting the KV cache. If the devices don’t match (indicating the backend doesn’t support it), it automatically degrades to traditional attention. This “build test graph -> check backend assignment” probing pattern is used in multiple places in llama.cpp — it avoids hard-coding a mapping of “which backend supports which operation” and instead lets the scheduler itself answer this question. Note the kq_b == nullptr condition — when attention has bias (e.g., ALiBi), Flash Attention is not applicable and falls back to the traditional path.

3. CUDA Attention Parallelism Optimization

The CUDA backend’s graph optimization is not about fusing operations but about reordering — identifying the fork-join pattern of Q/K/V parallel computation in Transformers and interleaving the three branches’ operations to extend tensor lifetimes and improve GPU stream parallel utilization. Specifically, in a standard Transformer layer, the Q/K/V projections are independent — the Q branch contains MatMul -> RoPE -> ..., with K and V being similar. In the original graph, these operations are laid out branch-by-branch (all Q operations -> all K operations -> all V operations); after reordering, they become interleaved (Q0, K0, V0, Q1, K1, V1, …). This interleaving lets the GPU release completed branches’ intermediate tensors sooner while also giving CUDA streams more opportunities for parallelism.

4. Gated Delta Net (GDN) Fusion

For the Gated Delta Net architecture (e.g., Qwen3.5, DeltaNet), multiple gate/delta/net operations are fused into a single GGML_OP_GATED_DELTA_NET operation. Support detection likewise happens automatically during context initialization (adjacent to Flash Attention probing, in the llama_context constructor). Like Flash Attention, GDN fusion is an algorithm-level optimization — the decision to use the fused version is made during graph construction, not during the graph_optimize phase afterward.

Fusion Safety Checks

Not any arbitrary operations can be fused. ggml provides ggml_can_fuse() (an inline function defined in ggml-impl.h) to check fusion feasibility:

  • Nodes to be fused must be within a valid range in the dependency graph
  • Intermediate tensors cannot be graph outputs (other nodes still need them)
  • View operations must be entirely contained within the fusion range
  • Fusion must not break data dependencies
  • All involved nodes must be in the same split (nodes across different backends cannot be fused)

These checks are performed uniformly by ggml_can_fuse(). If a check fails, graph_optimize skips that fusion candidate and preserves the original multi-step operations. Better to skip fusion than to produce incorrect computation results.
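
A hedged sketch of how a backend might use this check inside its dispatch loop (the exact ggml_can_fuse() signature and the kernel-launch helpers here are assumptions for illustration):

// Sketch: scan the split's nodes, test the RMS_NORM + MUL pattern, and dispatch a
// fused kernel when the safety checks pass; otherwise fall back to single-op kernels.
const enum ggml_op pattern[2] = { GGML_OP_RMS_NORM, GGML_OP_MUL };
for (int i = 0; i < cgraph->n_nodes; i++) {
    if (i + 1 < cgraph->n_nodes && ggml_can_fuse(cgraph, i, pattern, 2)) {
        launch_rms_norm_mul_fused(cgraph->nodes[i], cgraph->nodes[i + 1]);   // hypothetical fused kernel launch
        i++;                                                                 // skip the MUL node just covered
    } else {
        launch_single_op(cgraph->nodes[i]);                                  // hypothetical per-op dispatch
    }
}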

Op Fusion Summary

Graph optimization is designed to be modular — the core scheduler only calls the graph_optimize callback on each split, with specific fusion rules implemented by each backend independently. This lets CUDA do Q/K/V reordering, Vulkan do RMSNorm+MUL+ROPE three-stage fusion, and Metal do its own optimizations — without interfering with each other. For backends that don’t support optimization (like CPU), the callback is NULL and simply skipped.

This design has another benefit: new backends only need to implement the graph_optimize callback to gain fusion capabilities, without modifying core scheduling code. For example, when adding a new NPU backend, one simply implements the fusion patterns specific to that NPU in its graph_optimize.

It’s important to note that not all optimizations happen during the graph_optimize phase. Flash Attention and GDN replacement occur during the graph construction phase (build_graph()), because they change the semantics of operations (from multi-step operations to a single operation), not merely how they’re executed. Fusions during the graph_optimize phase are more “local” — they don’t change computation semantics, only execution.


Part C: Tensor Memory Allocation

Backend scheduling determines “where to execute,” Op Fusion determines “what to execute,” and the next question is “where does the memory come from.” llama.cpp has three categories of tensors with fundamentally different memory management strategies: weight tensors (long-lived), KV cache tensors (session-level), and intermediate activation tensors (compute-and-release). Understanding the allocation strategies for these three categories is key to grasping llama.cpp’s memory usage and performance characteristics.

Three Tensor Lifecycle Categories Compared

The table below summarizes the key differences in allocation strategy among the three tensor categories:

Category                 | Allocation Timing | Lifetime          | Memory Location                        | Release Timing                      | Typical Size
Weights                  | Model loading     | Model lifetime    | mmap or GPU buffer                     | Model destruction                   | Several GB
KV Cache                 | Context creation  | Inference session | Per-layer on the corresponding device  | Context destruction                 | Hundreds of MB to several GB
Intermediate Activations | Each compute      | Single ubatch     | Temporary buffer managed by scheduler  | Immediately when refcount reaches 0 | Tens of MB

Let’s examine each category’s management strategy in depth.

Weight Tensors: mmap or Backend Buffer

Model weights are allocated at load time with a lifetime matching the model. The allocation method depends on whether mmap is enabled:

Weight Tensor Loading Path:

  load_data_for(tensor)
    -> use_mmap?
       Yes (mmap mode):     tensor->data = mmap_addr + offset  (zero-copy, points directly at the file mapping)
       No (non-mmap mode):  read from file into a backend buffer (GPU VRAM or pinned CPU memory)

In mmap mode, the tensor’s data pointer points directly to the file-mapped region — no extra memory is needed, but data stays in main memory, so the GPU must read it across the bus each time it’s used.

In non-mmap mode, weights are read into a backend buffer (typically GPU VRAM) and accessed locally on the GPU for subsequent use. To speed up uploads, llama.cpp uses 4 pinned staging buffers (64MB for Direct I/O, 1MB in normal mode) forming a pipeline — one buffer performs I/O reads while another is being GPU DMA-transferred. This async upload mechanism shares the same design philosophy as Pipeline Parallelism: using multiple buffer sets to overlap I/O and computation.
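
A simplified sketch of the two paths described above (the variable names and the read_tensor_bytes helper are placeholders; the real loader adds validation, staging buffers, and async uploads):

// Sketch: mmap mode points the tensor at the file mapping; non-mmap mode reads the
// bytes and pushes them into the backend buffer (e.g., GPU VRAM).
if (use_mmap) {
    tensor->data = (uint8_t *) mapping_addr + tensor_file_offset;                 // zero-copy
} else {
    std::vector<uint8_t> staging(ggml_nbytes(tensor));
    read_tensor_bytes(file, tensor_file_offset, staging.data(), staging.size());  // placeholder I/O helper
    ggml_backend_tensor_set(tensor, staging.data(), 0, staging.size());           // upload into the backend buffer
}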

The two modes suit different scenarios: mmap is ideal for CPU-only inference (avoids doubling memory), while non-mmap is ideal for GPU inference (data resides in GPU VRAM, avoiding repeated bus transfers). llama.cpp automatically selects based on the n_gpu_layers and use_mmap parameters — when all layers are offloaded to GPU, weights naturally use non-mmap mode for upload to GPU; when running entirely on CPU, mmap is the default.

KV Cache: Session-Level Allocation

The KV cache is pre-allocated all at once when the context is created, surviving the entire inference session. This means regardless of how many tokens are actually used, the KV cache always occupies a fixed amount of memory. Each Transformer layer has independent K and V tensors:

// src/llama-kv-cache.cpp (simplified)
// Create 3D tensor per layer: [embd_size, kv_size, n_stream]
ggml_tensor * k = ggml_new_tensor_3d(ctx, type_k, n_embd_k_gqa, kv_size, n_stream);
ggml_tensor * v = ggml_new_tensor_3d(ctx, type_v, n_embd_v_gqa, kv_size, n_stream);

The three dimensions explained:

  • First dimension n_embd_k_gqa / n_embd_v_gqa: embedding dimension per attention head multiplied by GQA group count. In GQA (Grouped-Query Attention) models, K/V have fewer heads than Q, so this dimension is also smaller
  • Second dimension kv_size: KV cache capacity (number of slots), controlled by the n_ctx parameter. This is the primary determinant of KV cache memory usage
  • Third dimension n_stream: controls multi-sequence isolation (1 for unified cache, otherwise equals n_seq_max)

type_k / type_v are configurable — f16 by default, but quantized types like q8_0 or q4_0 can be used to save VRAM.

The KV cache is allocated per-layer to each device — if a layer’s weights are on the GPU, that layer’s KV cache is also on the GPU (unless KV offloading is disabled with --no-kv-offload). This strategy ensures that K/V tensors sit on the same device as the layer’s computation during attention, eliminating cross-device copies.

KV cache quantization type has a significant impact on VRAM usage: for a 4096-dimensional model with a 4K context, a single layer’s K+V occupies 64MB at f16 but only about 18MB at q4_0 (roughly 4.5 bits per element), so a 32-layer model saves roughly 1.5GB of VRAM. Users control the quantization type via the --cache-type-k and --cache-type-v parameters, trading off precision against VRAM.
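
The per-layer figure follows directly from the tensor shapes; a back-of-the-envelope version (assuming no GQA, so n_embd_k_gqa = n_embd = 4096, and n_ctx = 4096):

// K and V each hold n_embd * n_ctx elements per layer.
// f16:   2 bytes/element   -> 2 * 4096 * 4096 * 2 bytes = 64 MiB per layer
// q4_0: ~4.5 bits/element  -> 64 MiB * (4.5 / 16)       = ~18 MiB per layer
// Over 32 layers the difference is roughly 1.5 GB of KV cache memory.
size_t kv_f16_per_layer  = 2ull * 4096 * 4096 * 2;                    // 67,108,864 bytes
size_t kv_q4_0_per_layer = (size_t)(kv_f16_per_layer * 4.5 / 16.0);   // ~18.9 MB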

Intermediate Tensors: Graph Allocator

Intermediate activation tensors are the most frequently created and destroyed — each process_ubatch() call produces a large number of temporary tensors (attention scores, FFN intermediate results, etc.). These tensors are managed collectively by ggml_gallocr (graph allocator). Unlike weights and KV cache, intermediate tensors do not need to persist across ubatches — once a ubatch computation completes, all intermediate tensor memory can be reclaimed.

Allocation Strategy: Reference Counting + Immediate Release

Before allocation, the graph allocator statically analyzes the compute graph’s dependencies (rather than tracking them dynamically at runtime), counting how many downstream nodes reference each tensor. This lets the allocator determine every tensor’s lifetime before computation begins and make optimal memory-reuse decisions. When a tensor’s last consumer completes, its memory is reclaimed immediately:

Reference Counting Release Flow:

  Tensor A (refcount = 2) is consumed by Node B and Node C
    -> B completes: refcount = 1, A is not released
    -> C completes: refcount = 0, A is released immediately
    -> A's memory is reused by subsequent tensors
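
In code terms, the static analysis amounts to counting source references before any node runs; a sketch with illustrative names (the real bookkeeping lives in ggml_gallocr's internal hash tables):

// Sketch: count how many nodes consume each tensor, then walk the graph in order,
// allocating outputs and freeing inputs whose last consumer has just executed.
std::unordered_map<ggml_tensor *, int> refcount;
for (int i = 0; i < graph->n_nodes; i++) {
    for (int s = 0; s < GGML_MAX_SRC; s++) {
        if (graph->nodes[i]->src[s] != NULL) {
            refcount[graph->nodes[i]->src[s]]++;
        }
    }
}
for (int i = 0; i < graph->n_nodes; i++) {
    allocate_output(graph->nodes[i]);                 // hypothetical: reserve space for the result
    for (int s = 0; s < GGML_MAX_SRC; s++) {
        ggml_tensor * src = graph->nodes[i]->src[s];
        if (src != NULL && --refcount[src] == 0) {
            free_if_temporary(src);                   // hypothetical: return the block to the free list
        }
    }
}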

In-Place Operation Optimization

If an operation’s input tensor is no longer used by anyone after the operation (refcount=1, no views), the allocator lets the output directly reuse the input’s memory — zero extra allocation. Supported in-place operations include ADD, MUL, SCALE, RMS_NORM, SOFT_MAX, ROPE, etc.

In-place operations are very common in Transformers: ADD in residual connections (adding attention output to hidden state), RMS_NORM in normalization layers, ROPE for positional encoding — these operations’ inputs are typically consumed by only one downstream node, making them ideal candidates for in-place execution.
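
A sketch of the in-place eligibility test (illustrative; the actual conditions live inside ggml_gallocr_alloc_graph and also consider op-specific rules):

// Sketch: the output may reuse the input's memory only if this node is the input's
// last consumer, the input is not a graph output, not a view, and the layouts match.
static bool sketch_can_run_inplace(const ggml_tensor * node, const ggml_tensor * src, int src_refcount) {
    return src_refcount == 1 &&
           src->view_src == NULL &&
           (src->flags & GGML_TENSOR_FLAG_OUTPUT) == 0 &&
           node->type == src->type &&
           ggml_are_same_shape(node, src);
}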

In-place operations and Op Fusion are complementary optimizations: Op Fusion reduces kernel launch count and intermediate tensor count, while in-place operations reduce memory allocation for existing tensors. Together, they keep peak intermediate activation VRAM well below the theoretical upper bound.

Memory Pool Strategy

The underlying ggml_dyn_tallocr uses a best-fit algorithm: maintaining a free-block list sorted by size, finding the smallest sufficiently large block for each allocation. If no suitable block exists, it extends the last block or creates a new chunk. Each backend has its own independent memory pool — GPU and CPU intermediate activation memory are completely isolated.
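
A compact sketch of the best-fit strategy itself (not the ggml_dyn_tallocr code; chunk growth and alignment are omitted):

#include <cstddef>
#include <vector>

struct free_block { size_t offset; size_t size; };

// Sketch: pick the smallest free block that still fits, carve the allocation from
// its front, and shrink the block. The real allocator grows the pool when nothing fits.
static size_t sketch_best_fit_alloc(std::vector<free_block> & free_list, size_t size) {
    int best = -1;
    for (int i = 0; i < (int) free_list.size(); i++) {
        if (free_list[i].size >= size &&
            (best == -1 || free_list[i].size < free_list[best].size)) {
            best = i;
        }
    }
    if (best == -1) {
        return (size_t) -1;   // caller would extend the last block or add a new chunk
    }
    size_t offset = free_list[best].offset;
    free_list[best].offset += size;
    free_list[best].size   -= size;
    return offset;
}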

This design means peak intermediate activation VRAM usage is far smaller than the theoretical upper bound of “all intermediate tensors existing simultaneously.” Empirical measurements show that for a typical 7B parameter model, peak intermediate activation VRAM is usually only tens of MB, far less than weights (approximately 3-7 GB depending on quantization level) and KV cache (hundreds of MB to several GB depending on context length).

Allocation Flow

ggml_backend_sched_alloc_graph() is the entry point for graph memory allocation:

Graph Memory Allocation Flow:

  ggml_backend_sched_alloc_graph()
    -> split_graph(): split + optimize
    -> has the backend assignment changed?
         No:  alloc_graph() directly, reusing the previous layout
         Yes: reserve_n() recalculates buffer sizes, then alloc_graph() with the new layout
    -> allocation complete

Key optimization: if the graph structure and backend assignments haven’t changed (which is typically the case during autoregressive decoding), the allocator reuses the previous memory layout, skipping the reserve step. reserve_n() is a relatively expensive operation — it needs to traverse the entire compute graph and calculate the maximum buffer size needed for each backend — so skipping it noticeably improves decoding latency.

The alloc phase also handles one detail: ensuring that split input duplicates (dup tensors) are allocated to the correct backend buffer. These dup tensors are “placeholders” created by Pass 5 that only receive real memory space at this point.

Additionally, alloc_graph() internally calls ggml_gallocr_alloc_graph(), which performs the reference counting analysis and in-place operation detection mentioned earlier — this is the key step enabling efficient intermediate tensor reuse.


The Complete Pipeline Across All Three Parts

Parts A/B/C are not three independent subsystems but a single complete pipeline. Each process_ubatch() call traverses this path:

Complete Path from Compute Graph to Execution:

  build_graph()          build compute graph (previous article)
    -> split_graph()       Part A: five-pass scan + splitting
    -> graph_optimize()    Part B: per-split Op Fusion
    -> alloc_graph()       Part C: intermediate tensor memory allocation
    -> graph_compute()     execute computation (next article)

During autoregressive decoding, if the graph structure hasn’t changed (can_reuse() returns true, see the graph reuse mechanism in the previous article), build_graph() and split_graph() can be skipped entirely, starting directly from alloc_graph() — and alloc_graph() itself also skips the reserve step since the backend assignment hasn’t changed. This layered optimization brings the steady-state decoding overhead close to zero, returning the true bottleneck to computation itself.


Summary

This article covered three critical stages between compute graph construction and actual execution in llama.cpp:

  1. Backend Scheduling: The five-pass algorithm (initial assignment -> propagation -> upgrade optimization -> source tensor assignment -> graph splitting) assigns each compute graph node to an appropriate backend, creating split boundaries and inserting data copies at backend transitions. Pipeline parallelism overlaps computation and copying through 4 sets of tensor copies
  2. Op Fusion: A modular graph optimization framework where each backend implements its own fusion rules via the graph_optimize callback — RMSNorm+Scale fusion, Flash Attention, CUDA Q/K/V reordering, GDN fusion, etc. Safety is checked via ggml_can_fuse() before fusion
  3. Tensor Memory Allocation: Three-tier management — weights (mmap / GPU buffer, async pipelined upload), KV cache (session-level, supports quantization to save VRAM), intermediate activations (reference counting + in-place operations + best-fit memory pool)

This three-tier design gives llama.cpp precise control over memory usage: weights and KV cache are “known costs” that can be calculated exactly at load time; intermediate activations minimize peak usage through reference counting and in-place operations. The allocator also reuses the previous memory layout when the graph structure is unchanged, further reducing overhead during autoregressive decoding.

From the user’s perspective, these mechanisms answer some common questions: why isn’t --n-gpu-layers always “bigger is better” (it may cause frequent CPU-GPU copying)? Why does --flash-attn significantly reduce VRAM usage (it eliminates intermediate tensors for attention scores)? Why is autoregressive decoding overhead so low (graph reuse + memory layout reuse)? All of these stem from the scheduling and memory management mechanisms introduced in this article.

The next article will proceed to #7 Execution and Sampling, exploring how compute graphs are actually executed on each backend (CUDA kernel dispatch, Metal command buffer, CPU thread pool, etc.), and how the sampling strategy chain transforms logits into the final token.