
Scheduling and Preemption: The Inference Engine Scheduler

Updated 2026-04-06

The Scheduler: Brain of the Inference Engine

In the previous article, we solved the memory management problem (PagedAttention) and the batching problem (Continuous Batching). But a key question remains: when GPU resources are insufficient to serve all requests simultaneously, who goes first? Who gets paused?

This is the Scheduler’s responsibility. It is the “brain” of the inference engine, making a decision at every decode iteration: which requests continue running, which new requests can join, and which running requests need to be preempted to make room.

Request State Machine

Each request tracked by the Scheduler is in one of four states:

Figure: Scheduler request state machine. Transitions between Waiting, Running, Swapped, and Finished are triggered by events such as a GPU slot becoming free, running out of memory, swap completion, and EOS generation.
  • Waiting: the request has arrived and is queued for a GPU slot
  • Running: actively executing prefill or decode on the GPU
  • Swapped: preempted, with KV Cache moved to CPU or discarded, awaiting restoration
  • Finished: generated an EOS token or reached max_length, all resources released

Core transitions: Waiting → Running (scheduling), Running → Swapped (preemption), Swapped → Running (restoration). After each iteration, the Scheduler re-evaluates the state of all requests.
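
As a minimal sketch (not vLLM's actual implementation), the state machine can be expressed as an enum plus a table of legal transitions; the names RequestState and transition below are illustrative:

```python
from enum import Enum, auto

class RequestState(Enum):
    WAITING = auto()   # queued, waiting for a GPU slot
    RUNNING = auto()   # executing prefill or decode on the GPU
    SWAPPED = auto()   # preempted; KV Cache moved to CPU or discarded
    FINISHED = auto()  # EOS or max_length reached; resources released

# Legal transitions mirroring the diagram above.
TRANSITIONS = {
    RequestState.WAITING:  {RequestState.RUNNING},                         # scheduled
    RequestState.RUNNING:  {RequestState.SWAPPED, RequestState.FINISHED},  # preempted / done
    RequestState.SWAPPED:  {RequestState.RUNNING},                         # restored
    RequestState.FINISHED: set(),
}

def transition(current: RequestState, target: RequestState) -> RequestState:
    """Validate a state change before the scheduler applies it."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```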

Scheduling Policies

The most basic policy is FCFS (First Come, First Served), but it does not account for request priority. Production environments typically require more sophisticated policies:

Figure: Gantt-chart comparison of FCFS, Priority, and Shortest Job First on the same request batch (R1, R2, VIP, R3, R4 over two GPU slots), with average wait time and average completion time shown per policy.

FCFS: Simple and fair, but VIP users’ requests may get stuck behind a large volume of ordinary requests.

Priority Scheduling: Assigns different priority levels to requests (e.g., VIP users, paid users, free users), with higher-priority requests getting GPU slots first. The downside is that low-priority requests may be “starved.”

Shortest Job First (SJF): Estimates request length and prioritizes shorter requests, minimizing average completion time. However, long requests may be continually postponed.
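
To make the difference concrete, here is a toy sketch (request data and field names are made up) showing how each policy orders the same waiting queue simply by changing the sort key:

```python
# Each request records its arrival order, a priority level (lower = more
# important), and an estimated output length used by SJF.
requests = [
    {"id": "R1",  "arrival": 0, "priority": 2, "est_tokens": 400},
    {"id": "R2",  "arrival": 1, "priority": 2, "est_tokens": 50},
    {"id": "VIP", "arrival": 2, "priority": 0, "est_tokens": 200},
]

POLICIES = {
    # FCFS: order purely by arrival time.
    "fcfs": lambda r: r["arrival"],
    # Priority: higher-priority (lower number) first, FCFS as tie-breaker.
    "priority": lambda r: (r["priority"], r["arrival"]),
    # SJF: shortest estimated job first, FCFS as tie-breaker.
    "sjf": lambda r: (r["est_tokens"], r["arrival"]),
}

for name, key in POLICIES.items():
    order = [r["id"] for r in sorted(requests, key=key)]
    print(f"{name:8s} -> {order}")
# fcfs     -> ['R1', 'R2', 'VIP']
# priority -> ['VIP', 'R1', 'R2']
# sjf      -> ['R2', 'VIP', 'R1']
```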

Preemption Mechanisms

When GPU memory is insufficient, the Scheduler must select one or more Running requests for preemption, releasing the GPU memory occupied by their KV Caches. There are two strategies: Swap, which copies the victim's KV Cache to CPU memory over PCIe and copies it back on restoration, and Recompute, which simply discards the KV Cache and re-runs prefill when the request is rescheduled.

Figure: Preemption triggered by memory pressure. Running request A holds a 2 GB KV Cache (20 blocks) and has generated 80% of its tokens when a new VIP request B arrives needing 3 GB of KV Cache. The GPU is out of memory, so the Scheduler must preempt A without losing its work, choosing between Swap (move A's KV Cache to CPU) and Recompute (discard it and recompute later).

How to choose? vLLM’s default policy is: if a request has already generated many tokens (large KV Cache), prefer Swap (high transfer cost but no wasted computation); if the request just started (small KV Cache), prefer Recompute (low recomputation cost and avoids PCIe transfer).
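
A hedged sketch of that rule, with a made-up block-count threshold standing in for whatever cost model a real engine uses:

```python
def choose_preemption_mode(num_kv_blocks: int, recompute_threshold: int = 16) -> str:
    """Illustrative version of the heuristic described above (threshold is a placeholder).

    A small KV Cache is cheap to recompute and avoids a PCIe round-trip;
    a large KV Cache represents a lot of already-spent compute, so it is
    swapped to CPU memory instead of being thrown away.
    """
    return "recompute" if num_kv_blocks <= recompute_threshold else "swap"

print(choose_preemption_mode(num_kv_blocks=4))   # "recompute": request just started
print(choose_preemption_mode(num_kv_blocks=80))  # "swap": many tokens already generated
```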

Chunked Prefill

Long prompts introduce another problem: the prefill phase needs to process thousands of tokens at once, monopolizing the GPU during this time and blocking all decode requests. Users experience this directly as “stuttering” — a streaming response that was flowing smoothly suddenly freezes.

Chunked Prefill (Sarathi, 2023) solves this by splitting long prompts into fixed-size chunks, with each chunk occupying only one iteration, leaving the remaining iterations for decode requests:

Figure: Chunked Prefill scheduling of a long prompt. Without chunking, a new request's long prefill monopolizes the GPU for eight iterations, blocking Decode A and Decode B, inflating TTFT, and causing user-visible stutter. With chunking, prefill chunks P1 through P4 alternate with decode steps, so decode requests are never blocked.

Sarathi-Serve further optimizes by piggybacking prefill chunks and decode requests within each iteration — leveraging the complementary nature of prefill (compute-bound) and decode (memory-bound). Mixing both simultaneously maximizes utilization of both GPU compute units and memory bandwidth.
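
The mechanism can be sketched as a per-iteration token budget that decode requests claim first, with the next prefill chunk piggybacked into whatever budget remains. The function and field names below are illustrative, not Sarathi's or vLLM's API:

```python
def plan_iteration(decode_ids, prefill_req, token_budget: int = 512, chunk_size: int = 256):
    """Plan one iteration under a fixed token budget (illustrative sketch).

    Decode requests are admitted first (one token each, latency-sensitive),
    then the leftover budget is filled with the next chunk of a pending long
    prefill, so prefill and decode share the same batch.
    """
    batch, budget = [], token_budget

    # Each decode request contributes exactly one new token this iteration.
    for req_id in decode_ids:
        if budget == 0:
            break
        batch.append((req_id, 1))
        budget -= 1

    # Piggyback the next prefill chunk into the remaining budget.
    if prefill_req is not None and budget > 0:
        remaining = prefill_req["prompt_len"] - prefill_req["prefilled"]
        take = min(chunk_size, remaining, budget)
        if take > 0:
            batch.append((prefill_req["id"], take))
            prefill_req["prefilled"] += take

    return batch

long_prompt = {"id": "P", "prompt_len": 1000, "prefilled": 0}
print(plan_iteration(["A", "B"], long_prompt))  # [('A', 1), ('B', 1), ('P', 256)]
```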

The Scheduling Trade-off

Throughput, latency, and fairness — you cannot optimize all three at once. The Scheduler’s configuration is essentially choosing a position within this triangle:

Figure: the throughput / latency / fairness trade-off triangle. For example, a throughput-focused setup uses a large batch size plus delayed preemption to maximize GPU utilization for offline batch workloads (illustrative scores: throughput 95, latency 40, fairness 50).

In practice, these parameters are typically configurable:

  • max_num_seqs: maximum number of concurrent requests (higher → better throughput, lower → lower latency)
  • enable_chunked_prefill: whether to enable chunked prefill (on → lower latency, off → higher throughput)
  • preemption_mode: preemption strategy (swap / recompute)
  • scheduling_policy: scheduling algorithm selection
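
As an illustration only (a hypothetical config object, not any real engine's API), these knobs might be bundled and tuned toward different corners of the triangle like this:

```python
from dataclasses import dataclass

@dataclass
class SchedulerConfig:
    """Hypothetical config bundling the knobs above."""
    max_num_seqs: int = 256              # concurrency cap: larger batches favor throughput
    enable_chunked_prefill: bool = True  # split long prompts to protect decode latency
    preemption_mode: str = "recompute"   # "swap" or "recompute"
    scheduling_policy: str = "fcfs"      # "fcfs", "priority", "sjf", ...

# Two positions in the trade-off triangle:
offline_batch = SchedulerConfig(max_num_seqs=512, enable_chunked_prefill=False)
interactive = SchedulerConfig(max_num_seqs=64, enable_chunked_prefill=True,
                              scheduling_policy="priority")
```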

Summary

The Scheduler is the most “engineering-heavy” part of an inference engine — there is no one-size-fits-all optimal solution, and tuning is required for each scenario. But the core principles are clear: a state machine manages request lifecycles, preemption handles resource shortages, chunked prefill eliminates long-prompt blocking, and the trade-off triangle guides configuration choices.

Further Reading