LLM Inference Engine Landscape: vLLM, SGLang, Ollama, and TensorRT-LLM
Updated 2026-04-06
Why Inference Engines Are Needed
Running LLM inference directly with Hugging Face transformers' model.generate() hits three critical bottlenecks:
- Memory waste: the KV cache is pre-allocated for the full max_seq_len of every request, while actual usage is often less than half
- No concurrency: One request occupies the entire GPU, forcing others to queue
- Low throughput: Without batching optimizations, the GPU’s compute capacity sits idle most of the time
The core mission of an inference engine is to solve these three problems: efficient memory management, intelligent request scheduling, and maximizing GPU utilization.
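To make the baseline concrete, here is a minimal sketch of the naive approach: serving requests strictly one after another with transformers (the model name is illustrative, and real serving code would add device placement and error handling):

```python
# Naive baseline: one request at a time with Hugging Face transformers.
# Model name is illustrative; no batching, scheduling, or KV reuse.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["Explain KV cache in one sentence.", "What is batching?"]
for prompt in prompts:  # every other request waits while one runs
    inputs = tokenizer(prompt, return_tensors="pt")
    # generate() handles a single request end-to-end; the GPU is mostly
    # idle during token-by-token decoding of this one sequence.
    output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```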
The Four Major Engines and Their Design Philosophies
The four mainstream LLM inference engines each have distinct focuses:
vLLM (UC Berkeley, 2023): Founded on PagedAttention, its core goal is maximizing serving throughput. It borrows OS virtual memory concepts to manage KV Cache, eliminating memory fragmentation. It has the most mature ecosystem and the largest community, and its OpenAI-compatible API makes it the default choice for cloud deployment.
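To make this concrete, a minimal sketch of calling a local vLLM server through its OpenAI-compatible endpoint (model name and port are assumptions based on vLLM's defaults; adjust for your deployment):

```python
# A minimal sketch against vLLM's OpenAI-compatible API. Start the
# server first (model name is illustrative):
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

# A local vLLM server accepts any non-empty API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one line."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```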
SGLang (LMSYS, 2023): Emphasizes the combination of programmability and high performance. RadixAttention provides more flexible prefix caching than vLLM, its original DSL programming model supports complex multi-step inference pipelines, and Compressed FSM delivers the fastest structured output. Ideal for complex LLM applications that require precise format control.
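For a taste of the DSL, a hedged sketch combining a generation step with regex-constrained decoding (endpoint, port, and function names are illustrative; the exact API may differ across SGLang versions):

```python
# A hedged sketch of SGLang's frontend DSL: a generation step whose
# output is constrained by a regex (compiled into an FSM server-side).
import sglang as sgl

@sgl.function
def classify(s, text):
    s += "Text: " + text + "\n"
    s += "Sentiment: " + sgl.gen(
        "label", regex="(positive|negative|neutral)", max_tokens=4
    )

# Point the frontend at a running SGLang server (default port 30000).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = classify.run(text="The latency improvements are impressive.")
print(state["label"])
```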
Ollama + llama.cpp: Local-first, ease-of-use-first. Install and run with a single command. The GGUF quantization format supports CPU and consumer-grade GPUs. It trades peak throughput for an out-of-the-box experience, making it the go-to for individual developers and local experimentation.
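A minimal sketch of the local workflow: after `ollama pull llama3`, the daemon exposes an HTTP API on port 11434 that the standard library can call directly (model name is illustrative):

```python
# Calling a local Ollama daemon's HTTP API with only the standard
# library. Assumes `ollama pull llama3` has been run and ollama is up.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3",
    "prompt": "Why is GGUF quantization useful on CPUs?",
    "stream": False,  # return one JSON object instead of a token stream
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```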
TensorRT-LLM (NVIDIA): Deep integration with the NVIDIA hardware ecosystem. FP8 quantization, inflight batching, and custom kernels squeeze every last drop of performance on H100/B200. The trade-off is low flexibility, a steep learning curve, and support limited to NVIDIA GPUs.
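For comparison, a hedged sketch of TensorRT-LLM's high-level Python LLM API (present in recent releases; the model name is illustrative and engine compilation details are omitted, so verify against your installed version):

```python
# A hedged sketch of TensorRT-LLM's Python LLM API; check your version,
# as the entry point has evolved across releases.
from tensorrt_llm import LLM, SamplingParams

# Model name is illustrative; TensorRT-LLM builds an optimized engine
# for the target GPU before serving requests.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What does inflight batching do?"], params)
for out in outputs:
    print(out.outputs[0].text)
```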
The design philosophies of these four engines can be understood as a triangle: throughput, programmability, and ease of use — no engine can achieve the ultimate in all three dimensions simultaneously.
Request Processing Flow Comparison
The request processing flows of three of these engines reflect their respective design priorities:
vLLM’s flow centers on the Scheduler, with all optimizations focused on “fitting more requests at the same time.” Ollama’s flow is the shortest and most direct — a single-request model suited for interactive use. SGLang’s flow adds two extra stages: IR orchestration and constrained decoding — it optimizes not just inference speed, but also “how programmers use LLMs.”
Key Technologies Overview
The performance differences between these engines stem from underlying key technical innovations. We’ll build a high-level understanding here, with subsequent articles diving deep into each:
| Technology | Core Idea | Pioneered By | Deep Dive Article |
|---|---|---|---|
| PagedAttention | Virtual memory paging for KV Cache | vLLM | Next article |
| Continuous Batching | Release on completion, dynamically fill new requests | Orca | Next article |
| RadixAttention | Radix tree for prefix cache management | SGLang | Prefix Caching |
| Constrained Decoding | FSM constraints + jump-forward acceleration | SGLang | SGLang Programming Model |
| Chunked Prefill | Split long prompts into chunks for mixed scheduling | Sarathi | Scheduling and Preemption |
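As a preview of the table's first row, here is a toy sketch of the PagedAttention bookkeeping idea: the KV cache lives in fixed-size blocks, and each sequence keeps a page table of physical block indices (pure illustration, not vLLM's actual implementation):

```python
# Toy PagedAttention bookkeeping: blocks are allocated on demand from a
# shared pool, so no request reserves max_seq_len worth of KV memory.
BLOCK_SIZE = 16  # tokens per KV block

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # shared physical block pool
        self.tables = {}                     # seq_id -> list of block ids

    def append_token(self, seq_id, position):
        table = self.tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:       # current block is full:
            table.append(self.free.pop())    # map any free block; no
                                             # contiguous allocation needed
        return table[-1], position % BLOCK_SIZE  # where to write this KV

mgr = BlockManager(num_blocks=64)
for pos in range(40):                        # a 40-token sequence...
    block, slot = mgr.append_token("req-1", pos)
print(mgr.tables["req-1"])                   # ...occupies only 3 blocks
```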
Static vs. continuous batching is fundamental to understanding all of these engines: static batching must wait for the slowest request in a batch to finish, leaving the GPU largely idle toward the end, while continuous batching releases each request the moment it completes and immediately backfills the freed slot, as the toy simulation below illustrates:
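(Plain Python, with token counts standing in for decode steps; this is not any engine's real scheduler.)

```python
# Toy continuous batching: finished requests free their slot at once,
# and a waiting request is pulled in on the very next step.
from collections import deque

def continuous_batching(requests, max_batch=4):
    waiting = deque(requests)   # (request_id, tokens_to_generate)
    active, steps = {}, 0
    while waiting or active:
        while waiting and len(active) < max_batch:  # backfill freed slots
            rid, remaining = waiting.popleft()
            active[rid] = remaining
        for rid in list(active):                    # one decode step
            active[rid] -= 1
            if active[rid] == 0:                    # release on completion
                del active[rid]
        steps += 1
    return steps

reqs = [("a", 20), ("b", 3), ("c", 5), ("d", 4), ("e", 6), ("f", 2)]
# Prints 20. Static batching in groups of 4 would take 20 + 6 = 26
# steps, because each batch waits for its slowest request.
print(continuous_batching(reqs))
```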
Technology Evolution Timeline
From Orca pioneering continuous batching in 2022 to today, the inference engine field has seen explosive innovation. The engines have moved from independent invention to absorbing one another's ideas: vLLM added prefix caching, SGLang optimized batch scheduling, and TensorRT-LLM likewise adopted PagedAttention.
Selection Guide
Not sure which one to choose? A few quick questions usually narrow it down: running locally on a laptop or consumer GPU points to Ollama; pipelines that need structured output or multi-step orchestration point to SGLang; squeezing peak performance out of NVIDIA data-center GPUs points to TensorRT-LLM; for general-purpose cloud serving, vLLM is the default.
Of course, this is only a rough guide. Actual selection also needs to weigh model size, request patterns (long vs. short context), SLA requirements, team tech stack, hardware budget, and other factors. The safest strategy is to start with vLLM (the most mature ecosystem) and evaluate SGLang (structured output) or TensorRT-LLM (peak performance) when you hit bottlenecks.
Recommended Learning Resources
If you want a deeper understanding of LLM inference engines, here are our curated resources:
Classic Papers
- Kwon et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention” — The vLLM core paper (SOSP 2023), proposing the PagedAttention algorithm that borrows OS virtual memory and paging mechanisms to manage KV Cache, achieving 2-4x throughput improvement.
- Zheng et al. “SGLang: Efficient Execution of Structured Language Model Programs” — The SGLang paper, proposing RadixAttention for KV Cache prefix reuse and compressed finite state machines for accelerating structured output decoding, achieving up to 6.4x throughput improvement across various tasks.
Blogs and Tutorials (with illustrations)
- vLLM official blog post “Easy, Fast, and Cheap LLM Serving with PagedAttention” — Uses diagrams to intuitively demonstrate how PagedAttention works and its performance comparisons — a more digestible introductory version than the paper. (vllm.ai/blog/vllm)
- LMSYS “Fast and Expressive LLM Inference with RadixAttention and SGLang” — The SGLang official blog post, explaining the RadixAttention KV Cache reuse mechanism and frontend DSL design, with architecture diagrams and performance comparisons. (lmsys.org/blog/2024-01-17-sglang/)
- Anyscale “How continuous batching enables 23x throughput in LLM inference” — A detailed explanation of the continuous batching mechanism (used by both vLLM and SGLang), comparing throughput against static batching and benchmarking multiple frameworks.
Official Documentation and Source Code
- vLLM GitHub Repository & Documentation — Covers PagedAttention, continuous batching, CUDA graph, various quantization methods, multi-node distributed inference, and more. (github.com/vllm-project/vllm / docs.vllm.ai)
- SGLang GitHub Repository & Documentation — Includes RadixAttention, prefill-decode disaggregation, speculative decoding, various parallelism strategies. Supports NVIDIA/AMD/Intel/TPU multi-hardware backends. (github.com/sgl-project/sglang / docs.sglang.ai)
Summary
Inference engines are the critical infrastructure for LLMs to go from “can run” to “can be used.” Understanding their design philosophies and core technologies is essential knowledge for LLM engineering. Next, we will dive into each key technology: starting with PagedAttention’s memory management, progressively understanding the complete technology stack of modern inference engines.
Further Reading
- Want to dive into KV Cache memory management? Read PagedAttention and Continuous Batching
- Want to learn about scheduling strategies? Read Scheduling and Preemption
- Want to learn about prefix caching? Read Prefix Caching and RadixAttention
- Want to learn about structured output? Read SGLang Programming Model