SGLang Programming Model and Structured Output
Updated 2026-04-10
Why a Programming Model Is Needed
Real-world LLM applications are rarely as simple as “input a prompt, output text.” A typical RAG pipeline might look like:
- User asks a question → LLM generates a search query
- Retrieve documents → concatenate into context
- LLM generates an answer based on the context
- Self-check the answer (self-consistency: sample the same question 3 times and take the majority)
The traditional approach is to make multiple API calls, manually concatenating context each time. The problems:
- Redundant computation: each call reprocesses the entire prompt (including prefixes already computed)
- Sequential waiting: the three samples in step 4 are independent and could run in parallel, yet naive client code issues them one at a time
- Unreliable formatting: you ask the LLM to output JSON, but it frequently produces malformed output
SGLang’s core insight: if the inference engine can understand the application’s execution logic, it can optimize the entire pipeline end-to-end.
SGLang DSL Core Primitives
SGLang defines a concise set of primitives to describe LLM programs:
| Primitive | Purpose | Example |
|---|---|---|
| `gen` | Generate text | `s += sgl.gen("analysis", max_tokens=100)` |
| `select` | Choose from options | `s += sgl.select("judgment", choices=["positive", "negative"])` |
| `fork` | Parallel branching | `s1, s2 = s.fork(2)` |
| `join` | Merge branches | `result = sgl.join([s1, s2])` |
| `append` | Concatenate context | `s += "additional info..."` |
These primitives look simple, but when combined they can express complex inference workflows:
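For example, the self-consistency step from the RAG pipeline above can be expressed as a single program. A minimal sketch, assuming the Python frontend where `fork` is a method on the prompt state (exact signatures may vary across SGLang versions):

```python
import sglang as sgl

@sgl.function
def self_consistency(s, question):
    s += "Question: " + question + "\nAnswer briefly.\n"
    forks = s.fork(3)  # three branches, all sharing the prompt prefix
    for f in forks:
        f += "Answer: " + sgl.gen("answer", max_tokens=64, temperature=0.7)
    # majority vote over the three sampled answers
    answers = [f["answer"].strip() for f in forks]
    s += "Final answer: " + max(set(answers), key=answers.count)

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = self_consistency.run(question="What is the capital of France?")
print(state.text())  # full transcript, including the final answer
```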
The key point is that this is not just syntactic sugar. SGLang’s execution engine analyzes the structure of the DSL program and automatically applies optimizations:
- The two branches of fork share a prefix → RadixAttention automatically reuses the KV Cache
- All candidates of select share a prefix → batched processing
- append does not trigger recomputation → directly appends to the existing KV Cache
The Problem Constrained Decoding Solves
Getting LLMs to output structured data (such as JSON) is one of the most common requirements. But unconstrained generation has many issues:
```
// Expected output
{"name": "Alice", "age": 30}

// What the LLM might actually generate
{"name": Alice, "age": "thirty"}   // missing quotes, type error
{"name": "Alice" "age": 30}        // missing comma
{"name": "Alice"}                  // missing required field
```
Prompt engineering can mitigate but never eliminate this problem — the LLM’s token sampling process is inherently probabilistic, with no structural constraints.
FSM-Guided Generation
SGLang’s solution is to introduce Finite State Machine (FSM) constraints during the token sampling phase:
- JSON Schema → Regular Expression: convert the user-defined output format into a regular expression
- Regex → FSM: compile into a finite state machine, where each state corresponds to a position in the parser
- FSM-guided sampling: at each step, compute the set of valid tokens based on the current FSM state, setting the logits of invalid tokens to -∞
The result: 100% format compliance. No matter how “creative” the LLM gets, the FSM mask ensures every output token conforms to the target schema.
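Mechanically, step 3 is just a mask applied to the logits before sampling. A minimal numpy sketch (SGLang's real implementation works on batched GPU tensors, but the idea is identical):

```python
import numpy as np

def constrained_sample(logits: np.ndarray, allowed_token_ids: list[int]) -> int:
    """Sample the next token with every token outside the FSM's valid set masked out."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_token_ids] = logits[allowed_token_ids]
    # softmax over masked logits: invalid tokens get probability exactly 0
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))
```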
Token Mask Generation Process
Let’s look at how the FSM determines valid tokens at each step:
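As a toy illustration, take the regex `(Paris|London)` with a hypothetical character-level vocabulary (real systems work on BPE tokens, which makes table construction harder, but the principle is the same):

```python
# Precompiled transition table: state -> {valid token: next state}
FSM = {
    0: {"P": 1, "L": 4},
    1: {"a": 2}, 2: {"r": 3}, 3: {"i": 7}, 7: {"s": "ACCEPT"},
    4: {"o": 5}, 5: {"n": 6}, 6: {"d": 8}, 8: {"o": 9}, 9: {"n": "ACCEPT"},
}

state = 0
for step, token in enumerate("London"):
    allowed = set(FSM[state])  # O(1) lookup of the valid-token set
    assert token in allowed, f"token {token!r} violates the schema"
    print(f"step {step}: state={state}, allowed={sorted(allowed)}, chose {token!r}")
    state = FSM[state][token]
```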
The entire process is pre-compiled: the FSM is built when the schema is first loaded, and at runtime it only needs an O(1) table lookup to retrieve the set of valid tokens for the current state, adding virtually no inference latency.
Jump-Forward Optimization
While FSM constraints guarantee correctness, each token still requires a full LLM forward pass — even when the output at certain positions is deterministic (such as JSON structural characters like {, ", :).
SGLang’s Jump-Forward optimization identifies deterministic states in the FSM (states with only one outgoing edge) and skips these positions entirely, bypassing the LLM:
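In FSM terms, this is a loop that greedily consumes single-edge transitions before handing control back to the sampler. A sketch reusing the toy character-level FSM from the previous section:

```python
def jump_forward(fsm: dict, state):
    """Consume deterministic transitions (states with exactly one outgoing edge)
    without any LLM forward pass. Returns the forced text and the state where
    real sampling must resume."""
    forced = []
    while state != "ACCEPT" and len(fsm[state]) == 1:
        (token, next_state), = fsm[state].items()
        forced.append(token)
        state = next_state
    return "".join(forced), state

# After the model samples "P" (state 0 -> 1), the rest is fully determined:
# jump_forward(FSM, 1) returns ("aris", "ACCEPT") with zero forward passes.
```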
For structured output, deterministic fragments typically account for 40-70% of the output tokens (most JSON structural characters are deterministic), so Jump-Forward can skip a large number of forward passes, achieving 2-5x speedup.
Performance Comparison
Weighing correctness against speed, the key trade-offs are:
- Unconstrained is fastest but unreliable — suitable for scenarios with relaxed formatting requirements
- FSM-guided achieves near-perfect accuracy but is slightly slower — token mask computation and vocabulary scanning have overhead
- FSM + Jump-Forward balances correctness and speed — this is SGLang’s recommended default
How to Use in Practice
The preceding sections explained the principles behind DSL primitives and FSM-constrained decoding. But how do you actually use these capabilities? SGLang uses a frontend-backend separation architecture: the server loads a model and handles inference, while clients send requests via HTTP APIs. Constraints are per-request — no config files needed; each request specifies its own schema.
Starting the Server
```bash
# Launch the SGLang inference server with a specific model
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 30000 \
  --grammar-backend xgrammar  # Grammar backend: xgrammar (default), outlines, or llguidance
```
Once started, the server exposes OpenAI-compatible endpoints (/v1/chat/completions) and SGLang native endpoints (/generate). All FSM compilation and Jump-Forward optimizations are applied automatically on the server side.
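For example, a regex constraint through the native endpoint looks like this (request fields follow SGLang's `/generate` API; check them against your server version):

```bash
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
        "text": "The capital of France is ",
        "sampling_params": {"max_new_tokens": 8, "regex": "(Paris|London|Berlin)"}
      }'
```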
Option 1: OpenAI-Compatible API
The most common approach. Use the standard OpenAI SDK, passing structured output constraints via response_format or extra_body:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

# JSON Schema constraint — via standard OpenAI response_format field
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Capital of France in JSON format."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capital_info",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "population": {"type": "integer"},
                },
                "required": ["name", "population"],
            },
        },
    },
)

# Regex constraint — via extra_body (SGLang extension)
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={"regex": "(Paris|London|Berlin)"},
)

# EBNF grammar — via extra_body (SGLang extension)
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "European city info"}],
    extra_body={"ebnf": """root ::= city " is the capital of " country
city ::= "Paris" | "London" | "Berlin"
country ::= "France" | "England" | "Germany" """},
)
```
The three constraint types are mutually exclusive — each request can specify only one of json_schema, regex, or ebnf.
Option 2: Native Offline Engine
No HTTP involved — load the model directly in the same Python process for batch offline inference:
```python
import sglang as sgl
import json
from pydantic import BaseModel, Field

class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$")
    population: int = Field(...)

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
outputs = llm.generate(
    ["Tell me about the capital of France in JSON."],
    {"temperature": 0, "json_schema": json.dumps(CapitalInfo.model_json_schema())},
)
```
Constraint parameters (json_schema, regex, ebnf) go directly into the sampling_params dictionary, with identical behavior to the API approach.
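Because decoding is schema-constrained, the result can typically be parsed straight back into the Pydantic model. A short follow-up, assuming (as in SGLang's offline examples) that `generate` returns a list of dicts with a `"text"` field:

```python
# Validate the constrained output against the original model
info = CapitalInfo.model_validate_json(outputs[0]["text"])
print(info.name, info.population)
```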
Comparison
| Approach | Communication | Best For | How Constraints Are Passed |
|---|---|---|---|
| OpenAI-Compatible API | HTTP (`/v1/chat/completions`) | Online services, migrating existing OpenAI code | `response_format` / `extra_body` |
| Native Engine | In-process calls | Batch offline inference | `sampling_params` dict |
| DSL Frontend (`@sgl.function`) | HTTP + session management | Complex multi-step orchestration (fork/join) | Inline in Python code |
The first two options cover the vast majority of use cases. The DSL Frontend is mainly for complex orchestration requiring fork/join prefix reuse; for everyday structured output, the OpenAI API is sufficient.
Summary
SGLang’s programming model addresses two core problems:
- Execution efficiency: through DSL primitives (gen/select/fork/join), the engine understands the application logic and maximizes prefix reuse via RadixAttention
- Output reliability: through FSM-guided generation plus Jump-Forward optimization, it achieves 100% format compliance, and on heavily structured outputs can even decode faster than unconstrained generation
From a broader perspective, SGLang represents the evolution of LLM inference engines from “general-purpose API services” to “programmable inference platforms” — it is not just about generating text faster, but about enabling the inference engine to understand and optimize the execution flow of entire LLM applications.
Further Reading
- Want to review the fundamentals of prefix caching? Read Prefix Caching and RadixAttention
- Want a panoramic view of inference engines? Read Inference Engine Landscape