
Online Learning and Cost Optimization: Routers Need to Evolve Too

Updated 2026-04-06

In static routing strategies, we pre-train a classifier and then use it indefinitely. But the real world is dynamic: model API prices change, user query distributions drift, and new models are constantly released. Online Learning enables routers to continuously adapt to these changes in production, balancing Exploration and Exploitation to stay near the Pareto frontier of quality and cost.

Multi-armed Bandit: Balancing Exploration and Exploitation

Model selection can be framed as a Multi-armed Bandit (MAB) problem: each model is an “arm,” each routing decision is equivalent to “pulling” an arm, and the reward is defined as:

$$R = \text{quality} - \lambda \cdot \text{cost}$$

where $\lambda$ controls cost sensitivity. The core challenge is the exploration-exploitation tradeoff:

  • Exploitation: Select the currently known best model to maximize immediate reward
  • Exploration: Try other models to gather more information and potentially discover better strategies

Classic algorithms include:

  • ε-greedy: Explore randomly with probability $\epsilon$; select the current best arm with probability $1-\epsilon$. Simple but exploration-inefficient.
  • Upper Confidence Bound (UCB): Compute an upper confidence bound for each arm and select the one with the highest UCB. Balances expected reward and uncertainty.
  • Thompson Sampling: A Bayesian approach that samples from the posterior distribution, naturally balancing exploration and exploitation.
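To make the select/update loop concrete, here is a minimal Python sketch of Thompson Sampling over model arms, assuming Gaussian posteriors with unit observation noise; the model names, λ value, and the observed quality/cost numbers are made up for the example:

```python
import math
import random

class ThompsonRouter:
    """Thompson Sampling over model arms with Gaussian posteriors (sketch)."""

    def __init__(self, models, prior_mean=0.5, prior_var=1.0):
        self.stats = {m: {"n": 0, "mean": prior_mean, "var": prior_var}
                      for m in models}

    def select(self):
        # Sample a plausible mean reward for each arm from its posterior and
        # exploit the highest sample; exploration emerges because posterior
        # variance stays wide for rarely pulled arms.
        samples = {m: random.gauss(s["mean"], math.sqrt(s["var"]))
                   for m, s in self.stats.items()}
        return max(samples, key=samples.get)

    def update(self, model, reward):
        s = self.stats[model]
        s["n"] += 1
        s["mean"] += (reward - s["mean"]) / s["n"]   # incremental mean
        s["var"] = 1.0 / s["n"]                      # posterior variance ~ 1/n

LAMBDA = 0.3  # cost sensitivity (illustrative)
router = ThompsonRouter(["gpt-4o", "claude-sonnet", "llama-70b", "gpt-4o-mini"])
model = router.select()
quality, cost = 0.9, 0.6          # hypothetical observed outcome
router.update(model, quality - LAMBDA * cost)
```

ε-greedy and UCB fit the same select/update interface; only the body of `select` changes.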

In the routing setting, when each query arrives, the algorithm updates its value estimates for the models based on historical rewards, then selects an arm according to its exploration rule. The figure below shows the exploration behavior of the three algorithms under identical settings:

Figure: interactive multi-armed bandit exploration demo. Four arms (GPT-4o, Claude Sonnet, Llama-70B, GPT-4o-mini) with per-arm pull counts and value estimates; ε-greedy shown with ε = 20% (20% explore, 80% exploit) and a running cumulative net reward.

Over time, the algorithms gradually converge to the optimal policy, but at the cost of exploration overhead. Thompson Sampling typically performs best in Bandit problems because it allocates the exploration budget more intelligently.

RL Routing: Learning Routing Policies from Feedback

Reinforcement Learning (RL) models routing as a sequential decision problem:

  • State: Query features $s$ (embeddings, intent labels, context, etc.)
  • Action: Select a model $a \in \{m_1, m_2, \ldots, m_k\}$
  • Reward: $R = \text{quality}(s, a) - \lambda \cdot \text{cost}(a)$

The router learns a policy $\pi(a|s)$ that maximizes cumulative reward:

$$\max_\pi \; \mathbb{E}_{s \sim p(s),\, a \sim \pi(\cdot|s)}\big[R(s, a)\big]$$
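As a hedged sketch of what such a policy could look like: a linear softmax policy over query features, trained with a REINFORCE-style update on the observed reward. The feature dimension, learning rate, and embedding are illustrative assumptions, not any particular system's implementation:

```python
import numpy as np

class SoftmaxPolicyRouter:
    """Linear softmax policy pi(a|s) with a REINFORCE-style update (sketch)."""

    def __init__(self, n_models, feat_dim, lr=0.05):
        self.W = np.zeros((n_models, feat_dim))  # one weight row per model
        self.lr = lr

    def route(self, s):
        logits = self.W @ s
        p = np.exp(logits - logits.max())        # numerically stable softmax
        p /= p.sum()
        a = np.random.choice(len(p), p=p)        # sample an action from pi(.|s)
        return a, p

    def update(self, s, a, p, reward):
        # grad log pi(a|s) for a linear softmax: (one_hot(a) - p) outer s
        grad = -np.outer(p, s)
        grad[a] += s
        self.W += self.lr * reward * grad        # ascend expected reward

router = SoftmaxPolicyRouter(n_models=4, feat_dim=8)
s = np.random.randn(8)                 # hypothetical query embedding
a, p = router.route(s)
router.update(s, a, p, reward=0.7)     # reward = quality - lambda * cost
```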

RouteLLM (2024) proposes training routers with preference data: given a query $s$ and responses from two models, humans label the better one. Through preference learning, the router learns to select the model most likely to produce a high-quality response for each query, while accounting for cost constraints. The trained router can then be fine-tuned online, continuously learning from new preference feedback.
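A minimal sketch of the preference-learning idea (not RouteLLM's actual code): fit a logistic model of the probability that the stronger model's response wins for a given query, then send the query to the strong model only when that probability clears a threshold set by the cost budget. The threshold and training setup here are illustrative assumptions:

```python
import numpy as np

def train_preference_router(S, y, lr=0.1, epochs=200):
    """Fit w so sigmoid(s @ w) predicts P(strong model preferred | query s).

    S: (N, d) query feature matrix; y: (N,) labels, 1 if the strong model's
    response was preferred in the pairwise comparison, else 0.
    """
    w = np.zeros(S.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-S @ w))      # predicted win probability
        w += lr * S.T @ (y - p) / len(y)      # logistic log-likelihood ascent
    return w

def route(s, w, threshold=0.6):
    # Higher threshold -> strong model used less often -> lower cost.
    p_strong_wins = 1.0 / (1.0 + np.exp(-s @ w))
    return "strong" if p_strong_wins > threshold else "weak"
```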

The figure below shows how the RL routing reward signal evolves over time:

Figure: RL routing as a State → Action → Reward → Update loop, stepping through example queries (simple translation, code analysis, knowledge Q&A); e.g., the query "simple translation" maps to a low-complexity state. Key formula: R = Quality(a, q) − λ·Cost(a), where a large λ prefers cheap models and a small λ prefers high-quality models.

Reward signal variance is high in the early phase (exploration) and gradually decreases as the policy converges (exploitation). RL methods are more powerful than Bandits because they can capture complex state-action dependencies, but training costs are higher.

Pareto Frontier and Cost Constraints

In multi-objective optimization, the Pareto Frontier defines the set of models in the quality-cost space that cannot be strictly dominated: if model $A$ is no worse than model $B$ in both quality and cost, and strictly better in at least one dimension, then $A$ Pareto-dominates $B$. An ideal routing strategy should only switch among models on the Pareto frontier.
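Computing the frontier itself is simple: sort models by cost and keep each one that improves on the best quality seen so far. The cost and quality numbers below are illustrative placeholders, not measured benchmarks:

```python
# name -> (cost in $ per 1M tokens, quality score); illustrative numbers only
models = {
    "Phi-3-mini": (0.2, 0.55), "Llama-8B": (0.4, 0.62),
    "GPT-4o-mini": (0.6, 0.74), "Mixtral-8x7B": (1.8, 0.72),
    "Llama-70B": (2.4, 0.80), "Claude Sonnet": (3.0, 0.86),
    "GPT-4o": (5.0, 0.90),
}

def pareto_frontier(points):
    # Sweep by ascending cost; a model is on the frontier iff no cheaper
    # model has quality at least as high.
    frontier, best_quality = [], float("-inf")
    for name, (cost, quality) in sorted(points.items(), key=lambda kv: kv[1][0]):
        if quality > best_quality:
            frontier.append(name)
            best_quality = quality
    return frontier

print(pareto_frontier(models))
# With these numbers Mixtral-8x7B is dominated by GPT-4o-mini (cheaper and
# higher quality), so it is the only model off the frontier.
```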

The figure below shows a typical Pareto frontier:

Figure: Pareto frontier in the cost-quality plane for seven models (Phi-3-mini, Llama-8B, GPT-4o-mini, Mixtral-8x7B, Llama-70B, Claude Sonnet, GPT-4o). Six models lie on the frontier, Phi-3-mini → Llama-8B → GPT-4o-mini → Llama-70B → Claude Sonnet → GPT-4o, each offering the best quality at its cost; adding new models can change the frontier, and ParetoBandit (2026) builds cost-aware selection on top of it.

Models on the Pareto frontier represent different optimal choices; which one to pick depends on the user's cost budget $B$ or quality requirement $Q_{\min}$.

Dynamic Price Adaptation

In reality, API pricing is not fixed. Providers like OpenAI and Anthropic adjust prices based on demand or offer volume discounts. Dynamic Price Adaptation requires the router to monitor price changes in real time, recompute the Pareto frontier, and adjust routing decisions.
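One simple way to make this automatic is to keep the quality estimate and the cost term separate and recompute the reward at decision time from a live price feed, so a reprice shifts routing immediately without relearning quality. The price lookup below is a hypothetical stand-in for a real pricing API or config file:

```python
def current_price(model):
    # Hypothetical live price feed ($ per 1M tokens); in practice this would
    # poll the provider's pricing page or an internal config service.
    return {"gpt-4o": 30.0, "claude-sonnet": 15.0, "llama-70b": 4.0}[model]

def effective_reward(model, est_quality, lam=0.02):
    # Cost enters at decision time, not in the learned estimate, so a price
    # drop raises the model's effective reward on the very next query.
    return est_quality - lam * current_price(model)
```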

Figure: dynamic price adaptation over time steps, starting from GPT-4o at $30/M tokens. Example fallback rules: GPT-4o price > $35 → fall back to Claude Sonnet; Claude price > $17 → fall back to Llama-70B. Online learning (Bandit/RL) can discover such patterns automatically without manual threshold tuning; the self-hosted Llama-70B keeps a constant price, an advantage of self-hosted models.

The figure above simulates a price fluctuation scenario: when a particular model’s price drops, it may move from outside the Pareto frontier to inside it, and the router should immediately increase its usage share. Online learning algorithms (e.g., Bandits) naturally adapt to such changes because they continuously update reward estimates.

Batch vs. Query-level Routing

Traditional per-query routing independently selects a model for each query — simple and efficient, but it ignores global constraints:

  • GPU concurrency limits: Self-hosted models have maximum concurrency; overload causes latency spikes
  • Total budget constraints: Production systems typically have daily or monthly total cost budgets
  • Volume discounts: Some APIs offer batch call discounts that per-query routing cannot exploit

Batch-level routing treats routing as a global optimization problem: given a batch of queries $\{q_1, \ldots, q_n\}$ and constraints (budget $B$, concurrency limit $C$), find the assignment $q_i \mapsto m_{j(i)}$ that maximizes total quality:

$$\max \sum_{i=1}^n \text{quality}(q_i, m_{j(i)}) \quad \text{s.t.} \quad \sum_{i=1}^n \text{cost}(m_{j(i)}) \leq B, \quad \text{concurrency}(m_j) \leq C \;\; \forall j$$

Robust Batch-Level LLM Routing (2026) proposes using integer linear programming (ILP) or heuristic algorithms to solve batch routing, while accounting for model performance uncertainty (modeled with confidence intervals). Experiments show that batch routing can significantly improve quality over per-query routing under strict budget constraints.
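The sketch below solves a simplified version of the problem (budget constraint only; concurrency omitted) with a greedy upgrade heuristic rather than an exact ILP: start every query on the cheapest model, then repeatedly apply the single-query upgrade with the best quality gain per extra dollar until the budget is exhausted. The `quality[q][m]` predictions and `cost[m]` prices are assumed inputs, e.g. from a quality predictor and a price table:

```python
def batch_route(queries, models, quality, cost, budget):
    """Greedy budgeted assignment (sketch). Assumes the budget at least
    covers routing every query to the cheapest model."""
    cheapest = min(models, key=lambda m: cost[m])
    assign = {q: cheapest for q in queries}
    spent = cost[cheapest] * len(queries)
    while True:
        best = None  # (quality gain per extra dollar, query, model, extra cost)
        for q in queries:
            cur = assign[q]
            for m in models:
                dq = quality[q][m] - quality[q][cur]
                dc = cost[m] - cost[cur]
                if dq > 0 and dc > 0 and spent + dc <= budget:
                    if best is None or dq / dc > best[0]:
                        best = (dq / dc, q, m, dc)
        if best is None:          # no affordable upgrade left
            return assign, spent
        _, q, m, dc = best
        assign[q] = m             # apply the most cost-effective upgrade
        spent += dc
```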

Figure: query-level vs. batch-level routing on six example queries (three easy ones routed to Llama-8B, three hard ones to GPT-4). With query-level routing each query is decided independently: GPT-4 is used 3 times for a total cost of $93, quality is per-query optimal, but GPU concurrency limits and global cost constraints cannot be enforced.

The figure above compares the two routing approaches. Batch routing, through global coordination, achieves higher quality at the same budget, or lower cost at the same quality level. However, batch routing requires buffering queries (adding latency) and has higher computational overhead, making it best suited for offline or semi-online scenarios.

Summary

Online learning transforms routing from a “train once” system into one that continuously evolves:

  • Bandit algorithms provide a simple exploration-exploitation framework, well-suited for rapid adaptation to environmental changes
  • RL methods capture complex state-action dependencies but require more training data
  • Pareto optimization clarifies the multi-objective quality-cost tradeoff; dynamic price adaptation ensures the strategy remains optimal
  • Batch routing breaks through the limitations of per-query routing via global optimization, significantly improving efficiency under constrained scenarios

The core tradeoff is exploration cost vs. exploitation efficiency: exploration can discover better strategies but requires bearing the cost of trial and error. In production systems, conservative exploration strategies (small $\epsilon$ or high confidence thresholds) are typical, with validation in offline environments before deployment.

The next article explores Mixture of Agents and collaborative routing: instead of “picking one model,” have multiple models work together — through voting, chaining, or parallel generation — to boost overall performance.