ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
arXiv:2604.00136v1 Announce Type: new Abstract: Production LLM serving often relies on multi-model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade-off is non-stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that is the first to simultaneously enforce dollar-denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal-dual budget pacer enforces a per-request cost ceiling over an open-ended stream, replacing offline penalty tuning with closed-loop control. Geometric fo
View PDF HTML (experimental)
Abstract:Production LLM serving often relies on multi-model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade-off is non-stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that is the first to simultaneously enforce dollar-denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal-dual budget pacer enforces a per-request cost ceiling over an open-ended stream, replacing offline penalty tuning with closed-loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot-swap registry lets operators add or remove models at runtime, with a brief forced-exploration phase for each newcomer, after which UCB selection discovers its quality-cost niche from live traffic alone. We evaluate ParetoBandit across four deployment scenarios on 1,824 prompts routed through a three-model portfolio. Across seven budget ceilings, mean per-request cost never exceeds the target by more than 0.4%. When conditions shift, the system adapts: an order-of-magnitude price cut on the costliest model yields up to +0.071 quality lift, and a silent quality regression is detected and rerouted within budget. A cold-started model reaches meaningful adoption within ~142 steps without breaching the cost ceiling. The router discriminates rather than blindly adopting: expensive models are budget-gated and low-quality models rejected after bounded exploration. End-to-end routing latency is 9.8ms on CPU -- less than 0.4% of typical inference time -- with the routing decision itself taking just 22.5us.
Comments: 27 pages, 15 figures, 13 tables. Code available at this https URL
Subjects:
Machine Learning (cs.LG); Computation and Language (cs.CL)
MSC classes: 68T05, 62L05
ACM classes: I.2.6; I.2.11; C.4
Cite as: arXiv:2604.00136 [cs.LG]
(or arXiv:2604.00136v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2604.00136
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Annette Taberner-Miller [view email] [v1] Tue, 31 Mar 2026 18:41:53 UTC (6,181 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
modelannouncenew modelKnowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.





Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!