Off-Policy Learning with Limited Supply

arXiv, March 31, 2026

Authors: Koichi Tanaka, Ren Kishimoto, Bushun Kawagishi, Yusuke Narita, Yasuo Yamamoto, Nobuyuki Shimizu, Yuta Saito

Abstract: We study off-policy learning (OPL) in contextual bandits, which plays a key role in real-world applications such as recommendation systems and online advertising. Typical OPL in contextual bandits assumes an unconstrained environment in which a policy can select the same item an unlimited number of times. However, in many practical applications, including coupon allocation and e-commerce, supply is limited: budgets cap the number of coupons that can be distributed, and inventory caps the number of products that can be sold. In these settings, greedily selecting the item with the highest expected reward for the current user may deplete that item early, making it unavailable to future users who could generate higher expected rewards. As a result, OPL methods that are optimal in unconstrained settings may become suboptimal in limited supply settings. To address this issue, we provide a theoretical analysis showing that conventional greedy OPL approaches may fail to maximize policy performance, and we demonstrate that policies with superior performance must exist in limited supply settings. Based on this insight, we introduce a novel method called Off-Policy learning with Limited Supply (OPLS). Rather than simply selecting the item with the highest expected reward, OPLS focuses on items whose expected reward for the current user is high relative to their expected reward for other users, enabling more efficient allocation of items with limited supply. Our empirical results on both synthetic and real-world datasets show that OPLS outperforms existing OPL methods in contextual bandit problems with limited supply.
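The depletion effect described in the abstract can be illustrated with a toy allocation problem. The sketch below is not the authors' OPLS algorithm and involves no logged data or learning: it assumes the expected rewards are already known, and every name in it (`allocate`, `relative_scores`, the reward table `q`) is hypothetical. The "relative" score used here, an item's expected reward for a user minus that item's average expected reward across all users, is only one possible reading of "relatively higher expected rewards compared to the other users".

```python
def allocate(q, supply, score):
    """Serve users in arrival order; each gets the in-stock item with the highest score.

    q[u][a]     : expected reward of item a for user u (assumed known here)
    supply[a]   : number of units of item a available
    score[u][a] : ranking score used to pick an item for user u
    """
    remaining = list(supply)
    choices, total = [], 0.0
    for u, user_scores in enumerate(score):
        in_stock = [a for a, left in enumerate(remaining) if left > 0]
        if not in_stock:
            choices.append(None)  # nothing left to give this user
            continue
        a = max(in_stock, key=lambda i: user_scores[i])
        remaining[a] -= 1
        total += q[u][a]
        choices.append(a)
    return choices, total


def relative_scores(q):
    """Illustrative score: reward minus the item's mean reward over all users."""
    n_users, n_items = len(q), len(q[0])
    item_mean = [sum(q[u][a] for u in range(n_users)) / n_users
                 for a in range(n_items)]
    return [[q[u][a] - item_mean[a] for a in range(n_items)]
            for u in range(n_users)]


# Two users arrive in order; two items with one unit each (made-up numbers).
q = [[0.6, 0.5],   # user 0: slightly prefers item 0
     [0.9, 0.1]]   # user 1: strongly prefers item 0
supply = [1, 1]

greedy_choices, greedy_total = allocate(q, supply, score=q)
relative_choices, relative_total = allocate(q, supply, score=relative_scores(q))

print("greedy:  ", greedy_choices, round(greedy_total, 2))
print("relative:", relative_choices, round(relative_total, 2))
```

In this example the greedy rule gives item 0 to user 0, so user 1 is left with item 1 and the total expected reward is about 0.7. Ranking by the relative score instead reserves item 0 for user 1, for a total of about 1.4. The numbers are invented to make the effect visible and say nothing about the paper's experimental results.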

Comments: Published as a conference paper at WWW 2026

Subjects: Machine Learning (cs.LG)

Cite as: arXiv:2603.18702 [cs.LG] (or arXiv:2603.18702v2 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.18702

arXiv-issued DOI via DataCite

Submission history

From: Koichi Tanaka
[v1] Thu, 19 Mar 2026 10:01:39 UTC (1,809 KB)
[v2] Mon, 30 Mar 2026 02:04:51 UTC (1,813 KB)
