
Near-Optimal Primal-Dual Algorithm for Learning Linear Mixture CMDPs with Adversarial Rewards

arXiv, March 31, 2026

arXiv:2603.27884v1 (announce type: new)

Authors: Kihyun Yu, Seoungbin Bae, Dabeen Lee


Abstract: We study safe reinforcement learning in finite-horizon linear mixture constrained Markov decision processes (CMDPs) with adversarial rewards under full-information feedback and an unknown transition kernel. We propose a primal-dual policy optimization algorithm that achieves regret and constraint violation bounds of $\widetilde{O}(\sqrt{d^2 H^3 K})$ under mild conditions, where $d$ is the feature dimension, $H$ is the horizon, and $K$ is the number of episodes. To the best of our knowledge, this is the first provably efficient algorithm for linear mixture CMDPs with adversarial rewards. In particular, our regret bound is near-optimal, matching the known minimax lower bound up to logarithmic factors. The key idea is to introduce a regularized dual update that enables a drift-based analysis. This step is essential, as strong duality-based analysis cannot be directly applied when reward functions change across episodes. In addition, we extend weighted ridge regression-based parameter estimation to the constrained setting, allowing us to construct tighter confidence intervals that are crucial for deriving the near-optimal regret bound.
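The abstract names a regularized dual update as the key idea but does not give its form. The following is a minimal Python sketch of one generic way such an update can look: a projected dual step with a shrinkage term. It is an illustration under stated assumptions, not the paper's update rule. The function names, the constraint convention (utility value must be at least `threshold`), and the parameters `step_size` and `reg` are all hypothetical.

```python
def regularized_dual_update(lam, constraint_value, threshold, step_size, reg):
    """One projected dual step with shrinkage (hypothetical form, not the paper's).

    lam              -- current dual variable, assumed non-negative
    constraint_value -- estimated utility value of the current policy
    threshold        -- required minimum utility (constraint: value >= threshold)
    step_size, reg   -- hypothetical tuning parameters, with step_size * reg <= 1
    """
    # Positive when the constraint is violated, negative when it is satisfied.
    violation = threshold - constraint_value
    # Shrink lam toward zero (the regularization), step along the violation,
    # then project back onto the non-negative half-line.
    return max(0.0, (1.0 - step_size * reg) * lam + step_size * violation)


def run_dual_updates(constraint_values, threshold, step_size, reg):
    """Apply the update once per episode, starting from lam = 0."""
    lam = 0.0
    history = []
    for value in constraint_values:
        lam = regularized_dual_update(lam, value, threshold, step_size, reg)
        history.append(lam)
    return history


if __name__ == "__main__":
    # Two violating episodes followed by one satisfying episode.
    print(run_dual_updates([0.0, 0.0, 1.0], threshold=0.5, step_size=0.5, reg=0.1))
```

In this sketch, if the per-episode violation never exceeds some value `v` and `step_size * reg <= 1`, the dual variable never exceeds `v / reg`. A bounded dual variable of this kind is the sort of property a drift-style argument can exploit, although the paper's actual analysis may proceed differently.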

Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)

Cite as: arXiv:2603.27884 [cs.LG]

(or arXiv:2603.27884v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.27884

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Kihyun Yu
[v1] Sun, 29 Mar 2026 21:51:33 UTC (160 KB)

Original source

arXiv

https://arxiv.org/abs/2603.27884
