
Dense and Diverse Goal Coverage in Multi Goal Reinforcement Learning

arXiv · March 31, 2026

arXiv:2510.25311v2 (announce type: replace-cross)
Authors: Sagalpreet Singh, Rishi Saket, Aravindan Raghuveer


Abstract: Reinforcement Learning (RL) algorithms primarily focus on learning a policy that maximizes expected return. As a result, the learned policy can exploit one or a few reward sources. However, in many natural situations it is desirable to learn a policy that induces a dispersed marginal state distribution over rewarding states while maximizing the expected return, which is typically tied to reaching a goal state. This aspect remains relatively unexplored. Existing techniques based on entropy regularization and intrinsic rewards use stochasticity to encourage exploration in search of an optimal policy, which may not necessarily lead to a dispersed marginal state distribution over rewarding states. Other RL algorithms that match a target distribution assume the latter to be available a priori. This may be infeasible in large-scale systems where enumerating all states is not possible and a state is determined to be a goal state only upon reaching it. We formalize the problem of maximizing the expected return while uniformly visiting the goal states as Multi Goal RL, in which an oracle classifier over the state space determines the goal states. We propose a novel algorithm that learns a high-return policy mixture whose marginal state distribution is dispersed over the set of goal states. Our algorithm is based on optimizing a custom RL reward that is computed at each iteration, based on the current policy mixture, for a set of sampled trajectories. These trajectories are then used by an offline RL algorithm to update the policy mixture. We prove performance guarantees for our algorithm, showing efficient convergence bounds for optimizing a natural objective that captures both the expected return and the dispersion of the marginal state distribution over the goal states. We design and perform experiments on synthetic MDPs and standard RL environments to evaluate the effectiveness of our algorithm.
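To make the notion of a "dispersed marginal state distribution over goal states" concrete, the following is a minimal Python sketch, not the paper's algorithm. It estimates how often each goal state is visited in a batch of sampled trajectories, measures dispersion with Shannon entropy, and computes an illustrative reward that is larger for rarely visited goals. The function names, the `-log(p + eps)` reward form, the `eps` smoothing constant, and the toy trajectories are all assumptions made for illustration; the abstract does not specify the custom reward or the objective.

```python
import math
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable, Sequence

State = Hashable


def goal_visit_distribution(
    trajectories: Iterable[Sequence[State]],
    is_goal: Callable[[State], bool],
) -> Dict[State, float]:
    """Empirical distribution of visits over goal states only.

    `is_goal` plays the role of the oracle classifier: a state is known to be
    a goal only once it has been reached and queried.
    """
    counts = Counter(s for traj in trajectories for s in traj if is_goal(s))
    total = sum(counts.values())
    if total == 0:
        return {}  # no goal state was reached in this batch
    return {s: c / total for s, c in counts.items()}


def entropy(dist: Dict[State, float]) -> float:
    """Shannon entropy in nats; higher means more evenly spread visits."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def dispersion_reward(
    state: State,
    dist: Dict[State, float],
    is_goal: Callable[[State], bool],
    eps: float = 1e-3,
) -> float:
    """Hypothetical reward: zero off-goal, larger for goals visited rarely
    under the current visitation estimate (an assumed form, not the paper's)."""
    if not is_goal(state):
        return 0.0
    return -math.log(dist.get(state, 0.0) + eps)


# Toy batch: states are integers, and goals are {3, 4, 5} (assumed for the demo).
trajectories = [[0, 1, 3], [0, 1, 3], [0, 2, 4]]
is_goal = lambda s: s in {3, 4, 5}

dist = goal_visit_distribution(trajectories, is_goal)
print("goal visit distribution:", dist)
print("entropy (nats):", entropy(dist))
for goal in (3, 4, 5):
    print("reward for goal", goal, "=", dispersion_reward(goal, dist, is_goal))
```

In this toy batch, goal 3 is visited most often and goal 5 never, so the illustrative reward ranks 5 above 4 above 3. Recomputing the distribution after each policy update is one simple way to mimic a reward that depends on the current policy mixture.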

Comments: 27 pages, 6 figures

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2510.25311 [cs.LG]

(or arXiv:2510.25311v2 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2510.25311

arXiv-issued DOI via DataCite

Submission history

From: Sagalpreet Singh
[v1] Wed, 29 Oct 2025 09:23:21 UTC (353 KB)
[v2] Sat, 28 Mar 2026 14:07:59 UTC (413 KB)
