Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

arXiv · March 30, 2026
Authors: Zakaria Mhammedi, James Cohan

Abstract: The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before for the Adroit tasks.
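The idea of exploring by tree search instead of policy optimization can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes a hypothetical deterministic grid world, uses a simple visit-count bonus as a stand-in for the paper's epistemic uncertainty measure, and applies a basic "keep the most uncertain half, clone it over the rest" rule loosely in the spirit of Go-With-The-Winner. All names (`step`, `uncertainty`, `explore`) and parameters are illustrative assumptions.

```python
import random
from collections import Counter

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]
START = (0, 0)

def step(state, action, size):
    """Hypothetical deterministic grid world: move one cell, clipped to bounds."""
    x, y = state
    dx, dy = MOVES[action]
    return (min(max(x + dx, 0), size - 1), min(max(y + dy, 0), size - 1))

def uncertainty(state, counts):
    """Count-based stand-in for epistemic uncertainty (an assumption):
    rarely visited states score higher."""
    return 1.0 / (1.0 + counts[state]) ** 0.5

def explore(size=15, n_particles=16, rounds=60, rollout_len=5, seed=0):
    rng = random.Random(seed)
    counts = Counter({START: 1})
    # Each particle is (current state, action sequence that reaches it from START).
    particles = [(START, []) for _ in range(n_particles)]
    for _ in range(rounds):
        advanced = []
        for state, traj in particles:
            traj = list(traj)  # copy so cloned particles do not share a list
            for _ in range(rollout_len):
                action = rng.randrange(len(MOVES))  # random actions, no learned policy
                state = step(state, action, size)
                counts[state] += 1
                traj.append(action)
            advanced.append((state, traj))
        # Keep the most uncertain half and clone each survivor once.
        advanced.sort(key=lambda p: uncertainty(p[0], counts), reverse=True)
        winners = advanced[: n_particles // 2]
        particles = winners + winners
    return counts, particles

if __name__ == "__main__":
    counts, particles = explore()
    print(f"distinct states visited: {len(counts)} of {15 * 15}")
```

Because each particle stores the action sequence that reaches its state, the search returns replayable trajectories. In the paper's framing, trajectories of this kind are what a separate supervised stage would distill into a policy; that stage is not sketched here.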

Subjects: Machine Learning (cs.LG)

Cite as: arXiv:2603.22273 [cs.LG]

(or arXiv:2603.22273v3 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.22273

arXiv-issued DOI via DataCite

Submission history

From: Zakaria Mhammedi
[v1] Mon, 23 Mar 2026 17:56:52 UTC (3,793 KB)
[v2] Fri, 27 Mar 2026 17:44:46 UTC (3,796 KB)
[v3] Mon, 30 Mar 2026 17:14:06 UTC (3,796 KB)
