Maximum Entropy Behavior Exploration for Sim2Real Zero-Shot Reinforcement Learning

arXiv · March 26, 2026

Jiajun Hu, Nuria Armengol Urpi, Jin Cheng

Abstract: Zero-shot reinforcement learning (RL) algorithms aim to learn a family of policies from a reward-free dataset, and recover optimal policies for any reward function directly at test time. Naturally, the quality of the pretraining dataset determines the performance of the recovered policies across tasks. However, pre-collecting a relevant, diverse dataset without prior knowledge of the downstream tasks of interest remains a challenge. In this work, we study $\textit{online}$ zero-shot RL for quadrupedal control on real robotic systems, building upon the Forward-Backward (FB) algorithm. We observe that undirected exploration yields low-diversity data, leading to poor downstream performance and rendering policies impractical for direct hardware deployment. Therefore, we introduce FB-MEBE, an online zero-shot RL algorithm that combines an unsupervised behavior exploration strategy with a regularization critic. FB-MEBE promotes exploration by maximizing the entropy of the achieved behavior distribution. Additionally, a regularization critic shapes the recovered policies toward more natural and physically plausible behaviors. We empirically demonstrate that FB-MEBE achieves improved performance compared to other exploration strategies in a range of simulated downstream tasks, and that it renders natural policies that can be seamlessly deployed to hardware without further finetuning. Videos and code are available on our website.
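The abstract's claim that undirected exploration yields low-diversity data, while maximizing the entropy of the achieved behavior distribution promotes exploration, can be illustrated with a toy sketch. This is not the FB-MEBE implementation: the discretization of behaviors into bins, the biased sampler standing in for undirected exploration, and every name below (`shannon_entropy`, `undirected_choice`, `entropy_seeking_choice`, `run`) are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def shannon_entropy(counts, num_bins):
    """Entropy (in nats) of the empirical distribution over behavior bins."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for b in range(num_bins):
        p = counts.get(b, 0) / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy


def undirected_choice(rng, num_bins):
    """Hypothetical stand-in for undirected exploration: favors low-index bins."""
    weights = [1.0 / (b + 1) ** 2 for b in range(num_bins)]
    return rng.choices(range(num_bins), weights=weights, k=1)[0]


def entropy_seeking_choice(counts, num_bins):
    """Target the least-visited bin, which greedily raises the entropy."""
    return min(range(num_bins), key=lambda b: counts.get(b, 0))


def run(strategy, num_bins=8, steps=400, seed=0):
    """Collect `steps` achieved behaviors and return the resulting entropy."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(steps):
        if strategy == "undirected":
            b = undirected_choice(rng, num_bins)
        else:
            b = entropy_seeking_choice(counts, num_bins)
        counts[b] += 1
    return shannon_entropy(counts, num_bins)


h_undirected = run("undirected")
h_entropy_seeking = run("entropy_seeking")
max_entropy = math.log(8)  # upper bound: uniform distribution over 8 bins
```

In this toy setting the entropy-seeking rule spreads visits evenly across the bins and so reaches the uniform upper bound, while the biased sampler concentrates on a few behaviors. A real system would act in a continuous behavior space and would need a density or entropy estimator rather than bin counts; how FB-MEBE does this is not stated in the abstract.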

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.25464 [cs.LG]

(or arXiv:2603.25464v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.25464

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Núria Armengol Urpí
[v1] Thu, 26 Mar 2026 14:07:01 UTC (4,767 KB)
