
Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning

arXiv · March 26, 2026

Jai Bardhan, Patrik Drozdik, Josef Sivic


Abstract: Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to degrade rapidly. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state, reinforcing higher-fidelity predictions over lower-fidelity ones. Third, we develop efficient, multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level for a dense, low-variance training signal. Fourth, we show that our approach establishes a new state of the art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.
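The second and third contributions (comparing several candidate futures from one rollout state, scored by a clip-level multi-view reward) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the exact metrics, weights, or normalization, so everything here is an assumption: PSNR stands in for the perceptual metrics (the paper reports LPIPS and SSIM), views are weighted equally, and the function names `psnr`, `clip_reward`, and `rank_candidates` are hypothetical.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Per-frame PSNR in dB. ASSUMPTION: a simple stand-in for a perceptual metric."""
    mse = np.mean((pred - target) ** 2, axis=(-3, -2, -1))  # one value per frame
    return 10.0 * np.log10(max_val ** 2 / np.maximum(mse, 1e-10))

def clip_reward(pred_views, target_views):
    """Clip-level reward: mean over frames, then mean over camera views.

    pred_views / target_views map a view name to an array of shape (T, H, W, C)
    with values in [0, 1]. A candidate may be shorter than the reference, so the
    reference is truncated to the candidate's length (variable-length futures).
    """
    per_view = []
    for name, pred in pred_views.items():
        target = target_views[name][: len(pred)]
        per_view.append(psnr(pred, target).mean())  # average over the clip's frames
    return float(np.mean(per_view))  # ASSUMPTION: equal weight per view

def rank_candidates(candidates, target_views):
    """Score every candidate and center the scores within the group."""
    rewards = np.array([clip_reward(c, target_views) for c in candidates])
    advantages = rewards - rewards.mean()  # positive = better than the group average
    return rewards, advantages

# Toy data: two camera views, 8 reference frames of 16x16 RGB (hypothetical sizes).
rng = np.random.default_rng(0)
reference = {
    "external": rng.random((8, 16, 16, 3)),
    "wrist": rng.random((8, 16, 16, 3)),
}

def noisy_copy(views, sigma, length):
    """Simulate a candidate future as the reference plus Gaussian noise."""
    return {
        name: np.clip(v[:length] + rng.normal(0.0, sigma, v[:length].shape), 0.0, 1.0)
        for name, v in views.items()
    }

candidates = [noisy_copy(reference, 0.02, 8), noisy_copy(reference, 0.2, 6)]
rewards, advantages = rank_candidates(candidates, reference)
print(rewards, advantages)
```

In this toy setup the low-noise candidate receives the higher reward and a positive centered score, which is the kind of preference signal a contrastive objective could then reinforce. Averaging over frames before averaging over views keeps candidates of different lengths comparable.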

Comments: 34 pages, 11 figures, 12 tables

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.25685 [cs.RO] (or arXiv:2603.25685v1 [cs.RO] for this version)

https://doi.org/10.48550/arXiv.2603.25685

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jai Bardhan

[v1] Thu, 26 Mar 2026 17:36:08 UTC (5,756 KB)
