
ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

arXiv, March 30, 2026


Authors: Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu


Abstract: Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations (such as object penetration and anti-gravity motion) due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations. It employs a decoupled protocol to separately assess physical realism and action alignment. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.
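For readers unfamiliar with the "DPO" in the abstract, the following is a minimal sketch of the generic Direct Preference Optimization loss for a single preference pair, for example a physically plausible clip preferred over an implausible one. It is not the paper's objective: the abstract does not describe how the loss is adapted to video diffusion or how the decoupled discriminators enter it. The function name, argument names, and the default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_logp_preferred, policy_logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for one preference pair (illustrative, not the paper's exact objective)."""
    # Implicit reward of each sample: log-ratio of the trained policy to a frozen reference model.
    preferred_margin = policy_logp_preferred - ref_logp_preferred
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logit = beta * (preferred_margin - rejected_margin)
    # -log(sigmoid(logit)), computed stably as softplus(-logit).
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))
```

When the policy matches the reference on both samples the loss equals log 2, and it falls as the policy assigns relatively more probability to the preferred sample than the reference does.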

Comments: Code: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Cite as: arXiv:2603.23376 [cs.CV]

(or arXiv:2603.23376v2 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.23376

arXiv-issued DOI via DataCite

Submission history

From: Yuzhi Chen
[v1] Tue, 24 Mar 2026 16:07:09 UTC (13,213 KB)
[v2] Fri, 27 Mar 2026 09:50:16 UTC (36,900 KB)

Original source

arXiv

https://arxiv.org/abs/2603.23376
