
Language-Conditioned World Modeling for Visual Navigation

arXiv | Submitted on 23 Mar 2026 | March 31, 2026 | 2 min read

arXiv:2603.26741v1 Announce Type: cross

Authors: Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, Siyu Huang, Qi Dai, Zhi-Qi Cheng


Abstract: We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an actor-critic agent trained in the latent space of the world model. The second family, LCVN-Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language-conditioned world models. The code is available at this https URL.
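
The open-loop formulation in the abstract can be illustrated with a toy sketch: the policy is queried once, with only the initial observation and the instruction, and the full trajectory follows from its action sequence with no further observations. Everything below is an assumption made for illustration and is not taken from the paper's code: the names toy_policy, integrate, and predict_trajectory are hypothetical, the action format (forward speed, yaw rate) is assumed, and a keyword rule stands in for the learned models.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical types, chosen for illustration only.
Action = Tuple[float, float]        # (forward speed in m/s, yaw rate in rad/s)
Pose = Tuple[float, float, float]   # (x, y, heading in radians)
Policy = Callable[[Sequence[float], str, int], List[Action]]


def toy_policy(initial_obs: Sequence[float], instruction: str, horizon: int) -> List[Action]:
    """Stand-in for a learned model: picks a constant yaw rate from a keyword."""
    text = instruction.lower()
    if "left" in text:
        yaw_rate = 0.5
    elif "right" in text:
        yaw_rate = -0.5
    else:
        yaw_rate = 0.0
    return [(1.0, yaw_rate)] * horizon


def integrate(start: Pose, actions: Sequence[Action], dt: float = 0.1) -> List[Pose]:
    """Roll simple unicycle kinematics forward over a fixed action sequence."""
    x, y, heading = start
    poses = [start]
    for speed, yaw_rate in actions:
        heading += yaw_rate * dt
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        poses.append((x, y, heading))
    return poses


def predict_trajectory(policy: Policy, initial_obs: Sequence[float],
                       instruction: str, horizon: int = 20) -> List[Pose]:
    """Open loop: one policy query up front, no new observations during rollout."""
    actions = policy(initial_obs, instruction, horizon)
    return integrate((0.0, 0.0, 0.0), actions)


if __name__ == "__main__":
    trajectory = predict_trajectory(toy_policy, [0.0], "turn left at the door")
    print(len(trajectory), trajectory[-1])
```

A closed-loop agent would instead call the policy again after each step with a fresh observation; the open-loop setting removes that feedback, so the instruction and the first frame must carry all the information needed for the whole path.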

Comments: 19 pages, 6 figures, Code: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Cite as: arXiv:2603.26741 [cs.CV]

(or arXiv:2603.26741v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.26741

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Zhi-Qi Cheng
[v1] Mon, 23 Mar 2026 04:10:14 UTC (1,920 KB)
