
Before We Trust Them: Decision-Making Failures in Navigation of Foundation Models

arXiv, March 30, 2026

Authors: Jua Han, Jaeyoon Seo, Jungbin Min, Sieun Choi, Huichan Seo, Jihie Kim, Jean Oh


Abstract: High success rates on navigation-related tasks do not necessarily translate into reliable decision making by foundation models. To examine this gap, we evaluate current models on six diagnostic tasks spanning three settings: reasoning under complete spatial information, reasoning under incomplete spatial information, and reasoning under safety-relevant information. Our results show that important decision-making failures can persist even when overall performance is strong, underscoring the need for failure-focused analysis to understand model limitations and guide future progress. In a path-planning setting with unknown cells, GPT-5 achieved a high success rate of 93%, yet the remaining cases still included invalid paths. We also find that newer models are not always more reliable than their predecessors. In reasoning under safety-relevant information, Gemini-2.5 Flash achieved only 67% on the challenging emergency-evacuation task, underperforming Gemini-2.0 Flash, which reached 100% under the same condition. Across all evaluations, models exhibited structural collapse, hallucinated reasoning, constraint violations, and unsafe decisions. These findings show that foundation models still exhibit substantial failures in navigation-related decision making and require fine-grained evaluation before they can be trusted. Project page: this https URL
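To illustrate what a failure-focused check on a proposed path might look like, here is a minimal Python sketch. It is not the authors' evaluation code: the grid encoding ('.' free, '#' wall, '?' unknown), the 4-connected movement rule, the function name check_path, and the allow_unknown flag are all assumptions made for illustration. Rather than returning a single pass/fail bit, it lists every violation, which is the kind of detail an aggregate success rate hides.

```python
from typing import List, Tuple

Cell = Tuple[int, int]

# Hypothetical cell encoding, assumed for this sketch only.
FREE, WALL, UNKNOWN = ".", "#", "?"


def check_path(
    grid: List[str],
    path: List[Cell],
    start: Cell,
    goal: Cell,
    allow_unknown: bool = False,
) -> List[str]:
    """Return every violation found in `path`; an empty list means valid."""
    if not path:
        return ["empty path"]

    violations: List[str] = []
    if path[0] != start:
        violations.append(f"path starts at {path[0]}, expected {start}")
    if path[-1] != goal:
        violations.append(f"path ends at {path[-1]}, expected {goal}")

    rows, cols = len(grid), len(grid[0])
    for i, (r, c) in enumerate(path):
        # Bounds check first, so the grid lookups below are safe.
        if not (0 <= r < rows and 0 <= c < cols):
            violations.append(f"step {i}: {(r, c)} is outside the grid")
            continue
        if grid[r][c] == WALL:
            violations.append(f"step {i}: {(r, c)} is a wall")
        elif grid[r][c] == UNKNOWN and not allow_unknown:
            violations.append(f"step {i}: {(r, c)} is an unknown cell")
        # Each move must be exactly one step up, down, left, or right.
        if i > 0:
            pr, pc = path[i - 1]
            if abs(r - pr) + abs(c - pc) != 1:
                violations.append(
                    f"step {i}: move {(pr, pc)} -> {(r, c)} is not a single step"
                )
    return violations


if __name__ == "__main__":
    grid = ["..#", ".?.", "..."]
    proposed = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
    for problem in check_path(grid, proposed, start=(0, 0), goal=(2, 2)):
        print(problem)
```

Whether stepping into an unknown cell counts as a violation depends on the task definition, so the sketch exposes it as a flag instead of fixing one interpretation.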

Comments: Corrected author order in metadata; manuscript changed

Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)

Cite as: arXiv:2601.05529 [cs.AI] (or arXiv:2601.05529v4 [cs.AI] for this version)

DOI: https://doi.org/10.48550/arXiv.2601.05529 (arXiv-issued DOI via DataCite)

Submission history

From: Jua Han
[v1] Fri, 9 Jan 2026 05:04:15 UTC (4,548 KB)
[v2] Thu, 15 Jan 2026 05:09:03 UTC (4,548 KB)
[v3] Sun, 18 Jan 2026 11:03:44 UTC (4,548 KB)
[v4] Fri, 27 Mar 2026 04:51:48 UTC (2,440 KB)
