
From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

arXiv, March 31, 2026

arXiv:2603.26839v1, announce type: new. Author: Alberto G. Rodriguez Salgado


Abstract: How do multimodal models solve visual spatial tasks: through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710 to 22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2 to 12%; on 20×20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. MazeBench therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.
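Once a correct text grid is available, the second stage of the strategy described in the abstract amounts to an ordinary breadth-first search. The sketch below is an illustration only, not the paper's code: the grid encoding (`S` start, `G` goal, `#` wall, `.` open cell), the function name `solve_maze`, and the sample maze are all assumptions made for this example.

```python
from collections import deque


def solve_maze(grid):
    """Return a shortest path from 'S' to 'G' as a list of (row, col) cells.

    `grid` is a list of strings using a hypothetical encoding:
    'S' = start, 'G' = goal, '#' = wall, anything else = open cell.
    Returns None when the goal cannot be reached.
    """
    start = goal = None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "S":
                start = (r, c)
            elif ch == "G":
                goal = (r, c)
    if start is None or goal is None:
        raise ValueError("grid must contain both 'S' and 'G'")

    # parent doubles as the visited set and lets us rebuild the path.
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        # Expand the four orthogonal neighbours (no diagonal moves).
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


# Hypothetical 3x3 maze, not taken from MazeBench.
MAZE = [
    "S.#",
    ".##",
    "..G",
]

if __name__ == "__main__":
    print(solve_maze(MAZE))
```

A program performs this search in a few dozen lines; the abstract's point is that the evaluated models carry out the equivalent enumeration step by step in generated text, which is where the large token counts come from.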

Comments: 15 pages, 10 figures. Code and mazes available at this https URL

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.26839 [cs.LG]

(or arXiv:2603.26839v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.26839

arXiv-issued DOI via DataCite

Submission history

From: Alberto Gonzalo Rodriguez Salgado
[v1] Fri, 27 Mar 2026 08:10:05 UTC (4,287 KB)

Original source

arXiv

