From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
arXiv:2603.26839v1 Announce Type: new
Author: Alberto G. Rodriguez Salgado
Abstract: How do multimodal models solve visual spatial tasks: through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710 to 22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2 to 12%; on 20×20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. MazeBench therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.
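To make the second stage of the strategy described in the abstract concrete, here is a minimal Python sketch of breadth-first search over a text grid. It is an illustration only, not code from the paper: the grid encoding ("#" for walls, any other character for open cells), the example maze, and the function name bfs_shortest_path are all assumptions. The sketch presumes the image-to-grid translation has already produced a correct grid, which is the condition tested by the text-grid ablation.

```python
from collections import deque

# Hypothetical 3x3 maze: "#" is a wall, every other character is open.
# The "S" and "G" labels are cosmetic; start and goal are passed explicitly.
MAZE = [
    "S.#",
    "#.#",
    "#.G",
]


def bfs_shortest_path(grid, start, goal):
    """Return a shortest path as a list of (row, col) cells, or None."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal unreachable


path = bfs_shortest_path(MAZE, (0, 0), (2, 2))
```

Each dequeue-and-expand step here corresponds to one enumeration step of the kind a model would write out in prose, so the work grows with the number of open cells. That is consistent with the abstract's observation that larger mazes exhaust token limits.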
Comments: 15 pages, 10 figures. Code and mazes available at this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.26839 [cs.LG]
(or arXiv:2603.26839v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2603.26839
arXiv-issued DOI via DataCite
Submission history
From: Alberto Gonzalo Rodriguez Salgado
[v1] Fri, 27 Mar 2026 08:10:05 UTC (4,287 KB)