Research Papers research paper arxiv ai artificial-intelligence

Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

arXivMarch 31, 202610 min read0 views

arXiv:2505.03821v2 Announce Type: replace-cross Abstract: We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a new set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations -- such as object position relative to the minifigure and the minifigure's orientation -- and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each task is paired with a se — Gracjan G\'oral, Alicja Ziarko, Piotr Mi{\l}o\'s, Micha{\l} Nauman, Maciej Wo{\l}czyk, Micha{\l} Kosi\'nski

View PDF HTML (experimental)

Abstract:We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a new set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations -- such as object position relative to the minifigure and the minifigure's orientation -- and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. We evaluate several high-performing models, including Gemini Robotics-ER 1.5, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, GPT-4, and Qwen3, and find that while they excel at scene understanding, performance declines markedly on spatial reasoning and deteriorates further on perspective taking. Our analysis suggests a gap between surface-level object recognition and the deeper spatial and perspective reasoning required for complex visual tasks, pointing to the need for integrating explicit geometric representations and tailored training protocols in future VLM development.

Comments: Accepted at CVPR 2026 Findings

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2505.03821 [cs.CV]

(or arXiv:2505.03821v2 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2505.03821

arXiv-issued DOI via DataCite

Submission history

From: Gracjan Góral [view email] [v1] Sat, 3 May 2025 00:10:41 UTC (47,539 KB) [v2] Sat, 28 Mar 2026 12:51:44 UTC (8,147 KB)

Original source

arXiv

https://arxiv.org/abs/2505.03821

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

ModelsRecent

Exclusive | Caltech Researchers Claim Radical Compression of High-Fidelity AI Models - WSJ

Exclusive | Caltech Researchers Claim Radical Compression of High-Fidelity AI Models WSJ

Google News: LLM

1m1 day ago

Self-Evolving AI

NATO tests autonomous drone technology in DARPA-style competition

In a basement beneath City St George s, University of London, senior NATO leaders watch on as four research teams demonstrate the latest in AI-controlled, autonomous drone technolo0gy

New Scientist Tech

1mover 1 year ago

ProductsFresh

QuadRank: Engineering a High Throughput Rank

arXiv:2602.04103v2 Announce Type: replace Abstract: Given a text, a query $\mathsf{rank}(q, c)$ counts the number of occurrences of character $c$ among the first $q$ characters of the text. Space-efficient methods to answer these rank queries form an important building block in many succinct data structures. For example, the FM-index is a widely used data structure that uses rank queries to locate all occurrences of a pattern in a text. In bioinformatics applications, the goal is usually to process a given input as fast as possible. Thus, data structures should have high throughput when used with many threads. Contributions. For the binary alphabet, we develop BiRank with 3.28% space overhead. It merges the central ideas of two recent papers: (1) we interleave (inline) offsets in each cach

arXiv cs.DS

2mabout 7 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 142 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersFresh

Online Flow Time Minimization: Tight Bounds for Non-Preemptive Algorithms

arXiv:2511.03485v3 Announce Type: replace Abstract: This paper studies the online scheduling problem of minimizing total flow time for $n$ jobs on $m$ identical machines. A classical $\Omega(n)$ lower bound shows that no deterministic single-machine algorithm can beat the trivial greedy, even when $n$ is known in advance. However, this barrier is specific to deterministic algorithms on a single machine, leaving open what randomization, multiple machines, or the kill-and-restart capability can achieve. We give a nearly complete answer. For randomized non-preemptive algorithms, we establish a tight $\Theta(\sqrt{n/m})$ competitive ratio, which also improves the best offline approximation to $O(\sqrt{n/m})$. For deterministic non-preemptive algorithms on multiple machines, we prove an $O(n/m^

arXiv cs.DS

2mabout 7 hours ago

Research PapersFresh

On the average-case complexity landscape for Tensor-Isomorphism-complete problems over finite fields

arXiv:2604.00591v1 Announce Type: cross Abstract: In Grochow and Qiao (SIAM J. Comput., 2021), the complexity class Tensor Isomorphism (TI) was introduced and isomorphism problems for groups, algebras, and polynomials were shown to be TI-complete. In this paper, we study average-case algorithms for several TI-complete problems over finite fields, including algebra isomorphism, matrix code conjugacy, and $4$-tensor isomorphism. Our main results are as follows. Over the finite field of order $q$, we devise (1) average-case polynomial-time algorithms for algebra isomorphism and matrix code conjugacy that succeed in a $1/\Theta(q)$ fraction of inputs and (2) an average-case polynomial-time algorithm for the $4$-tensor isomorphism that succeeds in a $1/q^{\Theta(1)}$ fraction of inputs. Prior t

arXiv cs.DS

2mabout 7 hours ago

Research PapersFresh

Stable algorithms cannot reliably find isolated perceptron solutions

arXiv:2604.00328v1 Announce Type: cross Abstract: We study the binary perceptron, a random constraint satisfaction problem that asks to find a Boolean vector in the intersection of independently chosen random halfspaces. A striking feature of this model is that at every positive constraint density, it is expected that a $1-o_N(1)$ fraction of solutions are \emph{strongly isolated}, i.e. separated from all others by Hamming distance $\Omega(N)$. At the same time, efficient algorithms are known to find solutions at certain positive constraint densities. This raises a natural question: can any isolated solution be algorithmically visible? We answer this in the negative: no algorithm whose output is stable under a tiny Gaussian resampling of the disorder can \emph{reliably} locate isolated sol

arXiv cs.DS

2mabout 7 hours ago

Research PapersLive

I ran by instinct for years. Then I built an AI running coach.

How a 50 km trail race, a broken ChatGPT workflow, and 60+ research papers led me to create Coach Leo. Continue reading on Medium »

Medium AI

1mabout 1 hour ago