
HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

[Submitted on 26 Mar 2026]

Huizhi Liang, Yichao Shen, Yu Deng


Abstract: Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.
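The abstract mentions metric-scale point maps and automatically generated 3D spatial VQA pairs without giving implementation details. The sketch below is only an illustration of what those two ideas can look like in the simplest case, not the authors' pipeline. It assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a depth map already in metres, and per-object binary masks. All function names and the question template are hypothetical, and plain Python lists stand in for image arrays to keep the example dependency-free.

```python
import math


def depth_to_point_map(depth, fx, fy, cx, cy):
    """Back-project a metric depth map into per-pixel camera-frame (X, Y, Z).

    depth is a list of rows of depth values in metres. The result has the
    same grid shape, with one (X, Y, Z) tuple per pixel (pinhole model).
    """
    point_map = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        point_map.append(out_row)
    return point_map


def object_centroid(point_map, mask):
    """Mean 3D point over the pixels where mask is truthy."""
    pts = [
        p
        for point_row, mask_row in zip(point_map, mask)
        for p, m in zip(point_row, mask_row)
        if m
    ]
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(3))


def distance_vqa(name_a, centroid_a, name_b, centroid_b):
    """Build one question/answer pair about the distance between two objects.

    The template is a hypothetical example of a metric-distance question.
    """
    d = math.dist(centroid_a, centroid_b)
    return {
        "question": f"How far apart are the {name_a} and the {name_b}, in metres?",
        "answer": f"{d:.2f}",
    }


# Toy 2x4 scene: a near object on the left (2 m), a far one on the right (4 m).
depth = [[2.0, 2.0, 4.0, 4.0],
         [2.0, 2.0, 4.0, 4.0]]
mask_chair = [[1, 1, 0, 0], [1, 1, 0, 0]]
mask_table = [[0, 0, 1, 1], [0, 0, 1, 1]]

point_map = depth_to_point_map(depth, fx=2.0, fy=2.0, cx=1.5, cy=0.5)
qa = distance_vqa(
    "chair", object_centroid(point_map, mask_chair),
    "table", object_centroid(point_map, mask_table),
)
print(qa["question"], qa["answer"])
```

In this toy scene the two centroids are (-1, 0, 2) and (2, 0, 4), so the generated answer is the square root of 13, about 3.61 m. A real pipeline would additionally need estimated depth, intrinsics, and object masks for each image, which the abstract does not describe.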

Comments: Accepted by CVPR 2026. Project page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.25411 [cs.CV]

(or arXiv:2603.25411v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.25411

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Huizhi Liang
[v1] Thu, 26 Mar 2026 13:08:12 UTC (1,923 KB)

Original source: arXiv, https://arxiv.org/abs/2603.25411v1



