
HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

arXiv, submitted on 26 Mar 2026

Huizhi Liang, Yichao Shen, Yu Deng


Abstract: Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.
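To make the phrase "metric-scale point maps" concrete, here is a minimal sketch of how a per-pixel point map can be derived from a metric depth map under a standard pinhole camera model. This is not the authors' implementation, which the abstract does not describe; the function names `depth_to_point_map` and `metric_distance`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth values are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def depth_to_point_map(
    depth: List[List[float]], fx: float, fy: float, cx: float, cy: float
) -> List[List[Point]]:
    """Back-project a metric depth map (meters) into camera-frame XYZ per pixel.

    Assumes an ideal pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy), all in pixels. Hypothetical helper, not
    taken from the paper.
    """
    point_map: List[List[Point]] = []
    for v, row in enumerate(depth):          # v indexes image rows
        out_row: List[Point] = []
        for u, z in enumerate(row):          # u indexes image columns
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        point_map.append(out_row)
    return point_map


def metric_distance(a: Point, b: Point) -> float:
    """Euclidean distance between two 3D points, in the depth map's units."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


# Toy 2x2 depth map: every pixel is 2 m from the camera (assumed values).
toy_depth = [[2.0, 2.0], [2.0, 2.0]]
toy_points = depth_to_point_map(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
left, right = toy_points[0][0], toy_points[0][1]
print(metric_distance(left, right))
```

Because the resulting coordinates are in real-world units, quantities such as the distance between two image locations can be read off directly, which is one plausible reason such a map could help a model with metric spatial questions.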

Comments: Accepted by CVPR 2026. Project page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.25411 [cs.CV] (or arXiv:2603.25411v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.25411

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Huizhi Liang [v1] Thu, 26 Mar 2026 13:08:12 UTC (1,923 KB)
