SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
arXiv:2603.27437v1 Announce Type: new Abstract: Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose Spatia — Jiang Zhang, Shijie Zhou, Bangya Liu, Achuta Kadambi, Zhiwen Fan
View PDF HTML (experimental)
Abstract:Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.
Comments: CVPR 2026, Project Website: this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.27437 [cs.CV]
(or arXiv:2603.27437v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.27437
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Shijie Zhou [view email] [v1] Sat, 28 Mar 2026 22:49:40 UTC (5,499 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
researchpaperarxivChatGPT acts as a "cognitive crutch" that weakens memory, new research suggests - PsyPost
<a href="https://news.google.com/rss/articles/CBMiowFBVV95cUxQTC13Zm5WZG9iQmRmZFpXM0ctamxRZ3E5N0ZFRDlIOWhHa2l1eTVmaTBzZFR6VGx6QjR2VEd0SFZoY0l4ZXVPSVF1c3FDTnE4Nk5zanNMWXhyLVpxVVlIUEZhZUFxXzYtQkRnM2E5eTN5M3NYUmJsX09YMTR6dWhLQ2hUSk55S2FJQXV4WFEzVFB4ZzhyS21RUzFoMDJzSXpQQ1pR?oc=5" target="_blank">ChatGPT acts as a "cognitive crutch" that weakens memory, new research suggests</a> <font color="#6f6f6f">PsyPost</font>
Gemini Deep Think: Redefining the Future of Scientific Research - Google DeepMind
<a href="https://news.google.com/rss/articles/CBMipgFBVV95cUxPRmtMZnRYNW04a3Q4b0dSQm9aall0S3BJWFFOczQ3dmdfX3cyR1plYlotZHg5ekhlZ2s3cUd6Y1pyT3lkVEJrV1V0c0NWVlBQNHlMbXlEbXpTYWlSVUZHVllYZWdSb2RMU2JTelVGMXBEckZSdWt5VUs1d24zdUVLaExpS0NZMmtpSTRoNDd2MHRZdlBRaWxSWmVTNk0wRWtRQ2NaV2ln?oc=5" target="_blank">Gemini Deep Think: Redefining the Future of Scientific Research</a> <font color="#6f6f6f">Google DeepMind</font>
Alibaba Poaches Google DeepMind Research Scientist For Qwen AI Push - Yahoo Finance
<a href="https://news.google.com/rss/articles/CBMijwFBVV95cUxOYTZwZk0walRzazJQampab1FCM2k4Uy1SYk12UWZraENkUXYzZU9kbnlGTGZJS0pFaTZIUFlKZFkwVnJkRzhKbXhNV3lNdUZpdF8tSU1LMklqcTZlUDZERDZ3VzdWbjNQYUN4T2d2ZkRQT1R1MUc0LXdYNndPQTNzbXBXMXJhb3ZEZE00ZFMtaw?oc=5" target="_blank">Alibaba Poaches Google DeepMind Research Scientist For Qwen AI Push</a> <font color="#6f6f6f">Yahoo Finance</font>
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Research Papers
Alibaba Poaches Google DeepMind Research Scientist For Qwen AI Push - Yahoo Finance
<a href="https://news.google.com/rss/articles/CBMijwFBVV95cUxOYTZwZk0walRzazJQampab1FCM2k4Uy1SYk12UWZraENkUXYzZU9kbnlGTGZJS0pFaTZIUFlKZFkwVnJkRzhKbXhNV3lNdUZpdF8tSU1LMklqcTZlUDZERDZ3VzdWbjNQYUN4T2d2ZkRQT1R1MUc0LXdYNndPQTNzbXBXMXJhb3ZEZE00ZFMtaw?oc=5" target="_blank">Alibaba Poaches Google DeepMind Research Scientist For Qwen AI Push</a> <font color="#6f6f6f">Yahoo Finance</font>
Is AI's visual understanding mostly a 'mirage'? New research suggests so. - Fortune
<a href="https://news.google.com/rss/articles/CBMihgFBVV95cUxORGxTdWF3bnBiU0VaUEVtanJCT1htWVdjTUo3UnJycUxKcl9HU3Q1ODNINW9na1R0aENXXzhGYnc0Qlg3aGFGM2hiTVNFSjBZQ2FPUElZYmVGdzhfU0d5QkR2cDVnSzJBd2Y5WEVMLUJfWHY4YUc5c1I1U1dUQW9TeU56U1JjQQ?oc=5" target="_blank">Is AI's visual understanding mostly a 'mirage'? New research suggests so.</a> <font color="#6f6f6f">Fortune</font>
71% of Businesses Are Invisible to AI - And Most Don't Know It Yet
Search didn't evolve - it got replaced. AI systems don't return links, they return answers. New research shows 71% of businesses are invisible to AI recommendation engines. Brand size doesn't matter. Vanguard scores 16/100 while ProtonMail scores 88. The playbook that won Google doesn't work here. Clarity beats clout. Read All

Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!