Research Papers research paper arxiv computer-vision image-recognition

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

arXivMarch 31, 20262 min read0 views

arXiv:2603.27437v1 Announce Type: new Abstract: Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose Spatia — Jiang Zhang, Shijie Zhou, Bangya Liu, Achuta Kadambi, Zhiwen Fan

View PDF HTML (experimental)

Abstract:Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.

Comments: CVPR 2026, Project Website: this https URL

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.27437 [cs.CV]

(or arXiv:2603.27437v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27437

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Shijie Zhou [view email] [v1] Sat, 28 Mar 2026 22:49:40 UTC (5,499 KB)

Original source

arXiv

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

ModelsRecent

ChatGPT acts as a "cognitive crutch" that weakens memory, new research suggests - PsyPost

<a href="https://news.google.com/rss/articles/CBMiowFBVV95cUxQTC13Zm5WZG9iQmRmZFpXM0ctamxRZ3E5N0ZFRDlIOWhHa2l1eTVmaTBzZFR6VGx6QjR2VEd0SFZoY0l4ZXVPSVF1c3FDTnE4Nk5zanNMWXhyLVpxVVlIUEZhZUFxXzYtQkRnM2E5eTN5M3NYUmJsX09YMTR6dWhLQ2hUSk55S2FJQXV4WFEzVFB4ZzhyS21RUzFoMDJzSXpQQ1pR?oc=5" target="_blank">ChatGPT acts as a "cognitive crutch" that weakens memory, new research suggests</a> PsyPost

Google News: ChatGPT

1m1 day ago

Models

Gemini Deep Think: Redefining the Future of Scientific Research - Google DeepMind

<a href="https://news.google.com/rss/articles/CBMipgFBVV95cUxPRmtMZnRYNW04a3Q4b0dSQm9aall0S3BJWFFOczQ3dmdfX3cyR1plYlotZHg5ekhlZ2s3cUd6Y1pyT3lkVEJrV1V0c0NWVlBQNHlMbXlEbXpTYWlSVUZHVllYZWdSb2RMU2JTelVGMXBEckZSdWt5VUs1d24zdUVLaExpS0NZMmtpSTRoNDd2MHRZdlBRaWxSWmVTNk0wRWtRQ2NaV2ln?oc=5" target="_blank">Gemini Deep Think: Redefining the Future of Scientific Research</a> Google DeepMind

Google News: DeepMind

1mabout 2 months ago

Research Papers

Alibaba Poaches Google DeepMind Research Scientist For Qwen AI Push - Yahoo Finance

<a href="https://news.google.com/rss/articles/CBMijwFBVV95cUxOYTZwZk0walRzazJQampab1FCM2k4Uy1SYk12UWZraENkUXYzZU9kbnlGTGZJS0pFaTZIUFlKZFkwVnJkRzhKbXhNV3lNdUZpdF8tSU1LMklqcTZlUDZERDZ3VzdWbjNQYUN4T2d2ZkRQT1R1MUc0LXdYNndPQTNzbXBXMXJhb3ZEZE00ZFMtaw?oc=5" target="_blank">Alibaba Poaches Google DeepMind Research Scientist For Qwen AI Push</a> Yahoo Finance

Google News: DeepMind

1m25 days ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 182 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research Papers

Alibaba Poaches Google DeepMind Research Scientist For Qwen AI Push - Yahoo Finance

Google News: DeepMind

1m25 days ago

Research PapersLive

Is AI's visual understanding mostly a 'mirage'? New research suggests so. - Fortune

<a href="https://news.google.com/rss/articles/CBMihgFBVV95cUxORGxTdWF3bnBiU0VaUEVtanJCT1htWVdjTUo3UnJycUxKcl9HU3Q1ODNINW9na1R0aENXXzhGYnc0Qlg3aGFGM2hiTVNFSjBZQ2FPUElZYmVGdzhfU0d5QkR2cDVnSzJBd2Y5WEVMLUJfWHY4YUc5c1I1U1dUQW9TeU56U1JjQQ?oc=5" target="_blank">Is AI's visual understanding mostly a 'mirage'? New research suggests so.</a> Fortune

Google News: Machine Learning

1mabout 1 hour ago

Research PapersLive

Here's the severance package Oracle offered laid-off US employees

Severance includes 4 weeks base salary plus 1 week per year of employment.

Business Insider

1m11 minutes ago

Research Papers

71% of Businesses Are Invisible to AI - And Most Don't Know It Yet

Search didn't evolve - it got replaced. AI systems don't return links, they return answers. New research shows 71% of businesses are invisible to AI recommendation engines. Brand size doesn't matter. Vanguard scores 16/100 while ProtonMail scores 88. The playbook that won Google doesn't work here. Clarity beats clout. Read All

Hackernoon AI

1m8 days ago