Live
Black Hat USADark ReadingBlack Hat AsiaAI BusinessTutorials vs. Transformations: What Beauty Content Wins in 2026Dev.to AIAnthropic employee error exposes Claude Code source - InfoWorldGoogle News: ClaudeMulti-Factor Strategies Aren't Exclusive to Big Firms: A Research Framework for Independent QuantsDev.to AISystem Instead of Team: Rethinking How Businesses Are BuiltDev.to AI10 лучших системных промптов ChatGPT: секреты успеха без опыта!Dev.to AIAI Post 4: When AI Gets It Wrong: Why AI Fails (And What That Teaches Us)Medium AIGoogle AI Overviews Are Reshaping Search — Here’s How to Get Your Business CitedDev.to AIThe $500/Month “Tool Trap” (And How Beginners Are Escaping It for Just $1)Medium AIAnthropic Accidentally Exposes Source Code for Claude Code - CNETGoogle News: ClaudeThe 4,500 Micro-Adjustment Question: Why the Best AI Still Needs a “Commander” in the Control Room.Medium AIJournal Figure Replication | Python Implementation of Sector Violin PlotsMedium AICommunity Without Tokens: What AI Dev Tools Can Learn from Crypto's Community PlaybookDev.to AIBlack Hat USADark ReadingBlack Hat AsiaAI BusinessTutorials vs. Transformations: What Beauty Content Wins in 2026Dev.to AIAnthropic employee error exposes Claude Code source - InfoWorldGoogle News: ClaudeMulti-Factor Strategies Aren't Exclusive to Big Firms: A Research Framework for Independent QuantsDev.to AISystem Instead of Team: Rethinking How Businesses Are BuiltDev.to AI10 лучших системных промптов ChatGPT: секреты успеха без опыта!Dev.to AIAI Post 4: When AI Gets It Wrong: Why AI Fails (And What That Teaches Us)Medium AIGoogle AI Overviews Are Reshaping Search — Here’s How to Get Your Business CitedDev.to AIThe $500/Month “Tool Trap” (And How Beginners Are Escaping It for Just $1)Medium AIAnthropic Accidentally Exposes Source Code for Claude Code - CNETGoogle News: ClaudeThe 4,500 Micro-Adjustment Question: Why the Best AI Still Needs a “Commander” in the Control Room.Medium AIJournal Figure Replication | Python Implementation of Sector Violin PlotsMedium AICommunity Without Tokens: What AI Dev Tools Can Learn from Crypto's Community PlaybookDev.to AI

Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

arXivMarch 31, 20262 min read0 views
Source Quiz

arXiv:2603.27507v1 Announce Type: new Abstract: Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interactio — Haifeng Huang, Yilun Chen, Zehan Wang, Jiangmiao Pang, Zhou Zhao

View PDF HTML (experimental)

Abstract:Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders, unlike the isolated per-object features in Chat-Scene. Its flexible object-centric design also supports grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both category and spatial levels during multi-step inference. Without the need for additional task-specific heads or fine-tuning, Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks: ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. These results highlight its effectiveness in scene comprehension, object grounding, and spatial reasoning. Additionally, without reconstructing 3D worlds through computationally expensive processes, we demonstrate its applicability to real-world scenarios using only 2D inputs.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.27507 [cs.CV]

(or arXiv:2603.27507v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27507

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Haifeng Huang [view email] [v1] Sun, 29 Mar 2026 04:16:04 UTC (4,960 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by AI News Hub · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Chat-Scene+…researchpaperarxivcomputer-vi…image-recog…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 172 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Research Papers