Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction

arXiv, March 31, 2026
arXiv:2512.05597v2 Announce Type: replace

Authors: Ruihong Yin, Xuepeng Shi, Oleksandr Bailo, Marco Manfredi, Theo Gevers

Abstract: Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation and 3D object detection, via a unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene understanding. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structured language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability. Furthermore, we design a parameter-efficient mechanism that reduces the parameter overhead of MTP. Extensive experiments on synthetic and real-world benchmarks demonstrate that Fast SceneScript can generate up to 9 tokens per decoder inference step without compromising accuracy, while adding only $\sim 7.5\%$ additional parameters.
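The abstract does not spell out how confidence-guided decoding scores or filters tokens, so the following is only a toy sketch of the general idea: a model proposes several tokens per decoder step, and the decoder keeps the longest leading run whose confidences clear a threshold. The names `accept_confident_prefix`, `decode`, and `toy_propose`, the fixed threshold of 0.9, and the hard-coded confidences are all assumptions made for illustration; none of them come from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Proposal = Tuple[List[int], List[float]]  # (token ids, per-token confidences)


def accept_confident_prefix(tokens: Sequence[int],
                            confidences: Sequence[float],
                            threshold: float) -> List[int]:
    """Keep the first token, then later tokens until one falls below threshold."""
    if not tokens:
        return []
    # Assumption: the first proposed token plays the role of ordinary
    # next-token prediction, so it is always accepted.
    accepted = [tokens[0]]
    for tok, conf in zip(tokens[1:], confidences[1:]):
        if conf < threshold:
            break  # discard this token and everything proposed after it
        accepted.append(tok)
    return accepted


def decode(propose: Callable[[List[int]], Proposal],
           eos_id: int,
           threshold: float = 0.9,
           max_steps: int = 100) -> Tuple[List[int], int]:
    """Run the toy loop; returns (generated tokens, number of decoder steps)."""
    out: List[int] = []
    steps = 0
    while steps < max_steps:
        tokens, confidences = propose(out)  # one decoder step, several tokens
        steps += 1
        for tok in accept_confident_prefix(tokens, confidences, threshold):
            out.append(tok)
            if tok == eos_id:
                return out, steps
    return out, steps


# Hypothetical stand-in for a multi-token model: it reads from a fixed target
# sequence and reports lower confidence for tokens further ahead.
TARGET = [5, 6, 7, 8, 9, 10, 11, 0]  # 0 is the hypothetical end-of-sequence id


def toy_propose(prefix: List[int]) -> Proposal:
    start = len(prefix)
    tokens = TARGET[start:start + 4]
    confidences = [1.0, 0.95, 0.92, 0.5][:len(tokens)]
    return tokens, confidences


generated, steps = decode(toy_propose, eos_id=0)
print(generated, steps)
```

In this toy run the eight-token sequence is produced in three decoder steps instead of eight, because the fourth proposal of each step is rejected for low confidence. A real system would derive confidences from the model itself, which is the part the abstract attributes to its improved scoring mechanism.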

Comments: 15 pages, 14 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2512.05597 [cs.CV] (or arXiv:2512.05597v2 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2512.05597

arXiv-issued DOI via DataCite

Submission history

From: Ruihong Yin
[v1] Fri, 5 Dec 2025 10:35:43 UTC (5,113 KB)
[v2] Mon, 30 Mar 2026 08:15:44 UTC (9,181 KB)
