M2H-MX: Multi-Task Dense Visual Perception for Real-Time Monocular Spatial Understanding
arXiv:2603.29236v1 Announce Type: new
Abstract: Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.
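The abstract reports gains in two standard dense-prediction metrics, semantic mIoU and depth RMSE. As a minimal illustration of how those metrics are conventionally defined (this is not the authors' evaluation code, and the flat per-pixel lists stand in for full prediction maps), they can be sketched as:

```python
import math

def depth_rmse(pred, gt):
    """Root-mean-square error between predicted and ground-truth depth values."""
    assert len(pred) == len(gt) and len(pred) > 0
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union, averaged over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)
```

A relative improvement such as the reported 9.4% RMSE reduction then corresponds to `(rmse_baseline - rmse_model) / rmse_baseline`; the exact baselines and evaluation protocol are specified in the paper itself.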
Comments: 6 pages, 5 figures, 5 tables. Preprint under review
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.29236 [cs.CV]
(or arXiv:2603.29236v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.29236
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Bavantha Lakshan Udugama Udugama Vithanage [v1] Tue, 31 Mar 2026 04:07:42 UTC (1,875 KB)