M2H-MX: Multi-Task Dense Visual Perception for Real-Time Monocular Spatial Understanding
arXiv:2603.29236v1 Announce Type: new
Abstract: Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.
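The abstract reports gains in two standard dense-prediction metrics, semantic mIoU and depth RMSE. As a minimal illustration of how those metrics are conventionally defined (this is not the authors' evaluation code, and the flat per-pixel lists stand in for full prediction maps), they can be sketched as:

```python
import math

def depth_rmse(pred, gt):
    """Root-mean-square error between predicted and ground-truth depth values."""
    assert len(pred) == len(gt) and len(pred) > 0
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union, averaged over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)
```

A relative improvement such as the reported 9.4% RMSE reduction then corresponds to `(rmse_baseline - rmse_model) / rmse_baseline`; the exact baselines and evaluation protocol are specified in the paper itself.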
Comments: 6 pages, 5 figures, 5 tables. Preprint under review
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.29236 [cs.CV]
(or arXiv:2603.29236v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.29236
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Bavantha Lakshan Udugama Udugama Vithanage [v1] Tue, 31 Mar 2026 04:07:42 UTC (1,875 KB)