A Semantic Observer Layer for Autonomous Vehicles: Pre-Deployment Feasibility Study of VLMs for Low-Latency Anomaly Detection
arXiv:2603.28888v1 Announce Type: new
Abstract: Semantic anomalies, context-dependent hazards that pixel-level detectors cannot reason about, pose a critical safety risk in autonomous driving. We propose a \emph{semantic observer layer}: a quantized vision-language model (VLM) running at 1--2 Hz alongside the primary AV control loop, monitoring for semantic edge cases and triggering fail-safe handoffs when they are detected. Using Nvidia Cosmos-Reason1-7B with NVFP4 quantization and FlashAttention2, we achieve ~500 ms inference, a ~50x speedup over the unoptimized FP16 baseline (no quantization, standard PyTorch attention) on the same hardware, satisfying the observer timing budget. We benchmark accuracy, latency, and quantization behavior in static and video conditions, identify NF4 recall collapse (10.6%) as a hard deployment constraint, and present a hazard analysis mapping performance metrics to safety goals. The results establish a pre-deployment feasibility case for the semantic observer architecture on embodied-AI AV platforms.
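The observer architecture described in the abstract, a slow semantic monitor polling alongside a fast control loop and handing off when an anomaly (or a blown timing budget) is detected, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `classify_frame` is a hypothetical stand-in for the quantized Cosmos-Reason1-7B call, and `trigger_failsafe` for the AV stack's handoff mechanism; only the 1--2 Hz rate and ~500 ms budget come from the abstract.

```python
import time

# Figures stated in the abstract: ~500 ms inference at a 1-2 Hz observer rate.
OBSERVER_BUDGET_MS = 500.0
OBSERVER_PERIOD_S = 1.0  # 1 Hz, the lower bound of the stated range

def observer_step(classify_frame, frame, budget_ms=OBSERVER_BUDGET_MS):
    """Run one VLM anomaly check and report whether it met the timing budget.

    `classify_frame` is a placeholder for the quantized VLM inference call;
    it returns True when a semantic anomaly is detected in the frame.
    """
    start = time.perf_counter()
    anomaly = classify_frame(frame)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return anomaly, latency_ms, latency_ms <= budget_ms

def supervise(frames, classify_frame, trigger_failsafe):
    """Observer loop: poll frames at the observer rate and hand off to the
    fail-safe on a detected anomaly, or on a missed deadline, which a safety
    monitor should itself treat as a fault."""
    for frame in frames:
        anomaly, latency_ms, on_time = observer_step(classify_frame, frame)
        if anomaly or not on_time:
            trigger_failsafe(frame, anomaly, latency_ms)
            return False  # handoff issued; supervision stops here
        # Sleep off the remainder of the observer period before the next poll.
        time.sleep(max(0.0, OBSERVER_PERIOD_S - latency_ms / 1000.0))
    return True
```

Treating a missed deadline the same as a detected anomaly is one defensible reading of the paper's hazard-analysis framing (a silent observer is an unsafe observer), but the actual fail-safe policy is a design choice the abstract does not specify.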
Subjects:
Robotics (cs.RO)
Cite as: arXiv:2603.28888 [cs.RO]
(or arXiv:2603.28888v1 [cs.RO] for this version)
https://doi.org/10.48550/arXiv.2603.28888
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Aliasghar Arab
[v1] Mon, 30 Mar 2026 18:14:03 UTC (2,393 KB)