
PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

By Amirreza Rouhi, Parikshit Sakurikar, Satya Sai Reddy, Narsimha Menga, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, Sashi Reddi
arXiv cs.CV, April 1, 2026




Abstract: A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics, and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions: Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP). To our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric, and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding, where accuracy improves by 36.4%. Our results suggest that ontology-structured, domain-specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at this https URL
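The scale figures in the abstract can be turned into rough per-sample averages. The sketch below is a back-of-envelope calculation only: it assumes every counted frame was sampled at the stated 4 fps and treats the approximate totals (270K samples, 11.8M frames, 730M tokens) as exact. The function name `corpus_scale` and its keys are illustrative and do not come from the paper, and the paper does not report these derived numbers.

```python
# Approximate figures quoted in the abstract (treated as exact here).
SAMPLES = 270_000   # SFT samples
FRAMES = 11.8e6     # video frames at the stated sampling rate
TOKENS = 730e6      # total tokens
FPS = 4             # sampling rate stated in the abstract


def corpus_scale(samples, frames, tokens, fps):
    """Derive rough averages from the headline totals (hypothetical helper)."""
    seconds = frames / fps  # footage duration implied by the frame count
    return {
        "video_hours": seconds / 3600,
        "frames_per_sample": frames / samples,
        "seconds_per_sample": seconds / samples,
        "tokens_per_sample": tokens / samples,
    }


stats = corpus_scale(SAMPLES, FRAMES, TOKENS, FPS)
for name, value in stats.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the totals imply roughly 820 hours of sampled footage, about 44 frames (about 11 seconds) per sample, and about 2,700 tokens per sample. Because the corpus is multi-view, the hour figure sums footage across viewpoints and need not equal wall-clock recording time.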

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Cite as: arXiv:2603.29281 [cs.CV] (or arXiv:2603.29281v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.29281

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Amirreza Rouhi
[v1] Tue, 31 Mar 2026 05:29:22 UTC (10,341 KB)

Original source: arXiv cs.CV, https://arxiv.org/abs/2603.29281
