Live
Black Hat USAAI BusinessBlack Hat AsiaAI BusinessPaper close reading: "Why Language Models Hallucinate"LessWrong AIBuilding a Zero-Downtime AI Content Generator with Gemini 2.5 Flash 🚀Dev.to AIHow I Built a Full SaaS Product Using Next.js and TypeScriptDev.to AIYour AI Is Not Thinking. It's Multiplying Numbers. Let Me Show You Exactly How.Dev.to AISecure AWS Certified Data Engineer Associate Exam Structure and Key ConceptsDev.to AIFree MCP Server: Real-Time Crypto Data for Claude Code and CursorDev.to AII Am an AI Agent. Here Is My Entire Business Stack.Dev.to AIOpenClaw Changed How We Use AI. KiloClaw Made It Effortless to Get StartedHackernoon AIA Reasoning Log: What Happens When Integration Fails HonestlyDEV Community10 Claude Code Skills That Replaced My Boilerplate FoldersDev.to AIFull Stack Developer Roadmap 2026: The Complete Guide from Beginner to Pro 🚀Dev.to AII Shipped an AI SaaS in 4 Hours. Here Is the Exact Stack.Dev.to AIBlack Hat USAAI BusinessBlack Hat AsiaAI BusinessPaper close reading: "Why Language Models Hallucinate"LessWrong AIBuilding a Zero-Downtime AI Content Generator with Gemini 2.5 Flash 🚀Dev.to AIHow I Built a Full SaaS Product Using Next.js and TypeScriptDev.to AIYour AI Is Not Thinking. It's Multiplying Numbers. Let Me Show You Exactly How.Dev.to AISecure AWS Certified Data Engineer Associate Exam Structure and Key ConceptsDev.to AIFree MCP Server: Real-Time Crypto Data for Claude Code and CursorDev.to AII Am an AI Agent. Here Is My Entire Business Stack.Dev.to AIOpenClaw Changed How We Use AI. KiloClaw Made It Effortless to Get StartedHackernoon AIA Reasoning Log: What Happens When Integration Fails HonestlyDEV Community10 Claude Code Skills That Replaced My Boilerplate FoldersDev.to AIFull Stack Developer Roadmap 2026: The Complete Guide from Beginner to Pro 🚀Dev.to AII Shipped an AI SaaS in 4 Hours. Here Is the Exact Stack.Dev.to AI
AI NEWS HUBbyEIGENVECTOREigenvector

Omni-MMSI: Toward Identity-attributed Social Interaction Understanding

arXiv cs.CVby [Submitted on 31 Mar 2026]April 2, 20262 min read2 views
Source Quiz
🧒Explain Like I'm 5Simple language

Hey there, little explorer! Imagine you're at a party, and lots of friends are talking and playing.

This new computer game, called Omni-MMSI, helps robots be super good listeners and watchers!

It's like teaching a robot to know who is saying "hello," who is pointing at the yummy cake, and who they are talking to.

Right now, robots sometimes get confused and don't know who is who. But this new game helps them learn to figure out all these things, just like you do!

So, robots can understand parties better and be super helpful friends! Isn't that cool?

arXiv:2604.00267v1 Announce Type: new Abstract: We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social

View PDF HTML (experimental)

Abstract:We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social interaction understanding. To address this challenge, we propose Omni-MMSI-R, a reference-guided pipeline that produces identity-attributed social cues with tools and conducts chain-of-thought social reasoning. To facilitate this pipeline, we construct participant-level reference pairs and curate reasoning annotations on top of the existing datasets. Experiments demonstrate that Omni-MMSI-R outperforms advanced LLMs and counterparts on Omni-MMSI. Project page: this https URL.

Comments: Accepted to CVPR 2026. Project page: this https URL

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2604.00267 [cs.CV]

(or arXiv:2604.00267v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2604.00267

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Xinpeng Li [view email] [v1] Tue, 31 Mar 2026 21:49:52 UTC (2,141 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

announceassistantreasoning

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Omni-MMSI: …announceassistantreasoningarxivgithubarXiv cs.CV

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 220 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Products