Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
In plain terms: picture a party where several people talk, point, and gesture at once. Today's AI systems often cannot tell who said what, or whom a speaker is addressing. Omni-MMSI is a new task that pushes models to work these things out directly from raw audio and video, the way a person in the room does, rather than from cues that have already been cleanly labeled for them.
Abstract: We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social interaction understanding. To address this challenge, we propose Omni-MMSI-R, a reference-guided pipeline that produces identity-attributed social cues with tools and conducts chain-of-thought social reasoning. To facilitate this pipeline, we construct participant-level reference pairs and curate reasoning annotations on top of the existing datasets. Experiments demonstrate that Omni-MMSI-R outperforms advanced LLMs and counterparts on Omni-MMSI. Project page: this https URL.
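The abstract does not give implementation details, but the core mechanism it names, reference-guided identity attribution, can be thought of as matching the embedding of a raw cue (a detected voice or face) against participant-level reference embeddings. A minimal sketch, assuming a simple nearest-neighbor match by cosine similarity; all names and embedding values below are hypothetical, not from the paper:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def attribute_identity(cue_embedding, references):
    """Attribute a raw cue (e.g. a speaker's voice embedding) to the
    participant whose reference embedding is most similar."""
    return max(references, key=lambda name: cosine(cue_embedding, references[name]))

# Hypothetical participant-level references: name -> embedding.
references = {"Alice": [1.0, 0.0], "Bob": [0.0, 1.0]}

# A speech segment whose speaker embedding leans toward Alice's reference.
speaker = attribute_identity([0.9, 0.1], references)  # "Alice"
```

Once each cue carries an identity ("Alice said X while pointing at Bob"), the downstream social reasoning step the paper describes can operate over named participants instead of anonymous signals.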
Comments: Accepted to CVPR 2026. Project page: this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2604.00267 [cs.CV]
(or arXiv:2604.00267v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2604.00267
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Xinpeng Li [v1] Tue, 31 Mar 2026 21:49:52 UTC (2,141 KB)