
Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition

arXiv, March 30, 2026

arXiv:2603.25752v1 Announce Type: new

Authors: Ying Liu, Yuntao Shou, Wei Ai, Tao Meng, Keqin Li


Abstract: In real-world scenarios, audio and video signals are often degraded by environmental noise and limited acquisition conditions, so the extracted features contain excessive noise. Furthermore, data quality and information-carrying capacity are imbalanced across modalities. Together, these two issues cause information distortion and weight bias during the fusion phase, impairing overall recognition performance. Most existing methods neglect the impact of noisy modalities and rely on implicit weighting to model modality importance, thereby failing to explicitly account for the predominant contribution of the textual modality to emotion understanding. To address these issues, we propose a relation-aware denoising and diffusion attention fusion model for multimodal conversation emotion recognition (MCER). Specifically, we first design a differential Transformer that explicitly computes the difference between two attention maps, thereby enhancing temporally consistent information while suppressing time-irrelevant noise, which leads to effective denoising in both the audio and video modalities. Second, we construct modality-specific and cross-modality relation subgraphs to capture speaker-dependent emotional dependencies, enabling fine-grained modeling of intra- and inter-modal relationships. Finally, we introduce a text-guided cross-modal diffusion mechanism that leverages self-attention to model intra-modal dependencies and adaptively diffuses audiovisual information into the textual stream, ensuring more robust and semantically aligned multimodal fusion.
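The abstract gives no formulas, so the following is only a minimal NumPy sketch of the two attention ideas it names, under stated assumptions rather than the authors' implementation. `differential_attention` subtracts a second softmax attention map from the first, with a fixed scalar `lam` (an assumption; the paper may use a different or learned weighting). `text_guided_fusion` stands in for the text-guided diffusion step with plain cross-attention plus a residual connection, where text features act as queries over audiovisual features. All function names, weight matrices, and dimensions are hypothetical, and the relation subgraphs are not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5):
    """Apply the difference of two attention maps to the values.

    Assumption: patterns shared by both maps (treated as noise)
    cancel in a1 - lam * a2; lam is fixed here for illustration.
    """
    d = wq1.shape[1]
    a1 = softmax((x @ wq1) @ (x @ wk1).T / np.sqrt(d))
    a2 = softmax((x @ wq2) @ (x @ wk2).T / np.sqrt(d))
    diff = a1 - lam * a2
    return diff @ (x @ wv), diff

def text_guided_fusion(text, av, wq, wk, wv):
    """Text queries attend over audiovisual features (hypothetical stand-in).

    The attended audiovisual information is added to the text stream,
    so text remains the dominant carrier of the fused representation.
    """
    d = wq.shape[1]
    attn = softmax((text @ wq) @ (av @ wk).T / np.sqrt(d))
    return text + attn @ (av @ wv)

rng = np.random.default_rng(0)
T, d_in, d = 6, 8, 4                      # utterances, input dim, model dim
audio = rng.normal(size=(T, d_in))        # noisy audio features (synthetic)
text = rng.normal(size=(T, d))            # text features (synthetic)

w = [rng.normal(size=(d_in, d)) for _ in range(5)]
denoised, diff_map = differential_attention(audio, *w)

wf = [rng.normal(size=(d, d)) for _ in range(3)]
fused = text_guided_fusion(text, denoised, *wf)
print(denoised.shape, fused.shape)
```

Because each softmax row sums to 1, every row of `diff_map` sums to `1 - lam`, which is a quick sanity check that the subtraction was applied as intended.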

Comments: 19 pages

Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Cite as: arXiv:2603.25752 [cs.CL] (or arXiv:2603.25752v1 [cs.CL] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.25752 (arXiv-issued DOI via DataCite)

Journal reference: Neurocomputing, 2026

Related DOI: https://doi.org/10.1016/j.neucom.2026.133306

Submission history

From: Yuntao Shou
[v1] Sun, 22 Mar 2026 01:27:22 UTC (4,353 KB)
