MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

arXiv, March 31, 2026

Authors: Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani


Abstract: Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
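The abstract does not state the loss itself, so the following is only a minimal sketch of how an objective of this kind could be assembled. It is not the authors' formulation. It combines the standard DPO loss with two hypothetical penalty terms: one that asks the chosen response's log-probability to stay unchanged when an irrelevant modality is corrupted (invariance), and one that asks it to drop when the relevant modality is corrupted (sensitivity). All function names, argument names, and the weights `lam_inv` and `lam_sens` are assumptions made for illustration. The language-prior debiasing penalty is left out.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    pol_* / ref_* are sequence log-probabilities of the chosen (w) and
    rejected (l) responses under the policy and the frozen reference model.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def invariance_penalty(lp_clean, lp_irrelevant_corrupted):
    """Hypothetical term: zero when corrupting the irrelevant modality
    leaves the chosen response's log-probability unchanged."""
    return (lp_clean - lp_irrelevant_corrupted) ** 2


def sensitivity_penalty(lp_clean, lp_relevant_corrupted, beta=0.1):
    """Hypothetical term: small when corrupting the relevant modality
    lowers the chosen response's log-probability."""
    return -log_sigmoid(beta * (lp_clean - lp_relevant_corrupted))


def mod_dpo_sketch(pol_w, pol_l, ref_w, ref_l,
                   lp_irrelevant_corrupted, lp_relevant_corrupted,
                   beta=0.1, lam_inv=1.0, lam_sens=1.0):
    """Illustrative total: DPO loss plus the two weighted penalties."""
    base = dpo_loss(pol_w, pol_l, ref_w, ref_l, beta)
    inv = invariance_penalty(pol_w, lp_irrelevant_corrupted)
    sens = sensitivity_penalty(pol_w, lp_relevant_corrupted, beta)
    return base + lam_inv * inv + lam_sens * sens


# Made-up log-probabilities for a question that depends on audio only:
# corrupting video (irrelevant) barely moves the chosen response,
# corrupting audio (relevant) makes it much less likely.
loss = mod_dpo_sketch(
    pol_w=-5.0, pol_l=-9.0, ref_w=-6.0, ref_l=-6.0,
    lp_irrelevant_corrupted=-5.1, lp_relevant_corrupted=-20.0,
)
print(f"illustrative loss: {loss:.4f}")
```

In this sketch the sensitivity term reuses the log-sigmoid form of DPO so that all terms share a scale through `beta`. A squared difference is one of several reasonable choices for the invariance term, and a KL divergence between output distributions would be another.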

Comments: CVPR 2026. Project Page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as: arXiv:2603.03192 [cs.CV] (or arXiv:2603.03192v2 [cs.CV] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.03192 (arXiv-issued DOI via DataCite)

Submission history

From: Ashutosh Chaubey
[v1] Tue, 3 Mar 2026 17:50:24 UTC (3,570 KB)
[v2] Fri, 27 Mar 2026 20:27:35 UTC (3,571 KB)

Original source

arXiv

https://arxiv.org/abs/2603.03192
