Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models
Authors: Ji Ma, Wei Suo, Peng Wang, Yanning Zhang
Abstract: Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capability in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems, which those studies attribute to diminished visual attention during generation. However, visual attention decay is already a well-studied problem in Large Vision-Language Models (LVLMs). Considering the fundamental differences between the reasoning processes of MCoT models and traditional LVLMs, we raise a basic question: do MCoT models have unique causes of hallucination? To answer it, we systematically investigate the hallucination patterns of MCoT models and find that fabricated text is primarily generated in associative reasoning steps, which we term divergent thinking. Building on this insight, we introduce a simple strategy that localizes divergent thinking steps and intervenes in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing methods by a large margin. More importantly, it can be conveniently integrated with other hallucination mitigation methods to further boost their performance. The code is publicly available at this https URL.
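The abstract does not say how divergent thinking steps are localized or how decoding is changed. As a rough illustration of step-gated intervention only, the sketch below assumes each reasoning step already has a hypothetical `divergence_score` in [0, 1]. When the score crosses a threshold, it applies a generic contrastive-decoding-style logit adjustment as a stand-in. The function names, the score, the threshold, and `alpha` are all assumptions for illustration, not the authors' method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def intervene(logits_with_image, logits_without_image,
              divergence_score, threshold=0.5, alpha=1.0):
    """Adjust next-token logits only on steps flagged as divergent.

    divergence_score is a hypothetical per-step score; how to compute it
    is not described in the abstract. The adjustment is a generic
    contrastive form, (1 + alpha) * with_image - alpha * without_image,
    used here purely as a placeholder intervention.
    """
    if divergence_score < threshold:
        # Step looks grounded: leave the logits untouched.
        return list(logits_with_image)
    return [(1 + alpha) * a - alpha * b
            for a, b in zip(logits_with_image, logits_without_image)]


def greedy_token(logits):
    """Index of the highest-probability token."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy two-token vocabulary: token 1 is favored by the language prior alone.
with_image = [1.0, 2.0]
without_image = [0.0, 3.0]

grounded = intervene(with_image, without_image, divergence_score=0.2)
divergent = intervene(with_image, without_image, divergence_score=0.9)
print(greedy_token(grounded), greedy_token(divergent))
```

In this toy case the flagged step shifts the greedy choice away from the token that the image-free logits prefer, while the unflagged step is decoded unchanged. Gating the intervention per step is the only idea taken from the abstract.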
Comments: CVPR 2026
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.27201 [cs.CV]
(or arXiv:2603.27201v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.27201
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Ji Ma
[v1] Sat, 28 Mar 2026 08:56:19 UTC (4,378 KB)