Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation
arXiv:2604.02543v1 Announce Type: new
Abstract: As vision-language models (VLMs) are increasingly deployed in clinical decision support, more than accuracy is required: knowing when to trust their predictions is equally critical. Yet, a comprehensive and systematic investigation into the overconfidence of these models remains notably scarce in the medical domain. We address this gap through a comprehensive empirical study of confidence calibration in VLMs, spanning three model families (Qwen3-VL, InternVL3, LLaVA-NeXT), three model scales (2B--38B), and multiple confidence estimation prompting strategies, across three medical visual question answering (VQA) benchmarks. Our study yields three key findings: First, overconfidence persists across model families and is not resolved by scaling or prompting, such as chain-of-thought and verbalized confidence variants. Second, simple post-hoc calibration approaches, such as Platt scaling, reduce calibration error and consistently outperform prompt-based strategies. Third, due to their (strict) monotonicity, these post-hoc calibration methods are inherently limited in improving the discriminative quality of predictions, leaving AUROC at the same level. Motivated by these findings, we investigate hallucination-aware calibration (HAC), which incorporates vision-grounded hallucination detection signals as complementary inputs to refine confidence estimates. We find that leveraging these hallucination signals improves both calibration and AUROC, with the largest gains on open-ended questions. Overall, our findings suggest post-hoc calibration as standard practice for medical VLM deployment over raw confidence estimates, and highlight the practical usefulness of hallucination signals to enable more reliable use of VLMs in medical VQA.
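The abstract's three quantitative points can be illustrated with a small synthetic experiment: Platt scaling shrinks expected calibration error (ECE) but, being a strictly monotone map of the raw confidence, cannot change the ranking and hence leaves AUROC fixed, whereas adding a second, independent signal (here a stand-in for a hallucination-detection score) can improve AUROC. This is a minimal sketch with fully synthetic data, not the paper's actual pipeline; the `ece` helper and the construction of `hallu` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000

# Synthetic "overconfident model": raw confidences cluster near 1
# while actual correctness is only about 60%.
correct = rng.binomial(1, 0.6, size=n)
raw_conf = np.clip(0.7 + 0.25 * correct + rng.normal(0, 0.1, n), 1e-3, 1 - 1e-3)

def ece(conf, label, bins=10):
    """Expected calibration error with equal-width confidence bins."""
    edges = np.linspace(0, 1, bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            err += mask.mean() * abs(conf[mask].mean() - label[mask].mean())
    return err

# Post-hoc Platt scaling: 1-D logistic regression on the logit of raw confidence.
# The fitted sigmoid-of-linear map is strictly monotone, so the ranking of
# examples -- and therefore AUROC -- is unchanged; only calibration improves.
logit = np.log(raw_conf / (1 - raw_conf)).reshape(-1, 1)
cal_conf = LogisticRegression().fit(logit, correct).predict_proba(logit)[:, 1]

# Hypothetical hallucination-aware variant: a second score that is (by
# construction) higher when the answer is wrong, used as a complementary
# feature. A 2-D feature map is no longer monotone in raw confidence alone,
# so the ranking -- and AUROC -- can change.
hallu = rng.normal(0.7 - 0.4 * correct, 0.15)
feats = np.column_stack([logit.ravel(), hallu])
hac_conf = LogisticRegression().fit(feats, correct).predict_proba(feats)[:, 1]

print(f"ECE    raw {ece(raw_conf, correct):.3f} | Platt {ece(cal_conf, correct):.3f} | HAC {ece(hac_conf, correct):.3f}")
print(f"AUROC  raw {roc_auc_score(correct, raw_conf):.3f} | Platt {roc_auc_score(correct, cal_conf):.3f} | HAC {roc_auc_score(correct, hac_conf):.3f}")
```

The AUROC invariance is not an artifact of this data: any strictly increasing transform of the scores preserves their pairwise ordering, and AUROC depends only on that ordering, which is why the abstract argues monotone post-hoc calibration cannot improve discrimination.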
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2604.02543 [cs.CV]
(or arXiv:2604.02543v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2604.02543
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Ji Young Byun [v1] Thu, 2 Apr 2026 21:52:29 UTC (332 KB)
