
Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation

arXiv cs.CV, by Ji Young Byun, Young-Jin Park, Jean-Philippe Corbeil, Asma Ben Abacha. April 6, 2026



Abstract: As vision-language models (VLMs) are increasingly deployed in clinical decision support, accuracy alone is not enough: knowing when to trust their predictions is equally critical. Yet systematic investigations of overconfidence in these models remain scarce in the medical domain. We address this gap through a comprehensive empirical study of confidence calibration in VLMs, spanning three model families (Qwen3-VL, InternVL3, LLaVA-NeXT), three model scales (2B to 38B), and multiple confidence estimation prompting strategies, across three medical visual question answering (VQA) benchmarks. Our study yields three key findings. First, overconfidence persists across model families and is not resolved by scaling or by prompting, including chain-of-thought and verbalized confidence variants. Second, simple post-hoc calibration approaches, such as Platt scaling, reduce calibration error and consistently outperform the prompt-based strategies. Third, because they are (strictly) monotonic, these post-hoc calibration methods are inherently limited in improving the discriminative quality of predictions, leaving AUROC at the same level. Motivated by these findings, we investigate hallucination-aware calibration (HAC), which incorporates vision-grounded hallucination detection signals as complementary inputs to refine confidence estimates. We find that leveraging these hallucination signals improves both calibration and AUROC, with the largest gains on open-ended questions. Overall, our findings suggest adopting post-hoc calibration, rather than raw confidence estimates, as standard practice for medical VLM deployment, and highlight the practical usefulness of hallucination signals for more reliable use of VLMs in medical VQA.
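The second and third findings can be illustrated with a small sketch. It is not the authors' code or data: the synthetic "overconfident model", the helper names (`fit_platt`, `ece`, `auroc`), the 10-bin expected calibration error, and the gradient-descent fit are all assumptions chosen for illustration. It fits Platt scaling, here taken as a logistic map `sigmoid(a * logit(conf) + b)`, on one half of the data and checks on the other half that calibration error drops while AUROC does not move, because a strictly increasing map (a > 0) preserves the ranking of predictions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return math.log(p / (1.0 - p))

def auroc(scores, labels):
    """Chance that a correct answer outscores an incorrect one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    err = 0.0
    for b in bins:
        if b:
            mean_conf = sum(p for p, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            err += len(b) / len(probs) * abs(mean_conf - accuracy)
    return err

def fit_platt(confs, labels, lr=0.2, steps=1000):
    """Fit a, b in sigmoid(a * logit(conf) + b) by gradient descent on log loss."""
    xs = [logit(c) for c in confs]
    a, b = 1.0, 0.0  # start from the identity map
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, labels):
            d = sigmoid(a * x + b) - y
            grad_a += d * x
            grad_b += d
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Synthetic overconfident model (an assumption, not data from the paper):
# the true chance of being correct is p, but the reported confidence is inflated.
random.seed(0)
confs, labels = [], []
for _ in range(2000):
    p = random.uniform(0.3, 0.9)
    labels.append(1 if random.random() < p else 0)
    confs.append(sigmoid(2.0 * logit(p) + 1.5))

# Fit on the first half, evaluate on the held-out second half.
a, b = fit_platt(confs[:1000], labels[:1000])
test_confs, test_labels = confs[1000:], labels[1000:]
calibrated = [sigmoid(a * logit(c) + b) for c in test_confs]

ece_raw, ece_cal = ece(test_confs, test_labels), ece(calibrated, test_labels)
auc_raw, auc_cal = auroc(test_confs, test_labels), auroc(calibrated, test_labels)
print(f"ECE   raw={ece_raw:.3f}  calibrated={ece_cal:.3f}")
print(f"AUROC raw={auc_raw:.3f}  calibrated={auc_cal:.3f}")
```

Since the fitted map only rescales each confidence on its own, improving AUROC requires an additional input that can reorder predictions, which is the role the abstract assigns to hallucination signals in HAC.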

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Cite as: arXiv:2604.02543 [cs.CV]

(or arXiv:2604.02543v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2604.02543

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Ji Young Byun
[v1] Thu, 2 Apr 2026 21:52:29 UTC (332 KB)
