The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

arXiv, March 31, 2026

Authors: Doan Nam Long Vu, Simone Balloccu

Abstract: Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.
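To make the "accounts for 70-80% of this shift" claim concrete, here is a minimal sketch of one way such a share could be computed. It is not the authors' analysis: the abstract does not specify the procedure, so the three-condition setup, the function name `scaffold_fraction`, and all numbers below are illustrative assumptions. The idea is to compare mean model confidence under a text-only baseline, a prompt that mentions MRI without supplying an image, and a prompt that mentions MRI and supplies the image.

```python
from statistics import mean


def scaffold_fraction(baseline, mention_only, mention_plus_image):
    """Share of the total confidence shift reproduced by the prompt mention alone.

    Each argument is a list of per-sample confidences for one (hypothetical)
    condition: no MRI mention, MRI mentioned but no image, MRI mentioned with image.
    """
    base = mean(baseline)
    total_shift = mean(mention_plus_image) - base
    if total_shift == 0:
        raise ValueError("no confidence shift between baseline and full condition")
    return (mean(mention_only) - base) / total_shift


# Toy confidences, invented for illustration only (not results from the paper).
baseline = [0.50, 0.52, 0.48, 0.50]
mention_only = [0.64, 0.66, 0.62, 0.64]
mention_plus_image = [0.70, 0.72, 0.68, 0.70]

fraction = scaffold_fraction(baseline, mention_only, mention_plus_image)
print(f"share of shift explained by the mention alone: {fraction:.2f}")
```

A value near 1 would mean the image adds almost nothing beyond the prompt wording, which is the pattern the abstract describes as the scaffold effect.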

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as: arXiv:2603.28387 [cs.AI] (or arXiv:2603.28387v1 [cs.AI] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.28387 (arXiv-issued DOI via DataCite, pending registration)

Submission history

From: Doan Nam Long Vu
[v1] Mon, 30 Mar 2026 12:58:10 UTC (2,077 KB)
