AI News Hub · Eigenvector

Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations

arXiv cs.CL · [Submitted on 31 Mar 2026]


Abstract: Large language models (LLMs) are increasingly used for medical consultation and health information support. In this high-stakes setting, safety depends not only on medical knowledge, but also on how models respond when patient inputs are unclear, inconsistent, or misleading. However, most existing medical LLM evaluations assume idealized and well-posed patient questions, which limits their realism. In this paper, we study challenging patient behaviors that commonly arise in real medical consultations and complicate safe clinical reasoning. We define four clinically grounded categories of such behaviors: information contradiction, factual inaccuracy, self-diagnosis, and care resistance. For each behavior, we specify concrete failure criteria that capture unsafe responses. Building on four existing medical dialogue datasets, we introduce CPB-Bench (Challenging Patient Behaviors Benchmark), a bilingual (English and Chinese) benchmark of 692 multi-turn dialogues annotated with these behaviors. We evaluate a range of open- and closed-source LLMs on their responses to challenging patient utterances. While models perform well overall, we identify consistent, behavior-specific failure patterns, with particular difficulty in handling contradictory or medically implausible patient information. We also study four intervention strategies and find that they yield inconsistent improvements and can introduce unnecessary corrections. We release the dataset and code.
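The evaluation the abstract describes — scoring model responses to annotated multi-turn dialogues against per-behavior failure criteria — can be sketched as a small harness. This is a hypothetical illustration only: the names below (`Dialogue`, `BEHAVIORS`, `is_unsafe`, `evaluate`) are assumptions, not the actual CPB-Bench schema or judging procedure, and the stub failure check stands in for the paper's concrete per-behavior criteria.

```python
from dataclasses import dataclass

# The four clinically grounded behavior categories named in the abstract.
BEHAVIORS = (
    "information_contradiction",
    "factual_inaccuracy",
    "self_diagnosis",
    "care_resistance",
)

@dataclass
class Dialogue:
    turns: list    # alternating patient/model utterances leading to the test point
    behavior: str  # one of BEHAVIORS, as annotated in the benchmark
    language: str  # "en" or "zh" (the benchmark is bilingual)

def is_unsafe(response: str, behavior: str) -> bool:
    """Stub failure check: flags a response that simply agrees with a
    problematic patient utterance instead of addressing it. A real judge
    would apply the paper's concrete per-behavior failure criteria."""
    return "you are right" in response.lower()

def evaluate(dialogues, model_fn):
    """Compute per-behavior failure rates for model_fn over the dialogues."""
    failures = {b: 0 for b in BEHAVIORS}
    counts = {b: 0 for b in BEHAVIORS}
    for d in dialogues:
        response = model_fn(d.turns)          # model's reply to the last patient turn
        counts[d.behavior] += 1
        if is_unsafe(response, d.behavior):
            failures[d.behavior] += 1
    return {b: failures[b] / counts[b] for b in BEHAVIORS if counts[b]}
```

Reporting failure rates per behavior rather than a single aggregate score is what lets the paper surface the behavior-specific patterns it describes, such as the particular difficulty with contradictory or medically implausible patient information.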

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2603.29373 [cs.CL]

(or arXiv:2603.29373v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.29373

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yahan Li [v1] Tue, 31 Mar 2026 07:42:07 UTC (999 KB)


