Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations
arXiv:2603.29373v1 Announce Type: new
Abstract: Large language models (LLMs) are increasingly used for medical consultation and health information support. In this high-stakes setting, safety depends not only on medical knowledge, but also on how models respond when patient inputs are unclear, inconsistent, or misleading. However, most existing medical LLM evaluations assume idealized and well-posed patient questions, which limits their realism. In this paper, we study challenging patient behaviors that commonly arise in real medical consultations and complicate safe clinical reasoning. We define four clinically grounded categories of such behaviors: information contradiction, factual inaccuracy, self-diagnosis, and care resistance. For each behavior, we specify concrete failure criteria that capture unsafe responses. Building on four existing medical dialogue datasets, we introduce CPB-Bench (Challenging Patient Behaviors Benchmark), a bilingual (English and Chinese) benchmark of 692 multi-turn dialogues annotated with these behaviors. We evaluate a range of open- and closed-source LLMs on their responses to challenging patient utterances. While models perform well overall, we identify consistent, behavior-specific failure patterns, with particular difficulty in handling contradictory or medically implausible patient information. We also study four intervention strategies and find that they yield inconsistent improvements and can introduce unnecessary corrections. We release the dataset and code.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2603.29373 [cs.CL]
(or arXiv:2603.29373v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.29373
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Yahan Li [v1] Tue, 31 Mar 2026 07:42:07 UTC (999 KB)