Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations
arXiv:2603.29373v1 Announce Type: new
Abstract: Large language models (LLMs) are increasingly used for medical consultation and health information support. In this high-stakes setting, safety depends not only on medical knowledge, but also on how models respond when patient inputs are unclear, inconsistent, or misleading. However, most existing medical LLM evaluations assume idealized and well-posed patient questions, which limits their realism. In this paper, we study challenging patient behaviors that commonly arise in real medical consultations and complicate safe clinical reasoning. We define four clinically grounded categories of such behaviors: information contradiction, factual inaccuracy, self-diagnosis, and care resistance. For each behavior, we specify concrete failure criteria that capture unsafe responses. Building on four existing medical dialogue datasets, we introduce CPB-Bench (Challenging Patient Behaviors Benchmark), a bilingual (English and Chinese) benchmark of 692 multi-turn dialogues annotated with these behaviors. We evaluate a range of open- and closed-source LLMs on their responses to challenging patient utterances. While models perform well overall, we identify consistent, behavior-specific failure patterns, with particular difficulty in handling contradictory or medically implausible patient information. We also study four intervention strategies and find that they yield inconsistent improvements and can introduce unnecessary corrections. We release the dataset and code.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2603.29373 [cs.CL]
(or arXiv:2603.29373v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.29373
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Yahan Li [v1] Tue, 31 Mar 2026 07:42:07 UTC (999 KB)