Live
Black Hat USAAI BusinessBlack Hat AsiaAI BusinessGeopolitics, AI, and Cybersecurity: Insights From RSAC 2026Dark ReadingThis International Fact-Checking Day, use these 5 tips to spot AI-generated contentFast Company TechFilevine Emphasizes Ethical AI and Autonomous Systems in Legal Tech Strategy - TipRanksGNews AI ethicsPriced Out by AI: The Memory Chip Crisis Hitting Every ConsumerHacker News AI TopShow HN: AgentDog – Open-source dashboard for monitoring local AI agentsHacker News AI TopAI Enforcement Accelerates as Federal Policy Stalls and States Step In - Morgan LewisGNews AI USAGemma 4 and Qwen3.5 on shared benchmarksReddit r/LocalLLaMA[P] Gemma 4 running on NVIDIA B200 and AMD MI355X from the same inference stack, 15% throughput gain over vLLM on BlackwellReddit r/MachineLearningThe energy and environmental impact of AI and how it undermines democracy - greenpeace.orgGNews AI energyShow HN: A TUI for checking and comparing cloud and AI pricingHacker News AI TopAttorney General Pam Bondi pushed outAxios TechShow HN: Screenbox – Self-hosted virtual desktops for AI agentsHacker News AI TopBlack Hat USAAI BusinessBlack Hat AsiaAI BusinessGeopolitics, AI, and Cybersecurity: Insights From RSAC 2026Dark ReadingThis International Fact-Checking Day, use these 5 tips to spot AI-generated contentFast Company TechFilevine Emphasizes Ethical AI and Autonomous Systems in Legal Tech Strategy - TipRanksGNews AI ethicsPriced Out by AI: The Memory Chip Crisis Hitting Every ConsumerHacker News AI TopShow HN: AgentDog – Open-source dashboard for monitoring local AI agentsHacker News AI TopAI Enforcement Accelerates as Federal Policy Stalls and States Step In - Morgan LewisGNews AI USAGemma 4 and Qwen3.5 on shared benchmarksReddit r/LocalLLaMA[P] Gemma 4 running on NVIDIA B200 and AMD MI355X from the same inference stack, 15% throughput gain over vLLM on BlackwellReddit r/MachineLearningThe energy and environmental impact of AI and how it undermines democracy - greenpeace.orgGNews AI energyShow HN: A TUI for checking and comparing cloud and AI pricingHacker News AI TopAttorney General Pam Bondi pushed outAxios TechShow HN: Screenbox – Self-hosted virtual desktops for AI agentsHacker News AI Top
AI NEWS HUBbyEIGENVECTOREigenvector

Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

arXivMarch 30, 202610 min read0 views
Source Quiz

arXiv:2602.11391v3 Announce Type: replace Abstract: Objective: This paper introduces a patient simulator for scalable, automated evaluation of healthcare conversational agents, generating realistic, controllable interactions that systematically vary across medical, linguistic, and behavioral dimensions to support risk assessment across populations. Methods: Grounded in the NIST AI Risk Management Framework, the simulator integrates three profile components: (1) medical profiles constructed from All of Us electronic health records using risk-ratio gating; (2) linguistic profiles modeling health — Md Tanvir Rouf Shawon, Mohammad Sabik Irbaz, Hadeel R. A. Elyazori, Keerti Reddy Resapu, Yili Lin, Vladimir Franzuela Cardenas, Farrokh Alemi, Kevin Lybarger

View PDF

Abstract:Objective: This paper introduces a patient simulator for scalable, automated evaluation of healthcare conversational agents, generating realistic, controllable interactions that systematically vary across medical, linguistic, and behavioral dimensions to support risk assessment across populations. Methods: Grounded in the NIST AI Risk Management Framework, the simulator integrates three profile components: (1) medical profiles constructed from All of Us electronic health records using risk-ratio gating; (2) linguistic profiles modeling health literacy and condition-specific communication; and (3) behavioral profiles representing cooperative, distracted, and adversarial engagement. Profiles were evaluated against NIST AI RMF trustworthiness requirements and assessed against an AI Decision Aid for antidepressant selection. Results: Across 500 simulated conversations, the simulator revealed monotonic degradation in AI Decision Aid performance across health literacy levels: Rank-1 concept retrieval ranged from 47.6% (limited) to 81.9% (proficient), with corresponding recommendation degradation. Medical concept fidelity was high (96.6% across 8,210 concepts), validated by human annotators (0.73 kappa) and an LLM judge with comparable agreement (0.78 kappa). Behavioral profiles were reliably distinguished (0.93 kappa), and linguistic profiles showed moderate agreement (0.61 kappa). Conclusions: The simulator exposes measurable performance risks in conversational healthcare AI. Health literacy emerged as a primary risk factor with direct implications for equitable AI deployment.

Subjects:

Computation and Language (cs.CL)

Cite as: arXiv:2602.11391 [cs.CL]

(or arXiv:2602.11391v3 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.11391

arXiv-issued DOI via DataCite

Submission history

From: Md Tanvir Rouf Shawon [view email] [v1] Wed, 11 Feb 2026 21:53:18 UTC (1,309 KB) [v2] Wed, 25 Mar 2026 17:20:58 UTC (1,253 KB) [v3] Fri, 27 Mar 2026 15:00:31 UTC (1,253 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Advancing A…researchpaperarxivnlplanguage-mo…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 91 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!