Does Explanation Correctness Matter? Linking Computational XAI Evaluation to Human Understanding

arXiv, March 26, 2026
Authors: Gregor Baer, Chao Zhang, Isel Grau

Abstract: Explainable AI (XAI) methods are commonly evaluated with functional metrics such as correctness, which computationally estimate how accurately an explanation reflects the model's reasoning. Higher correctness is assumed to produce better human understanding, but this link has not been tested experimentally with controlled levels. We conducted a user study (N=200) that manipulated explanation correctness at four levels (100%, 85%, 70%, 55%) in a time series classification task where participants could not rely on domain knowledge or visual intuition and instead predicted the AI's decisions based on explanations (forward simulation). Correctness affected understanding, but not at every level: performance dropped at 70% and 55% correctness relative to fully correct explanations, while further degradation below 70% produced no additional loss. Rather than shifting performance uniformly, lower correctness decreased the proportion of participants who learned the decision pattern. At the same time, even fully correct explanations did not guarantee understanding, as only a subset of participants achieved high accuracy. Exploratory analyses showed that self-reported ratings correlated with demonstrated performance only when explanations were fully correct and participants had learned the pattern. These findings show that not all differences in functional correctness translate to differences in human understanding, underscoring the need to validate functional metrics against human outcomes.
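As a reading aid, the forward-simulation measure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names, the record layout, and the 0.8 cutoff for counting a participant as having "learned the pattern" are all hypothetical, and the toy records are invented for illustration rather than taken from the study.

```python
from collections import defaultdict


def forward_simulation_accuracy(participant_predictions, model_decisions):
    """Fraction of trials where the participant predicted the model's decision."""
    if len(participant_predictions) != len(model_decisions):
        raise ValueError("prediction and decision lists must have equal length")
    if not model_decisions:
        raise ValueError("at least one trial is required")
    matches = sum(p == d for p, d in zip(participant_predictions, model_decisions))
    return matches / len(model_decisions)


def learner_share_by_condition(records, threshold=0.8):
    """Share of participants per condition whose accuracy meets the threshold.

    records: iterable of (condition, predictions, decisions) tuples,
             one tuple per participant.
    threshold: hypothetical cutoff for "learned the pattern"; the paper's
               actual criterion is not stated in the abstract.
    """
    totals = defaultdict(int)
    learners = defaultdict(int)
    for condition, predictions, decisions in records:
        totals[condition] += 1
        if forward_simulation_accuracy(predictions, decisions) >= threshold:
            learners[condition] += 1
    return {condition: learners[condition] / totals[condition] for condition in totals}


# Toy records, invented for illustration only (not data from the study).
# The condition label is the explanation correctness level in percent.
toy_records = [
    (100, [1, 0, 1, 1, 0], [1, 0, 1, 1, 0]),  # 5 of 5 trials match
    (100, [1, 1, 0, 1, 0], [1, 0, 1, 1, 0]),  # 3 of 5 trials match
    (55, [0, 1, 0, 0, 1], [1, 0, 1, 1, 0]),   # 0 of 5 trials match
    (55, [1, 0, 1, 0, 0], [1, 0, 1, 1, 0]),   # 4 of 5 trials match
]
shares = learner_share_by_condition(toy_records)
```

Reporting a per-condition share of learners alongside mean accuracy reflects the abstract's observation that lower correctness reduced the proportion of participants who learned the pattern rather than shifting everyone's performance uniformly.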

Comments: 24 pages, 9 figures, 2 tables

Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as: arXiv:2603.25251 [cs.HC]

(or arXiv:2603.25251v1 [cs.HC] for this version)

https://doi.org/10.48550/arXiv.2603.25251

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Gregor Baer [v1] Thu, 26 Mar 2026 09:55:10 UTC (401 KB)
