
Does Explanation Correctness Matter? Linking Computational XAI Evaluation to Human Understanding

arXiv, March 26, 2026

Authors: Gregor Baer, Chao Zhang, Isel Grau


Abstract: Explainable AI (XAI) methods are commonly evaluated with functional metrics such as correctness, which computationally estimate how accurately an explanation reflects the model's reasoning. Higher correctness is assumed to produce better human understanding, but this link has not been tested experimentally with controlled levels. We conducted a user study (N=200) that manipulated explanation correctness at four levels (100%, 85%, 70%, 55%) in a time series classification task where participants could not rely on domain knowledge or visual intuition and instead predicted the AI's decisions based on explanations (forward simulation). Correctness affected understanding, but not at every level: performance dropped at 70% and 55% correctness relative to fully correct explanations, while further degradation below 70% produced no additional loss. Rather than shifting performance uniformly, lower correctness decreased the proportion of participants who learned the decision pattern. At the same time, even fully correct explanations did not guarantee understanding, as only a subset of participants achieved high accuracy. Exploratory analyses showed that self-reported ratings correlated with demonstrated performance only when explanations were fully correct and participants had learned the pattern. These findings show that not all differences in functional correctness translate to differences in human understanding, underscoring the need to validate functional metrics against human outcomes.
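The forward simulation measure described in the abstract can be illustrated with a minimal sketch. It is not the authors' analysis code: the function names, the toy inputs, and the idea of a fixed accuracy threshold for counting a participant as having "learned the pattern" are all assumptions made for illustration. The sketch only shows the general shape of the computation: score each participant by how often their prediction matches the AI's decision, then look at the share of participants in each correctness condition who reach a chosen accuracy level.

```python
def forward_simulation_accuracy(predictions, model_decisions):
    """Fraction of trials where a participant's prediction matches the AI's decision."""
    if not predictions or len(predictions) != len(model_decisions):
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(p == d for p, d in zip(predictions, model_decisions))
    return matches / len(predictions)


def proportion_of_learners(accuracies_by_condition, threshold):
    """Share of participants per condition whose accuracy reaches the threshold.

    The threshold is a hypothetical cutoff chosen by the caller; the abstract
    does not state how "learned the decision pattern" was operationalized.
    """
    return {
        condition: sum(acc >= threshold for acc in accuracies) / len(accuracies)
        for condition, accuracies in accuracies_by_condition.items()
    }


# Toy inputs only, not data from the study.
accuracy = forward_simulation_accuracy(["A", "B", "A", "A"], ["A", "B", "B", "A"])
learners = proportion_of_learners({"100%": [0.9, 0.5, 0.95, 0.6]}, threshold=0.8)
```

Reporting the share of learners alongside the mean accuracy reflects the abstract's point that lower correctness reduced the proportion of participants who learned the pattern rather than shifting everyone's performance uniformly.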

Comments: 24 pages, 9 figures, 2 tables

Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as: arXiv:2603.25251 [cs.HC]

(or arXiv:2603.25251v1 [cs.HC] for this version)

https://doi.org/10.48550/arXiv.2603.25251

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Gregor Baer [view email] [v1] Thu, 26 Mar 2026 09:55:10 UTC (401 KB)

Original source

arXiv

https://arxiv.org/abs/2603.25251v1
