Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG
arXiv:2603.27752v1 Announce Type: new
Authors: Boxi Yu, Yuzhong Zhang, Liting Lin, Lionel Briand, Emir Muñoz
Abstract: Large language models (LLMs) continue to hallucinate in retrieval-augmented generation (RAG), producing claims that are unsupported by or conflict with the retrieved context. Detecting such errors remains challenging when faithfulness is evaluated solely with respect to the retrieved context. Existing approaches either provide coarse-grained, answer-level scores or focus on open-domain factuality, often lacking fine-grained, evidence-grounded diagnostics. We present RT4CHART, a retromorphic testing framework for context-faithfulness assessment. RT4CHART decomposes model outputs into independently verifiable claims and performs hierarchical, local-to-global verification against the retrieved context. Each claim is assigned one of three labels: entailed, contradicted, or baseless. Furthermore, RT4CHART maps claim-level decisions back to specific answer spans and retrieves explicit supporting or refuting evidence from the context, enabling fine-grained and interpretable auditing. We evaluate RT4CHART on RAGTruth++ (408 samples) and RAGTruth-Enhance (2,675 samples), a newly re-annotated benchmark. RT4CHART achieves the best answer-level hallucination detection F1 among all baselines. On RAGTruth++, it reaches an F1 score of 0.776, outperforming the strongest baseline by 83%. On RAGTruth-Enhance, it achieves a span-level F1 of 47.5%. Ablation studies show that the hierarchical verification design is the primary driver of performance gains. Finally, our re-annotation reveals 1.68x more hallucination cases than the original labels, suggesting that existing benchmarks substantially underestimate the prevalence of hallucinations.
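The pipeline the abstract describes (claim decomposition, local-to-global verification, three-way labels, mapping back to answer spans) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Label`, `Verdict`, `split_claims`, `verify_claim`, `audit`, `is_hallucinated`, `toy_verify`) is hypothetical, sentence splitting stands in for real claim decomposition, and "local-to-global" is read as "check each retrieved passage alone, then fall back to the full context", which is only one plausible interpretation of the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Label(Enum):
    ENTAILED = "entailed"
    CONTRADICTED = "contradicted"
    BASELESS = "baseless"


@dataclass
class Verdict:
    claim: str
    span: Tuple[int, int]     # character offsets of the claim in the answer
    label: Label
    evidence: Optional[str]   # supporting or refuting context, if any


# Assumption: a verifier maps (claim, context text) to one of the three labels.
# In practice this would be an LLM or NLI model.
Verifier = Callable[[str, str], Label]


def split_claims(answer: str) -> List[Tuple[str, Tuple[int, int]]]:
    """Toy decomposition: one claim per sentence, with character offsets."""
    claims = []
    start = 0
    for i, ch in enumerate(answer):
        if ch in ".!?":
            text = answer[start:i + 1].strip()
            if text:
                begin = answer.index(text, start)
                claims.append((text, (begin, begin + len(text))))
            start = i + 1
    tail = answer[start:].strip()
    if tail:
        begin = answer.index(tail, start)
        claims.append((tail, (begin, begin + len(tail))))
    return claims


def verify_claim(claim: str, passages: List[str],
                 verify: Verifier) -> Tuple[Label, Optional[str]]:
    """Local pass over single passages, then a global pass over all of them."""
    for passage in passages:
        label = verify(claim, passage)
        if label is not Label.BASELESS:
            return label, passage
    full_context = " ".join(passages)
    label = verify(claim, full_context)
    return label, (full_context if label is not Label.BASELESS else None)


def audit(answer: str, passages: List[str], verify: Verifier) -> List[Verdict]:
    """Label every claim and keep its answer span and evidence."""
    verdicts = []
    for text, span in split_claims(answer):
        label, evidence = verify_claim(text, passages, verify)
        verdicts.append(Verdict(text, span, label, evidence))
    return verdicts


def is_hallucinated(verdicts: List[Verdict]) -> bool:
    """Answer-level flag: any claim that is not entailed."""
    return any(v.label is not Label.ENTAILED for v in verdicts)


def toy_verify(claim: str, context: str) -> Label:
    """Stand-in verifier: substring match only; it never returns CONTRADICTED."""
    needle = claim.rstrip(".!?").lower()
    return Label.ENTAILED if needle in context.lower() else Label.BASELESS
```

Calling `audit(answer, passages, toy_verify)` returns one `Verdict` per sentence. Because each verdict carries its character span and the passage that decided it, a flagged answer can be traced to the exact text and evidence involved, which is the kind of fine-grained auditing the abstract contrasts with answer-level scores.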
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
ACM classes: D.2.5; I.2.7; H.3.3
Cite as: arXiv:2603.27752 [cs.CL]
(or arXiv:2603.27752v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.27752
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Boxi Yu [v1] Sun, 29 Mar 2026 16:12:18 UTC (500 KB)