Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG

arXiv · March 31, 2026

Authors: Boxi Yu, Yuzhong Zhang, Liting Lin, Lionel Briand, Emir Muñoz


Abstract: Large language models (LLMs) continue to hallucinate in retrieval-augmented generation (RAG), producing claims that are unsupported by or conflict with the retrieved context. Detecting such errors remains challenging when faithfulness is evaluated solely with respect to the retrieved context. Existing approaches either provide coarse-grained, answer-level scores or focus on open-domain factuality, often lacking fine-grained, evidence-grounded diagnostics. We present RT4CHART, a retromorphic testing framework for context-faithfulness assessment. RT4CHART decomposes model outputs into independently verifiable claims and performs hierarchical, local-to-global verification against the retrieved context. Each claim is assigned one of three labels: entailed, contradicted, or baseless. Furthermore, RT4CHART maps claim-level decisions back to specific answer spans and retrieves explicit supporting or refuting evidence from the context, enabling fine-grained and interpretable auditing. We evaluate RT4CHART on RAGTruth++ (408 samples) and RAGTruth-Enhance (2,675 samples), a newly re-annotated benchmark. RT4CHART achieves the best answer-level hallucination detection F1 among all baselines. On RAGTruth++, it reaches an F1 score of 0.776, outperforming the strongest baseline by 83%. On RAGTruth-Enhance, it achieves a span-level F1 of 47.5%. Ablation studies show that the hierarchical verification design is the primary driver of performance gains. Finally, our re-annotation reveals 1.68x more hallucination cases than the original labels, suggesting that existing benchmarks substantially underestimate the prevalence of hallucinations.
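The pipeline the abstract outlines (decompose the answer into claims, verify each claim locally and then globally against the context, assign one of three labels, and map the decision back to an answer span with evidence) can be illustrated with a toy sketch. This is not the RT4CHART implementation: the abstract does not specify how claims are extracted or verified, so the sketch assumes sentence splitting for decomposition and a token-overlap heuristic with a crude negation check in place of a real verifier. All names here (`Verdict`, `verify_claim`, `audit`, the 0.6 threshold) are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

ENTAILED, CONTRADICTED, BASELESS = "entailed", "contradicted", "baseless"
NEGATIONS = {"not", "no", "never"}
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "in", "to", "and"}


@dataclass
class Verdict:
    claim: str
    span: Tuple[int, int]      # character offsets of the claim in the answer
    label: str                 # entailed / contradicted / baseless
    evidence: Optional[str]    # context text that supports or refutes the claim


def split_sentences(text: str) -> List[Tuple[str, int, int]]:
    """Assumed decomposition step: one claim per sentence, with offsets."""
    out = []
    for m in re.finditer(r"[^.!?]+[.!?]?", text):
        raw = m.group()
        stripped = raw.strip()
        if stripped:
            start = m.start() + (len(raw) - len(raw.lstrip()))
            out.append((stripped, start, start + len(stripped)))
    return out


def words(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(claim: str, evidence: str) -> float:
    """Fraction of the claim's content words that also occur in the evidence."""
    claim_words = words(claim) - STOPWORDS - NEGATIONS
    if not claim_words:
        return 0.0
    return len(claim_words & words(evidence)) / len(claim_words)


def verify_claim(claim: str, context_sentences: List[str],
                 threshold: float = 0.6) -> Tuple[str, Optional[str]]:
    # Local stage: look for a single context sentence that covers the claim.
    best = max(context_sentences, key=lambda s: overlap(claim, s), default=None)
    if best is not None and overlap(claim, best) >= threshold:
        # Toy polarity check: a negation on exactly one side means conflict.
        claim_neg = bool(words(claim) & NEGATIONS)
        evidence_neg = bool(words(best) & NEGATIONS)
        return (CONTRADICTED if claim_neg != evidence_neg else ENTAILED), best
    # Global stage: support may be spread over several sentences.
    # Simplification: only entailment is tested at this level.
    whole = " ".join(context_sentences)
    if overlap(claim, whole) >= threshold:
        return ENTAILED, whole
    return BASELESS, None


def audit(answer: str, context: str) -> List[Verdict]:
    context_sentences = [s for s, _, _ in split_sentences(context)]
    verdicts = []
    for claim, start, end in split_sentences(answer):
        label, evidence = verify_claim(claim, context_sentences)
        verdicts.append(Verdict(claim, (start, end), label, evidence))
    return verdicts


if __name__ == "__main__":
    context = ("The Eiffel Tower is in Paris. It was completed in 1889. "
               "The tower is not open at night.")
    answer = ("The Eiffel Tower is in Paris. The tower is open at night. "
              "It was designed by aliens.")
    for v in audit(answer, context):
        print(v.label, v.span, repr(v.claim), "<-", repr(v.evidence))
```

Because each verdict carries both a span and an evidence string, an answer-level decision can be derived from it, for example by flagging an answer as hallucinated when any claim is contradicted or baseless. A realistic verifier would replace the overlap heuristic with an entailment model or an LLM judge.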

Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)

ACM classes: D.2.5; I.2.7; H.3.3

Cite as: arXiv:2603.27752 [cs.CL] (or arXiv:2603.27752v1 [cs.CL] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.27752 (arXiv-issued DOI via DataCite, pending registration)

Submission history

From: Boxi Yu. [v1] Sun, 29 Mar 2026 16:12:18 UTC (500 KB)

