
Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling

arXiv [Submitted on 28 Mar 2026] · March 31, 2026

arXiv:2603.27403v1 Announce Type: cross

Authors: Kai Ye, Qingtao Pan, Shuo Li


Abstract: Large language models (LLMs) need reliable test-time control of hallucinations. Existing conformal methods for LLMs typically provide only marginal guarantees and rely on a single global threshold, which can under-cover hard prompts, over-cover easy ones, and produce oversized prediction sets. We propose Conditional Factuality Control (CFC), a post-hoc conformal framework that returns set-valued outputs with conditional coverage guarantees. CFC defines a continuous, feature-conditional acceptance threshold through augmented quantile regression on a latent "success" score, and deploys it through a fixed-point threshold rule at inference time. Theoretically, we show that CFC satisfies a conditional coverage guarantee under exchangeability and analyze its efficiency, proving that, under mild assumptions on the score distributions, the conditional rule is strictly more sample-efficient than marginal conformal prediction at the same target coverage. We further derive a PAC-style variant, CFC-PAC, which shrinks the nominal risk level based on a stability bound, yielding a finite-sample certificate that the conditional miscoverage deviates from the target by at most $O(\sqrt{\log(1/\delta)/N})$. Empirically, on synthetic data, real-world reasoning and QA benchmarks, and a Flickr8k VLM setting, CFC and CFC-PAC consistently attain near-target coverage across difficulty groups while using smaller prediction sets than CP and non-CP baselines.
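The abstract's motivating problem, that a single global threshold can over-cover easy prompts and under-cover hard ones, can be illustrated with a small simulation. The sketch below is not the paper's CFC method: it contrasts standard split conformal calibration with simple group-wise calibration, used here only as a stand-in for a feature-conditional threshold. The score convention (the correct answer is covered when its score falls at or below the threshold), the uniform score distributions, and every function name are assumptions made for illustration.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score.

    Returns +inf when the calibration set is too small for the requested level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[k - 1] if k <= n else float("inf")


def coverage(scores, threshold):
    """Fraction of prompts whose correct answer passes the threshold."""
    return sum(s <= threshold for s in scores) / len(scores)


def draw_scores(rng, n):
    """Hypothetical score model: hard prompts tend to have larger scores."""
    return {
        "easy": [rng.uniform(0.0, 0.5) for _ in range(n)],
        "hard": [rng.uniform(0.3, 1.0) for _ in range(n)],
    }


def simulate(alpha=0.1, n=2000, seed=0):
    """Compare per-group test coverage under global and group-wise thresholds."""
    rng = random.Random(seed)
    calib, test = draw_scores(rng, n), draw_scores(rng, n)
    # Marginal CP: one global threshold from the pooled calibration scores.
    global_t = conformal_quantile(calib["easy"] + calib["hard"], alpha)
    # Group-wise CP: one threshold per difficulty group (a stand-in for a
    # feature-conditional threshold, not the paper's quantile regression).
    group_t = {g: conformal_quantile(calib[g], alpha) for g in calib}
    return {
        g: {
            "global": coverage(test[g], global_t),
            "groupwise": coverage(test[g], group_t[g]),
        }
        for g in test
    }


if __name__ == "__main__":
    for group, result in simulate().items():
        print(group, result)
```

Under these assumed distributions, the global threshold meets the 90% target only on average across the pooled prompts, while the group-wise thresholds aim at the target within each group. A continuous, feature-conditional rule such as the one the abstract describes would replace the discrete grouping with a learned function of prompt features.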

Comments: CVPR 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.27403 [cs.LG]

(or arXiv:2603.27403v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.27403

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Kai Ye
[v1] Sat, 28 Mar 2026 20:48:13 UTC (950 KB)
