Information-Theoretic Limits of Safety Verification for Self-Improving Systems

arXiv, March 31, 2026. [Submitted on 30 Mar 2026 (v1), last revised 2 Apr 2026 (this version, v2)]

Arsenios Scrivens

Abstract: Can a safety gate permit unbounded beneficial self-modification while maintaining bounded cumulative risk? We formalize this question through dual conditions -- requiring sum delta_n < infinity (bounded risk) and sum TPR_n = infinity (unbounded utility) -- and establish a theory of their (in)compatibility. Classification impossibility (Theorem 1): For power-law risk schedules delta_n = O(n^{-p}) with p > 1, any classifier-based gate under overlapping safe/unsafe distributions satisfies TPR_n <= C_alpha * delta_n^beta via Hölder's inequality, forcing sum TPR_n < infinity. This impossibility is exponent-optimal (Theorem 3). A second independent proof via the NP counting method (Theorem 4) yields a 13% tighter bound without Hölder's inequality. Universal finite-horizon ceiling (Theorem 5): For any summable risk schedule, the exact maximum achievable classifier utility is U*(N, B) = N * TPR_NP(B/N), growing as exp(O(sqrt(log N))) -- subpolynomial. At N = 10^6 with budget B = 1.0, a classifier extracts at most U* ~ 87 versus a verifier's ~500,000. Verification escape (Theorem 2): A Lipschitz ball verifier achieves delta = 0 with TPR > 0, escaping the impossibility. Formal Lipschitz bounds for pre-LayerNorm transformers under LoRA enable LLM-scale verification. The separation is strict. We validate on GPT-2 (d_LoRA = 147,456): conditional delta = 0 with TPR = 0.352. Comprehensive empirical validation is in the companion paper [D2].
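As a rough illustration of the finite-horizon ceiling U*(N, B) = N * TPR_NP(B/N), the sketch below evaluates that formula under one specific assumption that the abstract does not state: gate scores are unit-variance Gaussians, with safe modifications shifted by a hypothetical `separation` of 1.0 relative to unsafe ones. For that model the Neyman-Pearson optimal true-positive rate at false-acceptance rate delta has the standard closed form Phi(Phi^{-1}(delta) + separation). The function names and the default separation are illustrative choices, not taken from the paper.

```python
from statistics import NormalDist

# Standard normal, used for both the CDF (Phi) and its inverse.
_STD = NormalDist()


def tpr_np(delta: float, separation: float = 1.0) -> float:
    """Neyman-Pearson optimal TPR at false-acceptance rate `delta`.

    ASSUMPTION (not from the abstract): unsafe scores ~ N(0, 1) and
    safe scores ~ N(separation, 1), so the optimal ROC curve is
    TPR = Phi(Phi^{-1}(delta) + separation).
    """
    if delta <= 0.0:
        return 0.0
    if delta >= 1.0:
        return 1.0
    return _STD.cdf(_STD.inv_cdf(delta) + separation)


def utility_ceiling(n_steps: int, budget: float, separation: float = 1.0) -> float:
    """Evaluate U*(N, B) = N * TPR_NP(B / N) from the abstract's Theorem 5,
    i.e. the risk budget B spread evenly over N steps."""
    return n_steps * tpr_np(budget / n_steps, separation)


if __name__ == "__main__":
    # The ceiling grows with N, but far more slowly than N itself.
    for n in (10**2, 10**4, 10**6):
        print(n, round(utility_ceiling(n, 1.0), 1))
```

Under this assumed Gaussian model, `utility_ceiling(10**6, 1.0)` lands in the same range as the abstract's quoted U* ~ 87, and increasing N by four orders of magnitude raises the ceiling by well under one order of magnitude, which is the subpolynomial growth the abstract describes. A different score model or separation would change the numbers.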

Comments: 27 pages, 6 figures. Companion empirical paper: https://doi.org/10.5281/zenodo.19237566

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Cite as: arXiv:2603.28650 [cs.LG] (or arXiv:2603.28650v2 [cs.LG] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.28650 (arXiv-issued DOI via DataCite)

Related DOI: https://doi.org/10.5281/zenodo.19237451

Submission history

From: Arsenios Scrivens
[v1] Mon, 30 Mar 2026 16:34:37 UTC (136 KB)
[v2] Thu, 2 Apr 2026 00:23:37 UTC (136 KB)
