Semantic Identity Compression: Zero-Error Laws, Rate-Distortion, and Neurosymbolic Necessity
arXiv:2601.14252v5 Announce Type: replace-cross
Abstract: Symbolic systems operate over precise identities: variables denote specific objects, pointers target precise memory locations, and database keys refer to singular records. Neural embeddings generalize by compressing away semantic detail, but this compression creates collision ambiguity: multiple distinct entities can share the same representation value. We characterize how much additional information must be supplied to recover precise identity from such representations. The answer is controlled by a single combinatorial object: the collision-fiber geometry of the representation map $\pi$. Let $A_{\pi}=\max_u |\pi^{-1}(u)|$ be the largest collision fiber. We prove a tight fixed-length converse $L \ge \log_2 A_{\pi}$, an exact finite-block scaling law, a pointwise adaptive budget $\lceil \log_2 |\pi^{-1}(u)|\rceil$, and an exact fiberwise rate-distortion law for arbitrary finite sources via recoverable-mass decomposition across representation fibers. The uniform single-block formula $D^\star(L)=\max(0,1-2^L/a)$ appears as a closed-form special case when all mass lies on one collision block, where $a = A_{\pi}$ is the collision block size. The same fiber geometry determines query complexity and canonical structure for distinguishing families. Because this residual ambiguity is structural rather than representation-specific, symbolic identity mechanisms (handles, keys, pointers, nominal tags) are the necessary system-level complement to any non-injective semantic representation. All main results are machine-checked in Lean 4.
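The quantities defined in the abstract can be illustrated concretely. Below is a minimal sketch, assuming a small hypothetical finite map $\pi$ given as a Python dict (the entities and values are illustrative, not from the paper): it computes the collision fibers $\pi^{-1}(u)$, the largest fiber $A_{\pi}$, the fixed-length converse $\lceil \log_2 A_{\pi} \rceil$, the pointwise adaptive budget $\lceil \log_2 |\pi^{-1}(u)| \rceil$, and the single-block distortion $D^\star(L)$.

```python
from math import ceil, log2
from collections import defaultdict

# Hypothetical finite representation map pi: entity -> embedding value.
pi = {"x1": "u", "x2": "u", "x3": "u", "x4": "v", "x5": "v", "x6": "w"}

# Collision fibers: preimages pi^{-1}(u), grouped by representation value.
fibers = defaultdict(list)
for entity, value in pi.items():
    fibers[value].append(entity)

# Largest collision fiber A_pi and the fixed-length converse L >= log2(A_pi).
A_pi = max(len(f) for f in fibers.values())
L_min = ceil(log2(A_pi))  # here A_pi = 3, so L_min = 2 bits

# Pointwise adaptive budget: ceil(log2 |pi^{-1}(u)|) bits suffice at value u.
budget = {u: ceil(log2(len(f))) for u, f in fibers.items()}

# Uniform single-block distortion D*(L) = max(0, 1 - 2^L / a), a = A_pi:
# with L bits one can disambiguate 2^L of the a entities in the block.
def D_star(L: int, a: int = A_pi) -> float:
    return max(0.0, 1.0 - (2 ** L) / a)
```

With this toy map, `D_star(0)` is 2/3 (no side information leaves two of three entities in the largest fiber unrecoverable), and `D_star(2)` is 0, matching the converse `L_min = 2`.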
Comments: 13 pages, 2 tables. Lean 4 artifact and supplementary material available at this https URL
Subjects:
Information Theory (cs.IT); Programming Languages (cs.PL)
MSC classes: 94A15, 94A24, 05B35
ACM classes: E.4; G.2.1
Cite as: arXiv:2601.14252 [cs.IT]
(or arXiv:2601.14252v5 [cs.IT] for this version)
https://doi.org/10.48550/arXiv.2601.14252
arXiv-issued DOI via DataCite
Submission history
From: Tristan Simas [view email] [v1] Tue, 20 Jan 2026 18:58:51 UTC (177 KB) [v2] Thu, 22 Jan 2026 01:11:26 UTC (177 KB) [v3] Fri, 20 Feb 2026 21:52:16 UTC (196 KB) [v4] Mon, 16 Mar 2026 23:06:17 UTC (373 KB) [v5] Tue, 31 Mar 2026 15:29:35 UTC (383 KB)