Live
Black Hat USADark ReadingBlack Hat AsiaAI Business1 Artificial Intelligence (AI) Software Stock to Buy Hand Over Fist Before It Soars 62%, According to Dan Ives - The Motley FoolGoogle News: AIGroup Pushing Age Verification Requirements for AI Turns Out to Be Sneakily Backed by OpenAIGizmodoGroup Pushing Age Verification Requirements for AI Turns Out to Be Sneakily Backed by OpenAI - GizmodoGoogle News: OpenAIAnthropic Executive Sees Cowork Agent as Bigger Than Claude CodeBloomberg TechnologyABAP OOP Design Patterns — Part 2: Factory, Observer, and Decorator Patterns in Real SAP SystemsDEV CommunityWhy Your AI Agent Health Check Is Lying to YouDEV CommunityDeep Dive: Array Internals & Memory LayoutDEV CommunityIllinois Tech computer science researcher honored by IEEE Chicago Section - EurekAlert!Google News: Machine LearningWhy AI Agents Need Both Memory and MoneyDEV CommunityMarch 2026: LangChain NewsletterLangChain BlogIntuit's AI agents hit 85% repeat usage. The secret was keeping humans involvedVentureBeat AIThe reputation of troubled YC startup Delve has gotten even worseTechCrunchBlack Hat USADark ReadingBlack Hat AsiaAI Business1 Artificial Intelligence (AI) Software Stock to Buy Hand Over Fist Before It Soars 62%, According to Dan Ives - The Motley FoolGoogle News: AIGroup Pushing Age Verification Requirements for AI Turns Out to Be Sneakily Backed by OpenAIGizmodoGroup Pushing Age Verification Requirements for AI Turns Out to Be Sneakily Backed by OpenAI - GizmodoGoogle News: OpenAIAnthropic Executive Sees Cowork Agent as Bigger Than Claude CodeBloomberg TechnologyABAP OOP Design Patterns — Part 2: Factory, Observer, and Decorator Patterns in Real SAP SystemsDEV CommunityWhy Your AI Agent Health Check Is Lying to YouDEV CommunityDeep Dive: Array Internals & Memory LayoutDEV CommunityIllinois Tech computer science researcher honored by IEEE Chicago Section - EurekAlert!Google News: Machine LearningWhy AI Agents Need Both Memory and MoneyDEV CommunityMarch 2026: LangChain NewsletterLangChain BlogIntuit's AI agents hit 85% repeat usage. The secret was keeping humans involvedVentureBeat AIThe reputation of troubled YC startup Delve has gotten even worseTechCrunch

Arithmetic OOD Failure Unfolds in Stages in Minimal GPTs

arXivby [Submitted on 27 Mar 2026]March 31, 20262 min read1 views
Source Quiz

arXiv:2603.26828v1 Announce Type: cross Abstract: Arithmetic benchmarks are often reduced to a single held-out score, but that score can conflate qualitatively different failures. We study a controlled minimal GPT trained on exhaustive 2-digit addition, where all local digit transitions are already present in training, and ask why 3-digit generalization still fails. The failure is staged. First, there is a layout barrier: a learned absolute-position model collapses under a pure 3-digit layout shift, and mixed-layout exposure is the only intervention that materially weakens this barrier. Second — Seine A. Shintani

View PDF HTML (experimental)

Abstract:Arithmetic benchmarks are often reduced to a single held-out score, but that score can conflate qualitatively different failures. We study a controlled minimal GPT trained on exhaustive 2-digit addition, where all local digit transitions are already present in training, and ask why 3-digit generalization still fails. The failure is staged. First, there is a layout barrier: a learned absolute-position model collapses under a pure 3-digit layout shift, and mixed-layout exposure is the only intervention that materially weakens this barrier. Second, after layout repair, the hundreds position behaves like a carry flag rather than a semantic hundreds digit; targeted carry probes reverse the relevant logit margin, whereas a matched extra-data control does not. Third, after carry repair, the main remaining bottleneck is conditional recomposition: high-conditioned tail data outperforms a matched control, high-only data, and tail-only data on all true-3-digit suites, and the same ordering reappears in a larger 2-layer bridge experiment. The residual errors after recomposition are then overwhelmingly tens-only, and a separate 10-seed late-stage study shows that a sign-aware tens repair raises exact match on the hardest thousands-carry suite from 0.664 to 0.822. We therefore provide an experimentally testable decomposition of arithmetic OOD failure into layout, carry-semantics, recomposition, and late tens-residual stages.

Comments: 16 pages, 4 figures

Subjects:

Computation and Language (cs.CL); Machine Learning (cs.LG)

ACM classes: I.2.7

Cite as: arXiv:2603.26828 [cs.CL]

(or arXiv:2603.26828v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.26828

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Seine A. Shintani [view email] [v1] Fri, 27 Mar 2026 03:40:27 UTC (1,038 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by AI News Hub · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Arithmetic …researchpaperarxivmachine-lea…deep-learni…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 171 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Research Papers