Live
Black Hat USAAI BusinessBlack Hat AsiaAI BusinessThe Missing Data Problem Behind Broken Computer-Use AgentsHackernoon AINVIDIA’s $2 billion sprinkler remaking the AI supply chain - Asia TimesGNews AI NVIDIAAmazon, Apple, and Nvidia can't make AI chips without this company. Here's why its growth stock could soar. - MSNGNews AI NVIDIAI am building a Notebook Environment for SQL Inside a Database ClientDEV CommunityA Production Readiness Checklist for Remote MCP ServersDEV CommunityNginx + PHP + MySQL Optimisations and Parameter CalculationsDEV CommunityDo You Actually Need an AI Gateway? (And When a Simple LLM Wrapper Isn’t Enough)DEV CommunityPowerShell Scripts Every MSP Should UseDEV CommunityThe way I see it — The development of autonomous vehicles is fraught with ethical concerns. And: The notion that the separatiDev.to AIFull-Stack E-Commerce App - Part 1: Project setupDEV CommunityThe Architect’s Reflection: The 5D MiddlewareMedium AII Am a Software Engineer Teaching Myself AI Engineering. Here Is Where I Am Starting.Medium AIBlack Hat USAAI BusinessBlack Hat AsiaAI BusinessThe Missing Data Problem Behind Broken Computer-Use AgentsHackernoon AINVIDIA’s $2 billion sprinkler remaking the AI supply chain - Asia TimesGNews AI NVIDIAAmazon, Apple, and Nvidia can't make AI chips without this company. Here's why its growth stock could soar. - MSNGNews AI NVIDIAI am building a Notebook Environment for SQL Inside a Database ClientDEV CommunityA Production Readiness Checklist for Remote MCP ServersDEV CommunityNginx + PHP + MySQL Optimisations and Parameter CalculationsDEV CommunityDo You Actually Need an AI Gateway? (And When a Simple LLM Wrapper Isn’t Enough)DEV CommunityPowerShell Scripts Every MSP Should UseDEV CommunityThe way I see it — The development of autonomous vehicles is fraught with ethical concerns. And: The notion that the separatiDev.to AIFull-Stack E-Commerce App - Part 1: Project setupDEV CommunityThe Architect’s Reflection: The 5D MiddlewareMedium AII Am a Software Engineer Teaching Myself AI Engineering. Here Is Where I Am Starting.Medium AI
AI NEWS HUBbyEIGENVECTOREigenvector

Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

ArXiv CS.AIby Aaditya Khanal, Yangyang Tao, Junxiu ZhouApril 1, 20262 min read0 views
Source Quiz

arXiv:2603.29231v1 Announce Type: new Abstract: Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We evaluate 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains. Key findings: (1) reliability decay is domain-st

View PDF HTML (experimental)

Abstract:Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We evaluate 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains. Key findings: (1) reliability decay is domain-stratified -- SE GDS drops from 0.90 to 0.44 while document processing is nearly flat (0.74 to 0.71); (2) VAF bifurcates by capability tier -- high VAF is a capability signature, not an instability signal; (3) capability and reliability rankings diverge substantially, with multi-rank inversions at long horizons; (4) frontier models have the highest meltdown rates (up to 19%) because they attempt ambitious multi-step strategies that sometimes spiral; and (5) memory scaffolds universally hurt long-horizon performance across all 10 models. These results motivate reliability as a first-class evaluation dimension alongside capability.

Comments: 23 pages, 4 figures

Subjects:

Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.29231 [cs.AI]

(or arXiv:2603.29231v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.29231

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Aaditya Khanal [view email] [v1] Tue, 31 Mar 2026 03:56:39 UTC (490 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Beyond pass…modelbenchmarkannounceversionproductvaluationArXiv CS.AI

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 140 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Models