Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
arXiv:2603.29231v1 Announce Type: new
Abstract: Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We evaluate 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains. Key findings: (1) reliability decay is domain-stratified -- SE GDS drops from 0.90 to 0.44 while document processing is nearly flat (0.74 to 0.71); (2) VAF bifurcates by capability tier -- high VAF is a capability signature, not an instability signal; (3) capability and reliability rankings diverge substantially, with multi-rank inversions at long horizons; (4) frontier models have the highest meltdown rates (up to 19%) because they attempt ambitious multi-step strategies that sometimes spiral; and (5) memory scaffolds universally hurt long-horizon performance across all 10 models. These results motivate reliability as a first-class evaluation dimension alongside capability.
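The capability-vs-reliability gap the abstract describes can be made concrete with a small sketch. The following is an illustrative estimator, not the paper's exact metric definitions: per-task success probability across repeated attempts gives a capability-style score (pass@1), while the probability that all k attempts succeed gives a reliability-style score, aggregated per duration bucket. The function name, record format, and the use of p^k as a reliability proxy are all assumptions for illustration.

```python
from collections import defaultdict

def reliability_metrics(episodes, k=5):
    """Estimate pass@1 (capability) and pass^k (reliability) per duration bucket.

    episodes: iterable of (duration_bucket, task_id, success_bool) records,
    with repeated attempts per task. Hypothetical schema; the paper's RDC,
    VAF, GDS, and MOP metrics are not reproduced here.
    """
    by_task = defaultdict(list)
    for bucket, task, ok in episodes:
        by_task[(bucket, task)].append(ok)

    buckets = defaultdict(lambda: {"pass1": [], "passk": []})
    for (bucket, _task), results in by_task.items():
        p = sum(results) / len(results)          # empirical per-task success rate
        buckets[bucket]["pass1"].append(p)       # expected single-attempt success
        buckets[bucket]["passk"].append(p ** k)  # all k independent attempts succeed

    return {
        b: {m: sum(v) / len(v) for m, v in d.items()}
        for b, d in sorted(buckets.items())
    }

# Example: a task that succeeds on 4 of 5 attempts looks capable
# (pass@1 = 0.8) but unreliable over 5 repeats (0.8^5 ~= 0.33).
demo = [("long", "t1", ok) for ok in [True, True, True, True, False]]
print(reliability_metrics(demo, k=5))
```

Under this toy model, pass@1 stays flat while the reliability score collapses as per-attempt success dips below 1.0, which is exactly the divergence the framework is built to surface.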
Comments: 23 pages, 4 figures
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.29231 [cs.AI]
(or arXiv:2603.29231v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.29231
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Aaditya Khanal [view email] [v1] Tue, 31 Mar 2026 03:56:39 UTC (490 KB)