ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems
Abstract: Tool-using agents often fail for operational reasons even when language understanding is strong. Common causes include invalid arguments, interface drift, weak recovery, and inefficient retry behavior. We introduce ToolMisuseBench, an offline deterministic benchmark for evaluating tool misuse and recovery under explicit step, call, and retry budgets. The benchmark covers CRUD, retrieval, file, and scheduling environments with replayable fault injection. It reports success, invalid-call behavior, policy violations, recovery quality, and budgeted efficiency. We release a public dataset with 6,800 tasks and a reproducible evaluation pipeline. Baseline results show fault-specific recovery gains for schema-aware methods, while overall success remains limited under the released authorization and hard-failure settings.
Subjects:
Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as: arXiv:2604.01508 [cs.SE]
(or arXiv:2604.01508v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2604.01508
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Rista Baral [v1] Thu, 2 Apr 2026 00:42:29 UTC (8 KB)
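Since only the abstract is excerpted here, the following is a rough Python sketch of what "replayable fault injection under explicit step, call, and retry budgets" could look like in a harness of this kind. Every name in it (Budgets, ReplayableToolEnv, run_episode, read_record) is hypothetical and not taken from ToolMisuseBench; the key idea being illustrated is that a fixed random seed makes the fault schedule identical across runs, so episodes can be evaluated deterministically offline.

import random
from dataclasses import dataclass

@dataclass
class Budgets:
    max_steps: int = 20    # agent turns allowed per task
    max_calls: int = 10    # total tool invocations allowed
    max_retries: int = 3   # retries allowed per failing call

@dataclass
class EpisodeStats:
    success: bool = False
    invalid_calls: int = 0
    faults_injected: int = 0
    recoveries: int = 0

class ReplayableToolEnv:
    """Deterministic tool environment: a fixed seed yields the same
    fault schedule on every run, making episodes replayable offline."""

    def __init__(self, seed: int, fault_rate: float = 0.3):
        self.rng = random.Random(seed)  # fixed seed => deterministic faults
        self.fault_rate = fault_rate
        self.stats = EpisodeStats()

    def call(self, tool: str, args: dict) -> dict:
        # A schema violation counts as an invalid call, not an injected fault.
        if "id" not in args:
            self.stats.invalid_calls += 1
            return {"ok": False, "error": "invalid_arguments"}
        # Seeded fault injection: identical across replays of the same seed.
        if self.rng.random() < self.fault_rate:
            self.stats.faults_injected += 1
            return {"ok": False, "error": "transient_failure"}
        return {"ok": True, "result": f"{tool}({args['id']})"}

def run_episode(env: ReplayableToolEnv, budgets: Budgets) -> EpisodeStats:
    calls = 0
    for step in range(budgets.max_steps):
        if calls >= budgets.max_calls:
            break
        succeeded = False
        for attempt in range(1 + budgets.max_retries):
            if calls >= budgets.max_calls:
                break
            resp = env.call("read_record", {"id": step})
            calls += 1
            if resp["ok"]:
                succeeded = True
                if attempt > 0:
                    env.stats.recoveries += 1  # recovered after a fault
                break
        if succeeded:
            env.stats.success = True  # toy criterion: one successful call
            break
    return env.stats

if __name__ == "__main__":
    print(run_episode(ReplayableToolEnv(seed=42), Budgets()))
    # Same seed, same output: the episode is exactly reproducible.

The design choice worth noting is the separation of invalid calls (the agent's fault, e.g. a schema violation) from injected faults (the environment's doing), which is what lets a benchmark report invalid-call behavior and recovery quality as distinct metrics.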
