Live
Black Hat USADark ReadingBlack Hat AsiaAI BusinessArtificial intelligence hype cycle risks collapse with market implications - MSNGoogle News: AIMeta paused its work with AI training startup Mercor after a data breachBusiness Insider[R], 31 MILLIONS High frequency data, Light GBM worked perfectlyReddit r/MachineLearningConsidering NeurIPS submission [D]Reddit r/MachineLearningAutomate Your Handyman Pricing: The True Hourly Cost AI ForgetsDev.to AIScience Is Not a Reading ProblemMedium AIHow Antigravity AI Changed My React Workflow (In Ways I Didn’t Expect)Medium AIToken Usage Is the New RAM UsageDev.to AIStop Writing Rules for AI AgentsDev.to AIUsing AI as your therapist?Medium AIDigital Marketing Trends and the Role of AI in Modern Business StrategiesMedium AIThe AI Pen: Collaborating With Artificial Intelligence Without Losing Your Unique VoiceMedium AIBlack Hat USADark ReadingBlack Hat AsiaAI BusinessArtificial intelligence hype cycle risks collapse with market implications - MSNGoogle News: AIMeta paused its work with AI training startup Mercor after a data breachBusiness Insider[R], 31 MILLIONS High frequency data, Light GBM worked perfectlyReddit r/MachineLearningConsidering NeurIPS submission [D]Reddit r/MachineLearningAutomate Your Handyman Pricing: The True Hourly Cost AI ForgetsDev.to AIScience Is Not a Reading ProblemMedium AIHow Antigravity AI Changed My React Workflow (In Ways I Didn’t Expect)Medium AIToken Usage Is the New RAM UsageDev.to AIStop Writing Rules for AI AgentsDev.to AIUsing AI as your therapist?Medium AIDigital Marketing Trends and the Role of AI in Modern Business StrategiesMedium AIThe AI Pen: Collaborating With Artificial Intelligence Without Losing Your Unique VoiceMedium AI
AI NEWS HUBbyEIGENVECTOREigenvector

ToolMisuseBench: An Offline Deterministic Benchmark for Tool Misuse and Recovery in Agentic Systems

arXiv cs.SEby [Submitted on 2 Apr 2026]April 3, 20261 min read1 views
Source Quiz

arXiv:2604.01508v1 Announce Type: new Abstract: Tool using agents often fail for operational reasons even when language understanding is strong. Common causes include invalid arguments, interface drift, weak recovery, and inefficient retry behavior. We introduce ToolMisuseBench, an offline deterministic benchmark for evaluating tool misuse and recovery under explicit step, call, and retry budgets. The benchmark covers CRUD, retrieval, file, and scheduling environments with replayable fault injection. It reports success, invalid call behavior, policy violations, recovery quality, and budgeted efficiency. We release a public dataset with 6800 tasks and a reproducible evaluation pipeline. Baseline results show fault specific recovery gains for schema aware methods, while overall success remai

View PDF HTML (experimental)

Abstract:Tool using agents often fail for operational reasons even when language understanding is strong. Common causes include invalid arguments, interface drift, weak recovery, and inefficient retry behavior. We introduce ToolMisuseBench, an offline deterministic benchmark for evaluating tool misuse and recovery under explicit step, call, and retry budgets. The benchmark covers CRUD, retrieval, file, and scheduling environments with replayable fault injection. It reports success, invalid call behavior, policy violations, recovery quality, and budgeted efficiency. We release a public dataset with 6800 tasks and a reproducible evaluation pipeline. Baseline results show fault specific recovery gains for schema aware methods, while overall success remains limited under the released authorization and hard failure settings.

Subjects:

Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Cite as: arXiv:2604.01508 [cs.SE]

(or arXiv:2604.01508v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2604.01508

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Rista Baral [view email] [v1] Thu, 2 Apr 2026 00:42:29 UTC (8 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

benchmarkreleaseannounce

Knowledge Map

Knowledge Map
TopicsEntitiesSource
ToolMisuseB…benchmarkreleaseannouncevaluationreportpolicyarXiv cs.SE

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 187 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Releases