
AgentFixer: From Failure Detection to Fix Recommendations in LLM Agentic Systems

arXiv cs.AI — [Submitted on 18 Feb 2026]



Abstract: We introduce a comprehensive validation framework for LLM-based agentic systems that provides systematic diagnosis and improvement of reliability failures. The framework includes fifteen failure-detection tools and two root-cause analysis modules that jointly uncover weaknesses across input handling, prompt design, and output generation. It integrates lightweight rule-based checks with LLM-as-a-judge assessments to support structured incident detection, classification, and repair. We applied the framework to IBM CUGA, evaluating its performance on the AppWorld and WebArena benchmarks. The analysis revealed recurrent planner misalignments, schema violations, brittle prompt dependencies, and more. Based on these insights, we refined both prompting and coding strategies, maintaining CUGA's benchmark results while enabling mid-sized models such as Llama 4 and Mistral Medium to achieve notable accuracy gains, substantially narrowing the gap with frontier models. Beyond quantitative validation, we conducted an exploratory study that fed the framework's diagnostic outputs and agent description into an LLM for self-reflection and prioritization. This interactive analysis produced actionable insights on recurring failure patterns and focus areas for improvement, demonstrating how validation itself can evolve into an agentic, dialogue-driven process. These results show a path toward scalable quality assurance and adaptive validation in production agentic systems, offering a foundation for more robust, interpretable, and self-improving agentic architectures.
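The abstract's pairing of lightweight rule-based checks with LLM-as-a-judge assessments can be illustrated with a minimal sketch. Everything below is hypothetical: the names (`Incident`, `schema_check`, `llm_judge`, `validate`) and the check logic are illustrative inventions, not the paper's actual tool interfaces, which are not reproduced here. The judge is injected as a callable so the sketch runs without any model API.

```python
# Hypothetical sketch: cheap rule-based checks run first; an LLM-as-a-judge
# assessment is consulted only when the rules pass. Incident records carry
# the detector name and a failure category (e.g. "schema_violation"),
# mirroring the structured incident detection/classification the abstract
# describes at a very high level.
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Incident:
    tool: str      # which detector fired
    category: str  # e.g. "schema_violation", "task_mismatch"
    detail: str


def schema_check(raw_output: str) -> List[Incident]:
    """Rule-based check: agent output must be JSON with an 'action' field."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError as e:
        return [Incident("schema_check", "schema_violation", f"not JSON: {e}")]
    if "action" not in obj:
        return [Incident("schema_check", "schema_violation",
                         "missing 'action' field")]
    return []


def llm_judge(task: str, raw_output: str,
              ask_llm: Callable[[str], str]) -> List[Incident]:
    """LLM-as-a-judge: ask a model whether the output addresses the task.

    `ask_llm` is injected so the sketch stays runnable without an API key;
    in practice it would wrap a chat-completion call.
    """
    verdict = ask_llm(
        f"Task: {task}\nAgent output: {raw_output}\n"
        "Answer PASS if the output addresses the task, else FAIL: <reason>."
    )
    if not verdict.strip().upper().startswith("PASS"):
        return [Incident("llm_judge", "task_mismatch", verdict.strip())]
    return []


def validate(task: str, raw_output: str,
             ask_llm: Callable[[str], str]) -> List[Incident]:
    """Run cheap rule checks first; only consult the judge if they pass."""
    incidents = schema_check(raw_output)
    if incidents:
        return incidents
    return llm_judge(task, raw_output, ask_llm)
```

Ordering the checks this way keeps the expensive judge call off the hot path: a malformed output is classified by the rules alone, and the LLM verdict is only needed for outputs that are structurally valid but possibly off-task.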

Subjects:

Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Cite as: arXiv:2603.29848 [cs.AI]

(or arXiv:2603.29848v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.29848

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Sergey Zeltyn [v1] Wed, 18 Feb 2026 14:55:35 UTC (744 KB)
