CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments

arXiv, March 31, 2026

Authors: Yi Yu, Guangquan Hu, Chenghuang Shen, Xingyan Liu, Jing Gu, Hangyi Sun, Junzhuo Ma, Weiting Liu, Jianfeng Liu, Mingyue Pu, Yu Wang, Zhengdong Xiao, Rui Xie, Longjiu Luo, Qianrong Wang, Gurong Cui, Honglin Qiao, Wenlian Lu

Abstract: The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs, often ignoring the resolution efficiency essential for real-world deployment. To bridge this gap, we introduce CirrusBench, a novel evaluation framework distinguished by its foundation in real-world data from authentic cloud service tickets. CirrusBench preserves the intricate multi-turn logical chains and realistic tool dependencies inherent to technical service environments. Moving beyond execution correctness, we introduce novel Customer-Centric metrics to define agent success, quantifying service quality through metrics such as the Normalized Efficiency Index and Multi-Turn Latency to explicitly measure resolution efficiency. Experiments utilizing our framework reveal that while state-of-the-art models demonstrate strong reasoning capabilities, they frequently struggle in complex, realistic multi-turn tasks and fail to meet the high-efficiency standards required for customer service, highlighting critical directions for the future development of LLM-based agents in practical technical service applications. The CirrusBench evaluation framework is released at: this https URL

Comments: Submitted for SIGKDD 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Performance (cs.PF)

Cite as: arXiv:2603.28569 [cs.LG] (or arXiv:2603.28569v1 [cs.LG] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.28569 (arXiv-issued DOI via DataCite, pending registration)

Submission history

From: Yu Yi
[v1] Mon, 30 Mar 2026 15:26:00 UTC (1,216 KB)
