
CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments

arXiv · March 31, 2026

arXiv:2603.28569v1 Announce Type: cross

Authors: Yi Yu, Guangquan Hu, Chenghuang Shen, Xingyan Liu, Jing Gu, Hangyi Sun, Junzhuo Ma, Weiting Liu, Jianfeng Liu, Mingyue Pu, Yu Wang, Zhengdong Xiao, Rui Xie, Longjiu Luo, Qianrong Wang, Gurong Cui, Honglin Qiao, Wenlian Lu


Abstract: The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs, and they often ignore the resolution efficiency essential for real-world deployment. To bridge this gap, we introduce CirrusBench, a novel evaluation framework distinguished by its foundation in real-world data from authentic cloud service tickets. CirrusBench preserves the intricate multi-turn logical chains and realistic tool dependencies inherent to technical service environments. Moving beyond execution correctness, we introduce novel Customer-Centric metrics to define agent success, quantifying service quality through metrics such as the Normalized Efficiency Index and Multi-Turn Latency to explicitly measure resolution efficiency. Experiments utilizing our framework reveal that while state-of-the-art models demonstrate strong reasoning capabilities, they frequently struggle in complex, realistic multi-turn tasks and fail to meet the high-efficiency standards required for customer service, highlighting critical directions for the future development of LLM-based agents in practical technical service applications. The CirrusBench evaluation framework is released at: this https URL

Comments: Submitted for SIGKDD 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Performance (cs.PF)

Cite as: arXiv:2603.28569 [cs.LG]

(or arXiv:2603.28569v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.28569

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yu Yi
[v1] Mon, 30 Mar 2026 15:26:00 UTC (1,216 KB)

Original source

arXiv

https://arxiv.org/abs/2603.28569
