Near-Miss: Latent Policy Failure Detection in Agentic Workflows
arXiv:2603.29665v1 Announce Type: new Abstract: Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as $\textit{near-misses}$ or $\textit{latent failures}$. In this work, we introduce a novel metric for detecting latent policy failures in agent conversations traces. Building on the ToolGuard framework, which converts natural
View PDF HTML (experimental)
Abstract:Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as $\textit{near-misses}$ or $\textit{latent failures}$. In this work, we introduce a novel metric for detecting latent policy failures in agent conversations traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether agent's tool-calling decisions where sufficiently informed. We evaluate our approach on the $\tau^2$-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8-17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.
Subjects:
Computation and Language (cs.CL)
Cite as: arXiv:2603.29665 [cs.CL]
(or arXiv:2603.29665v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.29665
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Ella Rabinovich [view email] [v1] Tue, 31 Mar 2026 12:26:35 UTC (5,176 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
benchmarkannounceupdate
Smallest.ai launches Lightning V3, a new text-to-speech model that beats OpenAI, Cartesia, and ElevenLabs on key voice quality benchmarks - Editorji
Smallest.ai launches Lightning V3, a new text-to-speech model that beats OpenAI, Cartesia, and ElevenLabs on key voice quality benchmarks Editorji

Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Models

Smallest.ai launches Lightning V3, a new text-to-speech model that beats OpenAI, Cartesia, and ElevenLabs on key voice quality benchmarks - Editorji
Smallest.ai launches Lightning V3, a new text-to-speech model that beats OpenAI, Cartesia, and ElevenLabs on key voice quality benchmarks Editorji




Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!