Your AI Agent Stopped Responding 2 Hours Ago. Nobody Noticed.
Your agent is deployed. Pod is running. Container passes liveness probes. Grafana shows a flat green line. Everything looks fine.
Except the agent stopped processing work 2 hours ago. It's alive - the process is there - but it's stuck. Deadlocked on a thread. Blocked on a full queue. Spinning in a retry loop that will never succeed. Silently swallowing exceptions in a while True.
Nobody knows until a customer reports it. Or until someone opens a dashboard at 5 PM and wonders why the task queue has been growing all afternoon.
Why Container Health Checks Don't Work for Agents
Kubernetes liveness probes check one thing: is the process responding to HTTP? If your agent serves a /healthz endpoint, the probe passes. The agent is "healthy."
But responding to /healthz and processing work are two different things. An agent can:
- Deadlock on an internal lock while still serving HTTP
- OOM-kill its worker thread while the main thread stays alive
- Enter an infinite retry loop on a broken downstream API
- Silently drop into an except: pass branch and stop doing anything
The process is running. The container is green. The agent is useless.
Container health check: "Is the process alive?" YES
What you actually need: "Is the agent doing work?" NO
This gap exists because container orchestration was designed for stateless web servers, not for long-running agents that hold state, maintain connections, and process work asynchronously.
The Heartbeat Pattern
The fix is old. Web services solved this 15 years ago with heartbeat monitoring. The idea is simple: the agent periodically reports "I am alive and working." If the report stops, something is wrong.
The difference between a health check and a heartbeat: health checks are passive (something pings you), heartbeats are active (you report out). A stuck agent can't respond to pings, but a stuck agent also can't send heartbeats. That's the point.
But building heartbeat infrastructure for agents means:
```python
# 1. Heartbeat sender (added to every agent)
import threading
import time

import requests

def heartbeat_loop(agent_id, interval=30):
    while True:
        try:
            requests.post(
                "https://monitoring.internal/heartbeat",
                json={"agent_id": agent_id, "ts": time.time()},
                timeout=5,
            )
        except Exception:
            pass
        time.sleep(interval)

threading.Thread(target=heartbeat_loop, args=("my-agent",), daemon=True).start()

# 2. Heartbeat checker (separate cron process)
# 3. Redis/Postgres for heartbeat storage
# 4. Alerting rules (Slack, PagerDuty)
# 5. Dashboard showing last-seen times
# 6. Logic to distinguish "stopped intentionally" from "crashed"
# 7. Cleanup for deregistered agents
```
That's a monitoring system. For each agent framework you use, for each deployment environment, maintained forever.
One Line Instead
```python
import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))
client.mesh.start_heartbeat()
```
That's it. A daemon thread wakes up every 30 seconds, sends a heartbeat to the platform, and goes back to sleep. When the agent stops - crash, deadlock, OOM, network partition - the heartbeats stop. The platform notices.
No Redis. No cron. No Prometheus. No webhook integrations. No alerting rules to maintain.
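A heartbeat sender needs very little machinery. Here is a minimal sketch of the pattern itself — not AXME's actual internals; the `send` callback stands in for whatever delivers the beat to the platform:

```python
import threading
import time

def start_heartbeat(send, interval=30):
    """Spawn a daemon thread that calls send(timestamp) every `interval` seconds.

    `send` is whatever delivers the beat (an HTTP POST in practice). Failures
    are swallowed on purpose: a missed beat IS the signal, and the sender must
    never crash the agent it is watching.
    """
    def loop():
        while True:
            try:
                send(time.time())
            except Exception:
                pass  # network blips are expected; the platform sees the gap
            time.sleep(interval)

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

Because the thread is a daemon, it dies with the process — which is exactly the behavior you want: no process, no beats.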
How Health Is Computed
The platform tracks the timestamp of each heartbeat and computes health automatically:
| Time Since Last Heartbeat | Status | What It Means |
| --- | --- | --- |
| < 90 seconds | healthy | Agent is alive and reporting |
| 90 - 300 seconds | degraded | Agent may be stuck or overloaded |
| > 300 seconds | unreachable | Agent is down or not reporting |
| Manual kill | killed | Operator explicitly blocked this agent |
The thresholds are designed around the 30-second default interval. A healthy agent with interval_seconds=30 sends a heartbeat every 30 seconds. If the platform hasn't heard from it in 90 seconds (3 missed heartbeats), something is probably wrong. If 5 minutes pass, it's gone.
The degraded state is the useful one. It's the early warning. The agent isn't dead yet, but it's missed a couple of beats. Maybe the event loop is under load. Maybe a GC pause ate 45 seconds. Maybe the network is flaky. You have a window to investigate before the agent goes fully unreachable.
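The mapping from last-seen timestamp to status is simple enough to sketch. This is an illustrative reimplementation of the thresholds in the table above, not AXME's server-side code:

```python
import time

def compute_status(last_seen_ts, killed=False, now=None):
    """Map a last-seen heartbeat timestamp to a health status.

    Thresholds assume the default 30-second interval: 90 seconds is
    three missed beats, 300 seconds means the agent is effectively gone.
    """
    if killed:
        return "killed"  # a manual kill overrides whatever the heartbeat says
    age = (now if now is not None else time.time()) - last_seen_ts
    if age < 90:
        return "healthy"
    if age < 300:
        return "degraded"  # early warning: stuck, overloaded, or flaky network
    return "unreachable"
```

Note that the check is pure bookkeeping on the server side: the agent never has to answer a probe, it only has to keep talking.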
What Happens When an Agent Goes Down
Here's the timeline with heartbeat monitoring:
```
00:00  Agent starts. Heartbeat begins.
00:30  Heartbeat sent. Status: healthy.
01:00  Heartbeat sent. Status: healthy.
01:15  Agent deadlocks on a database connection pool.
01:30  No heartbeat. (Agent is stuck, can't send.)
02:00  No heartbeat for 90s. Status: healthy -> degraded.
02:00  Platform logs state transition.
05:15  No heartbeat for 300s. Status: degraded -> unreachable.
05:15  Platform blocks new intent delivery to this agent.
```
Without heartbeat monitoring:
```
00:00  Agent starts.
01:15  Agent deadlocks.
  ...
  ...
03:15  Someone notices the task queue growing.
03:30  Engineer SSHs in. "The process is running."
03:45  "The container is green. Logs look... wait, no new logs since 1:15."
04:00  Engineer restarts the agent.
```
The difference: 2 minutes vs 2.75 hours. And the first scenario is automatic - no human needs to notice anything.
Heartbeat with Metrics
The heartbeat isn't just a ping. It can carry operational metrics, flushed automatically with each beat:
```python
client.mesh.start_heartbeat(include_metrics=True)

# As the agent processes work, report metrics
client.mesh.report_metric(success=True, latency_ms=234.5, cost_usd=0.003)
client.mesh.report_metric(success=False, latency_ms=5012.0)

# Metrics are buffered in memory and sent with the next heartbeat
# No separate metrics pipeline needed
```
Every 30 seconds, the heartbeat sends both "I'm alive" and "here's how I'm doing" - success rate, average latency, cost accumulation. The platform aggregates per agent and exposes it through the CLI and dashboard.
This turns the heartbeat from a binary alive/dead signal into a continuous health signal. An agent that's alive but processing tasks at 20x normal latency shows up before it becomes a problem.
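One way to picture the buffering: metrics accumulate in memory and get rolled into a summary when the next beat fires. A hypothetical sketch of that accumulator — AXME's actual wire format and aggregation are not shown here:

```python
class MetricsBuffer:
    """In-memory metrics buffer flushed with each heartbeat (illustrative)."""

    def __init__(self):
        self._events = []

    def report(self, success, latency_ms, cost_usd=0.0):
        """Record one unit of work; cheap enough to call on every task."""
        self._events.append((success, latency_ms, cost_usd))

    def flush(self):
        """Aggregate and clear; the summary rides along with the next beat."""
        if not self._events:
            return None
        n = len(self._events)
        summary = {
            "count": n,
            "success_rate": sum(1 for s, _, _ in self._events if s) / n,
            "avg_latency_ms": sum(l for _, l, _ in self._events) / n,
            "cost_usd": sum(c for _, _, c in self._events),
        }
        self._events.clear()
        return summary
```

The design point is piggybacking: since a beat goes out every 30 seconds anyway, the metrics get a free transport with no extra pipeline.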
Kill and Resume
Sometimes an agent needs to be stopped. Not crashed - intentionally blocked. Maybe it's misbehaving. Maybe you're doing maintenance. Maybe it's burning through your API budget.
```python
# From code (address_id from list_agents)
client.mesh.kill("addr_abc123")
```
A killed agent enters the killed state. Even if its heartbeat thread is still running, the gateway keeps it killed. No intents are delivered. It stays killed until explicitly resumed:
```python
client.mesh.resume("addr_abc123")
```
Or kill/resume from the dashboard at mesh.axme.ai with one click.
This is different from the agent crashing. A crash leads to unreachable. A kill is deliberate. The distinction matters for alerting - you don't want to page on-call for an agent you intentionally stopped.
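The gateway-side bookkeeping that makes a kill "sticky" can be sketched like this — an illustrative model, not the real gateway:

```python
class AgentRecord:
    """Sketch of gateway state: a manual kill holds even while beats arrive."""

    UNREACHABLE_AFTER = 300  # seconds without a heartbeat

    def __init__(self):
        self.killed = False
        self.last_seen = 0.0

    def heartbeat(self, ts):
        self.last_seen = ts  # recorded, but does not clear a kill

    def kill(self):
        self.killed = True   # deliberate block: no intents, no on-call page

    def resume(self):
        self.killed = False

    def deliverable(self, now):
        """Should new intents be routed to this agent right now?"""
        if self.killed:
            return False
        return now - self.last_seen < self.UNREACHABLE_AFTER
```

Keeping `killed` as a separate flag, rather than faking a dead heartbeat, is what lets alerting tell an operator action apart from a crash.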
Fleet Visibility
When you have 20 agents across 4 machines, the dashboard matters more than any individual heartbeat.
The AXME Mesh Dashboard at mesh.axme.ai shows complete fleet health in real time:
Open it with:
```
axme mesh dashboard

report-generator    killed (manual)

Summary: 2 healthy, 1 degraded, 1 unreachable, 1 killed
```
One command. Complete fleet health. No SSH. No Grafana. No log aggregation pipeline.
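The summary line is just a count of per-agent statuses. A sketch of how a line like the one above could be produced, assuming the platform hands back a status string per agent:

```python
from collections import Counter

def fleet_summary(statuses):
    """Roll per-agent statuses into a one-line fleet summary."""
    counts = Counter(statuses)
    order = ("healthy", "degraded", "unreachable", "killed")
    return ", ".join(f"{counts.get(s, 0)} {s}" for s in order)
```

Fixing the display order (healthy first, killed last) keeps the line scannable: the numbers you care about at 2 AM are always in the same place.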
The Real Cost of Silent Failures
Every team running agents at scale has the same story. An agent went down on Friday afternoon. Nobody noticed until Monday morning. 60 hours of missed processing. Customer complaints. Backlog that took another 8 hours to clear.
The fix isn't complicated. It's one function call. The hard part is remembering that containers passing health checks is not the same as agents doing work.
```python
client.mesh.start_heartbeat()
```
That's the whole fix.
Try It
Working example - start an agent with heartbeat, kill the process, watch the status transition from healthy to degraded to unreachable:
github.com/AxmeAI/ai-agent-heartbeat-monitoring
Built with AXME - heartbeat, health detection, and fleet monitoring for AI agents. Alpha - feedback welcome.