Your AI Agent Stopped Responding 2 Hours Ago. Nobody Noticed.
Your agent is deployed. Pod is running. Container passes liveness probes. Grafana shows a flat green line. Everything looks fine.
Except the agent stopped processing work 2 hours ago. It's alive - the process is there - but it's stuck. Deadlocked on a thread. Blocked on a full queue. Spinning in a retry loop that will never succeed. Silently swallowing exceptions in a while True.
Nobody knows until a customer reports it. Or until someone opens a dashboard at 5 PM and wonders why the task queue has been growing all afternoon.
Why Container Health Checks Don't Work for Agents
Kubernetes liveness probes check one thing: is the process responding to HTTP? If your agent serves a /healthz endpoint, the probe passes. The agent is "healthy."
But responding to /healthz and processing work are two different things. An agent can:
- Deadlock on an internal lock while still serving HTTP
- OOM-kill its worker thread while the main thread stays alive
- Enter an infinite retry loop on a broken downstream API
- Silently drop into an except: pass branch and stop doing anything
The process is running. The container is green. The agent is useless.
Container health check: "Is the process alive?" YES
What you actually need: "Is the agent doing work?" NO
This gap exists because container orchestration was designed for stateless web servers, not for long-running agents that hold state, maintain connections, and process work asynchronously.
The Heartbeat Pattern
The fix is old. Web services solved this 15 years ago with heartbeat monitoring. The idea is simple: the agent periodically reports "I am alive and working." If the report stops, something is wrong.
The difference between a health check and a heartbeat: health checks are passive (something pings you), heartbeats are active (you report out). A stuck agent can't respond to pings, but a stuck agent also can't send heartbeats. That's the point.
But building heartbeat infrastructure for agents means:
```python
# 1. Heartbeat sender (added to every agent)
import threading
import time

import requests

def heartbeat_loop(agent_id, interval=30):
    while True:
        try:
            requests.post(
                "https://monitoring.internal/heartbeat",
                json={"agent_id": agent_id, "ts": time.time()},
                timeout=5,
            )
        except Exception:
            pass
        time.sleep(interval)

threading.Thread(target=heartbeat_loop, args=("my-agent",), daemon=True).start()

# 2. Heartbeat checker (separate cron process)
# 3. Redis/Postgres for heartbeat storage
# 4. Alerting rules (Slack, PagerDuty)
# 5. Dashboard showing last-seen times
# 6. Logic to distinguish "stopped intentionally" from "crashed"
# 7. Cleanup for deregistered agents
```
That's a monitoring system. For each agent framework you use, for each deployment environment, maintained forever.
One Line Instead
```python
import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))
client.mesh.start_heartbeat()
```
That's it. A daemon thread wakes up every 30 seconds, sends a heartbeat to the platform, and goes back to sleep. When the agent stops - crash, deadlock, OOM, network partition - the heartbeats stop. The platform notices.
No Redis. No cron. No Prometheus. No webhook integrations. No alerting rules to maintain.
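A heartbeat sender needs very little machinery. Here is a minimal sketch of the pattern itself — not AXME's actual internals; the `send` callback stands in for whatever delivers the beat to the platform:

```python
import threading
import time

def start_heartbeat(send, interval=30):
    """Spawn a daemon thread that calls send(timestamp) every `interval` seconds.

    `send` is whatever delivers the beat (an HTTP POST in practice). Failures
    are swallowed on purpose: a missed beat IS the signal, and the sender must
    never crash the agent it is watching.
    """
    def loop():
        while True:
            try:
                send(time.time())
            except Exception:
                pass  # network blips are expected; the platform sees the gap
            time.sleep(interval)

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

Because the thread is a daemon, it dies with the process — which is exactly the behavior you want: no process, no beats.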
How Health Is Computed
The platform tracks the timestamp of each heartbeat and computes health automatically:
| Time Since Last Heartbeat | Status | What It Means |
| --- | --- | --- |
| < 90 seconds | healthy | Agent is alive and reporting |
| 90 - 300 seconds | degraded | Agent may be stuck or overloaded |
| > 300 seconds | unreachable | Agent is down or not reporting |
| Manual kill | killed | Operator explicitly blocked this agent |
The thresholds are designed around the 30-second default interval. A healthy agent with interval_seconds=30 sends a heartbeat every 30 seconds. If the platform hasn't heard from it in 90 seconds (3 missed heartbeats), something is probably wrong. If 5 minutes pass, it's gone.
The degraded state is the useful one. It's the early warning. The agent isn't dead yet, but it's missed a couple of beats. Maybe the event loop is under load. Maybe a GC pause ate 45 seconds. Maybe the network is flaky. You have a window to investigate before the agent goes fully unreachable.
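The mapping from last-seen timestamp to status is simple enough to sketch. This is an illustrative reimplementation of the thresholds in the table above, not AXME's server-side code:

```python
import time

def compute_status(last_seen_ts, killed=False, now=None):
    """Map a last-seen heartbeat timestamp to a health status.

    Thresholds assume the default 30-second interval: 90 seconds is
    three missed beats, 300 seconds means the agent is effectively gone.
    """
    if killed:
        return "killed"  # a manual kill overrides whatever the heartbeat says
    age = (now if now is not None else time.time()) - last_seen_ts
    if age < 90:
        return "healthy"
    if age < 300:
        return "degraded"  # early warning: stuck, overloaded, or flaky network
    return "unreachable"
```

Note that the check is pure bookkeeping on the server side: the agent never has to answer a probe, it only has to keep talking.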
What Happens When an Agent Goes Down
Here's the timeline with heartbeat monitoring:
```
00:00  Agent starts. Heartbeat begins.
00:30  Heartbeat sent. Status: healthy.
01:00  Heartbeat sent. Status: healthy.
01:15  Agent deadlocks on a database connection pool.
01:30  No heartbeat. (Agent is stuck, can't send.)
02:00  No heartbeat for 90s. Status: healthy -> degraded.
02:00  Platform logs state transition.
05:15  No heartbeat for 300s. Status: degraded -> unreachable.
05:15  Platform blocks new intent delivery to this agent.
```
Without heartbeat monitoring:
```
00:00  Agent starts.
01:15  Agent deadlocks.
  ...
  ...
03:15  Someone notices the task queue growing.
03:30  Engineer SSHs in. "The process is running."
03:45  "The container is green. Logs look... wait, no new logs since 1:15."
04:00  Engineer restarts the agent.
```
The difference: 2 minutes vs 2.75 hours. And the first scenario is automatic - no human needs to notice anything.
Heartbeat with Metrics
The heartbeat isn't just a ping. It can carry operational metrics, flushed automatically with each beat:
```python
client.mesh.start_heartbeat(include_metrics=True)

# As the agent processes work, report metrics
client.mesh.report_metric(success=True, latency_ms=234.5, cost_usd=0.003)
client.mesh.report_metric(success=False, latency_ms=5012.0)

# Metrics are buffered in memory and sent with the next heartbeat
# No separate metrics pipeline needed
```
Every 30 seconds, the heartbeat sends both "I'm alive" and "here's how I'm doing" - success rate, average latency, cost accumulation. The platform aggregates per agent and exposes it through the CLI and dashboard.
This turns the heartbeat from a binary alive/dead signal into a continuous health signal. An agent that's alive but processing tasks at 20x normal latency shows up before it becomes a problem.
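One way to picture the buffering: metrics accumulate in memory and get rolled into a summary when the next beat fires. A hypothetical sketch of that accumulator — AXME's actual wire format and aggregation are not shown here:

```python
class MetricsBuffer:
    """In-memory metrics buffer flushed with each heartbeat (illustrative)."""

    def __init__(self):
        self._events = []

    def report(self, success, latency_ms, cost_usd=0.0):
        """Record one unit of work; cheap enough to call on every task."""
        self._events.append((success, latency_ms, cost_usd))

    def flush(self):
        """Aggregate and clear; the summary rides along with the next beat."""
        if not self._events:
            return None
        n = len(self._events)
        summary = {
            "count": n,
            "success_rate": sum(1 for s, _, _ in self._events if s) / n,
            "avg_latency_ms": sum(l for _, l, _ in self._events) / n,
            "cost_usd": sum(c for _, _, c in self._events),
        }
        self._events.clear()
        return summary
```

The design point is piggybacking: since a beat goes out every 30 seconds anyway, the metrics get a free transport with no extra pipeline.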
Kill and Resume
Sometimes an agent needs to be stopped. Not crashed - intentionally blocked. Maybe it's misbehaving. Maybe you're doing maintenance. Maybe it's burning through your API budget.
```python
# From code (address_id from list_agents)
client.mesh.kill("addr_abc123")
```
A killed agent enters the killed state. Even if its heartbeat thread is still running, the gateway keeps it killed. No intents are delivered. It stays killed until explicitly resumed:
```python
client.mesh.resume("addr_abc123")
```
Or kill/resume from the dashboard at mesh.axme.ai with one click.
This is different from the agent crashing. A crash leads to unreachable. A kill is deliberate. The distinction matters for alerting - you don't want to page on-call for an agent you intentionally stopped.
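The gateway-side bookkeeping that makes a kill "sticky" can be sketched like this — an illustrative model, not the real gateway:

```python
class AgentRecord:
    """Sketch of gateway state: a manual kill holds even while beats arrive."""

    UNREACHABLE_AFTER = 300  # seconds without a heartbeat

    def __init__(self):
        self.killed = False
        self.last_seen = 0.0

    def heartbeat(self, ts):
        self.last_seen = ts  # recorded, but does not clear a kill

    def kill(self):
        self.killed = True   # deliberate block: no intents, no on-call page

    def resume(self):
        self.killed = False

    def deliverable(self, now):
        """Should new intents be routed to this agent right now?"""
        if self.killed:
            return False
        return now - self.last_seen < self.UNREACHABLE_AFTER
```

Keeping `killed` as a separate flag, rather than faking a dead heartbeat, is what lets alerting tell an operator action apart from a crash.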
Fleet Visibility
When you have 20 agents across 4 machines, the dashboard matters more than any individual heartbeat.
The AXME Mesh Dashboard at mesh.axme.ai shows complete fleet health in real time:
Open it with:
```
axme mesh dashboard

report-generator    killed (manual)

Summary: 2 healthy, 1 degraded, 1 unreachable, 1 killed
```
One command. Complete fleet health. No SSH. No Grafana. No log aggregation pipeline.
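The summary line is just a count of per-agent statuses. A sketch of how a line like the one above could be produced, assuming the platform hands back a status string per agent:

```python
from collections import Counter

def fleet_summary(statuses):
    """Roll per-agent statuses into a one-line fleet summary."""
    counts = Counter(statuses)
    order = ("healthy", "degraded", "unreachable", "killed")
    return ", ".join(f"{counts.get(s, 0)} {s}" for s in order)
```

Fixing the display order (healthy first, killed last) keeps the line scannable: the numbers you care about at 2 AM are always in the same place.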
The Real Cost of Silent Failures
Every team running agents at scale has the same story. An agent went down on Friday afternoon. Nobody noticed until Monday morning. 60 hours of missed processing. Customer complaints. Backlog that took another 8 hours to clear.
The fix isn't complicated. It's one function call. The hard part is remembering that containers passing health checks is not the same as agents doing work.
```python
client.mesh.start_heartbeat()
```
That's the whole fix.
Try It
Working example - start an agent with heartbeat, kill the process, watch the status transition from healthy to degraded to unreachable:
github.com/AxmeAI/ai-agent-heartbeat-monitoring
Built with AXME - heartbeat, health detection, and fleet monitoring for AI agents. Alpha - feedback welcome.