
Your AI Agent Stopped Responding 2 Hours Ago. Nobody Noticed.

Dev.to AI · by George Belsky · April 5, 2026 · 6 min read


Your agent is deployed. Pod is running. Container passes liveness probes. Grafana shows a flat green line. Everything looks fine.

Except the agent stopped processing work 2 hours ago. It's alive - the process is there - but it's stuck. Deadlocked on a thread. Blocked on a full queue. Spinning in a retry loop that will never succeed. Silently swallowing exceptions in a while True.

Nobody knows until a customer reports it. Or until someone opens a dashboard at 5 PM and wonders why the task queue has been growing all afternoon.

Why Container Health Checks Don't Work for Agents

Kubernetes liveness probes check one thing: is the process responding to HTTP? If your agent serves a /healthz endpoint, the probe passes. The agent is "healthy."

But responding to /healthz and processing work are two different things. An agent can:

  • Deadlock on an internal lock while still serving HTTP

  • OOM-kill its worker thread while the main thread stays alive

  • Enter an infinite retry loop on a broken downstream API

  • Silently drop into an except: pass branch and stop doing anything

The process is running. The container is green. The agent is useless.
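The gap is easy to reproduce. This is a minimal stand-alone sketch (the healthz function and worker are hypothetical, with no real HTTP server): the liveness check keeps answering while the worker thread sits blocked on a full queue.

```python
import threading, queue, time

# Hypothetical agent: liveness stays green while the worker is stuck.
task_queue = queue.Queue(maxsize=1)

def worker():
    task_queue.put("job-1")
    task_queue.put("job-2")  # queue is full: blocks here forever

def healthz():
    # What a liveness probe effectively asks: "is the process responding?"
    return 200

t = threading.Thread(target=worker, daemon=True)
t.start()
time.sleep(0.2)

print(healthz())           # 200 -- the probe passes
print(t.is_alive())        # True -- the thread exists, but it is stuck
print(task_queue.qsize())  # 1 -- no progress past the first item
```

The probe and the work never intersect: the process can answer 200 indefinitely while the queue stops draining.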

Container health check: "Is the process alive?" -> YES
What you actually need: "Is the agent doing work?" -> NO


This gap exists because container orchestration was designed for stateless web servers, not for long-running agents that hold state, maintain connections, and process work asynchronously.

The Heartbeat Pattern

The fix is old. Web services solved this 15 years ago with heartbeat monitoring. The idea is simple: the agent periodically reports "I am alive and working." If the report stops, something is wrong.

The difference between a health check and a heartbeat: health checks are passive (something pings you), heartbeats are active (you report out). A stuck agent can't respond to pings, but a stuck agent also can't send heartbeats. That's the point.

But building heartbeat infrastructure for agents means:

# 1. Heartbeat sender (added to every agent)
import threading, time, requests

def heartbeat_loop(agent_id, interval=30):
    while True:
        try:
            requests.post(
                "https://monitoring.internal/heartbeat",
                json={"agent_id": agent_id, "ts": time.time()},
                timeout=5,
            )
        except Exception:
            pass
        time.sleep(interval)

threading.Thread(target=heartbeat_loop, args=("my-agent",), daemon=True).start()

2. Heartbeat checker (separate cron process)

3. Redis/Postgres for heartbeat storage

4. Alerting rules (Slack, PagerDuty)

5. Dashboard showing last-seen times

6. Logic to distinguish "stopped intentionally" from "crashed"

7. Cleanup for deregistered agents


That's a monitoring system. For each agent framework you use, for each deployment environment, maintained forever.
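Even item 2 alone is real code to own. A sketch of a standalone heartbeat checker, where the dict stands in for the Redis/Postgres store and the threshold is an assumption:

```python
import time

STALE_AFTER = 90  # seconds without a heartbeat before we alert (assumed)

def find_stale(heartbeats: dict[str, float], now: float) -> list[str]:
    # heartbeats maps agent_id -> unix timestamp of the last heartbeat
    return [agent for agent, ts in heartbeats.items() if now - ts > STALE_AFTER]

now = time.time()
heartbeats = {"agent-a": now - 10, "agent-b": now - 400}
print(find_stale(heartbeats, now))  # ['agent-b']
```

And this still needs a scheduler to run it, alert routing when the list is non-empty, and cleanup when agents are retired.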

One Line Instead

from axme import AxmeClient, AxmeClientConfig
import os

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))
client.mesh.start_heartbeat()


That's it. A daemon thread wakes up every 30 seconds, sends a heartbeat to the platform, and goes back to sleep. When the agent stops - crash, deadlock, OOM, network partition - the heartbeats stop. The platform notices.

No Redis. No cron. No Prometheus. No webhook integrations. No alerting rules to maintain.

How Health Is Computed

The platform tracks the timestamp of each heartbeat and computes health automatically:

Time Since Last Heartbeat    Status        What It Means
< 90 seconds                 healthy       Agent is alive and reporting
90-300 seconds               degraded      Agent may be stuck or overloaded
> 300 seconds                unreachable   Agent is down or not reporting
Manual kill                  killed        Operator explicitly blocked this agent

The thresholds are designed around the 30-second default interval. A healthy agent with interval_seconds=30 sends a heartbeat every 30 seconds. If the platform hasn't heard from it in 90 seconds (3 missed heartbeats), something is probably wrong. If 5 minutes pass, it's gone.
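The status logic above is simple enough to restate in a few lines. This is a hypothetical reimplementation for illustration; the real computation happens server-side on the platform:

```python
# Illustrative status computation matching the thresholds in the table
# (90s / 300s); not AXME's actual implementation.
def status(seconds_since_heartbeat: float, manually_killed: bool = False) -> str:
    if manually_killed:
        return "killed"  # operator block wins over any heartbeat freshness
    if seconds_since_heartbeat < 90:
        return "healthy"
    if seconds_since_heartbeat < 300:
        return "degraded"
    return "unreachable"

print(status(30))                        # healthy
print(status(120))                       # degraded
print(status(600))                       # unreachable
print(status(10, manually_killed=True))  # killed
```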

The degraded state is the useful one. It's the early warning. The agent isn't dead yet, but it's missed a couple of beats. Maybe the event loop is under load. Maybe a GC pause ate 45 seconds. Maybe the network is flaky. You have a window to investigate before the agent goes fully unreachable.

What Happens When an Agent Goes Down

Here's the timeline with heartbeat monitoring:

00:00 Agent starts. Heartbeat begins.
00:30 Heartbeat sent. Status: healthy.
01:00 Heartbeat sent. Status: healthy.
01:15 Agent deadlocks on a database connection pool.
01:30 No heartbeat. (Agent is stuck, can't send.)
02:30 No heartbeat for 90s. Status: healthy -> degraded. Platform logs the state transition.
06:00 No heartbeat for 300s. Status: degraded -> unreachable. Platform blocks new intent delivery to this agent.


Without heartbeat monitoring:

00:00 Agent starts.
01:15 Agent deadlocks.
...
03:15 Someone notices the task queue growing.
03:30 Engineer SSHs in. "The process is running."
03:45 "The container is green. Logs look... wait, no new logs since 1:15."
04:00 Engineer restarts the agent.


The difference: minutes versus 2.75 hours. And the first scenario is automatic - no human needs to notice anything.

Heartbeat with Metrics

The heartbeat isn't just a ping. It can carry operational metrics, flushed automatically with each beat:

client.mesh.start_heartbeat(include_metrics=True)

# As the agent processes work, report metrics
client.mesh.report_metric(success=True, latency_ms=234.5, cost_usd=0.003)
client.mesh.report_metric(success=False, latency_ms=5012.0)

# Metrics are buffered in memory and sent with the next heartbeat
# No separate metrics pipeline needed


Every 30 seconds, the heartbeat sends both "I'm alive" and "here's how I'm doing" - success rate, average latency, cost accumulation. The platform aggregates per agent and exposes it through the CLI and dashboard.

This turns the heartbeat from a binary alive/dead signal into a continuous health signal. An agent that's alive but processing tasks at 20x normal latency shows up before it becomes a problem.
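The buffering side is worth making concrete. This is a hedged sketch of what client-side aggregation could look like; the class and its flush shape are modeled on the article's API, not taken from AXME source:

```python
# Hypothetical client-side metric buffer: records accumulate in memory,
# then are aggregated and cleared on each heartbeat flush.
class MetricsBuffer:
    def __init__(self):
        self.records = []  # (success, latency_ms, cost_usd) tuples

    def report(self, success: bool, latency_ms: float, cost_usd: float = 0.0):
        self.records.append((success, latency_ms, cost_usd))

    def flush(self) -> dict:
        # Aggregate everything since the last heartbeat, then reset
        n = len(self.records)
        summary = {
            "count": n,
            "success_rate": sum(s for s, _, _ in self.records) / n if n else None,
            "avg_latency_ms": sum(l for _, l, _ in self.records) / n if n else None,
            "cost_usd": sum(c for _, _, c in self.records),
        }
        self.records.clear()
        return summary

buf = MetricsBuffer()
buf.report(True, 234.5, 0.003)
buf.report(False, 5012.0)
print(buf.flush())
```

The design choice that matters: aggregation piggybacks on the heartbeat interval, so a slow or failing agent degrades one signal, not two separate pipelines.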

Kill and Resume

Sometimes an agent needs to be stopped. Not crashed - intentionally blocked. Maybe it's misbehaving. Maybe you're doing maintenance. Maybe it's burning through your API budget.

# From code (address_id from list_agents)
client.mesh.kill("addr_abc123")


A killed agent enters the killed state. Even if its heartbeat thread is still running, the gateway keeps it killed. No intents are delivered. It stays killed until explicitly resumed:

client.mesh.resume("addr_abc123")


Or kill/resume from the dashboard at mesh.axme.ai with one click.

This is different from the agent crashing. A crash leads to unreachable. A kill is deliberate. The distinction matters for alerting - you don't want to page on-call for an agent you intentionally stopped.
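Precedence is the key property here: a deliberate block must win over liveness. A small illustrative sketch of gateway-side delivery logic (hypothetical, not AXME internals):

```python
# Sketch: why a killed agent stays killed even with fresh heartbeats.
killed_agents = {"addr_abc123"}

def can_deliver(address_id: str, seconds_since_heartbeat: float) -> bool:
    if address_id in killed_agents:
        return False  # operator block wins, regardless of heartbeats
    return seconds_since_heartbeat < 300  # unreachable agents get nothing

print(can_deliver("addr_abc123", 5))    # False -- alive but killed
print(can_deliver("addr_def456", 5))    # True
print(can_deliver("addr_def456", 900))  # False -- unreachable
```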

Fleet Visibility

When you have 20 agents across 4 machines, the dashboard matters more than any individual heartbeat.

The AXME Mesh Dashboard at mesh.axme.ai shows complete fleet health in real time:

Open it with:

axme mesh dashboard

report-generator    killed (manual)

Summary: 2 healthy, 1 degraded, 1 unreachable, 1 killed


One command. Complete fleet health. No SSH. No Grafana. No log aggregation pipeline.

The Real Cost of Silent Failures

Every team running agents at scale has the same story. An agent went down on Friday afternoon. Nobody noticed until Monday morning. 60 hours of missed processing. Customer complaints. Backlog that took another 8 hours to clear.

The fix isn't complicated. It's one function call. The hard part is remembering that containers passing health checks is not the same as agents doing work.

client.mesh.start_heartbeat()


That's the whole fix.

Try It

Working example - start an agent with heartbeat, kill the process, watch the status transition from healthy to degraded to unreachable:

github.com/AxmeAI/ai-agent-heartbeat-monitoring

Built with AXME - heartbeat, health detection, and fleet monitoring for AI agents. Alpha - feedback welcome.
