I Couldn’t Debug My AI/ML GPU Incident - So I Built My Own Tool
Several weeks ago, I ran into problems with ML jobs running on my GPU server. Alerts fired at midnight, and one of the jobs failed because it ran out of GPU memory.
The next morning, I performed a root-cause analysis to understand what had happened overnight. However, I couldn't identify the issue because I only had access to overall GPU usage metrics at the current moment. I used nvidia-smi and nvtop to inspect the current state, but there was no trace of the issue from the previous night. I realized I needed a solution to prevent similar problems in the future.
I tried using DCGM exporter to expose GPU metrics, but it couldn’t provide PID-level metrics. I also tested it in a Kubernetes environment to get pod-level metrics, but it didn’t work because our GPUs only support time-slicing mode.
Therefore, I developed an open-source tool called gpuxray to monitor GPUs at the process level. gpuxray has helped our team significantly when observing and investigating bottlenecks in AI/ML processes running on Linux servers. It exposes metrics in Prometheus format, which we use to build Grafana dashboards for visualizing resource usage at the process level.
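To make the metrics output concrete, here is a small sketch of what per-process GPU metrics can look like in Prometheus text exposition format. The metric name, labels, and values below are illustrative assumptions for this post, not gpuxray's actual output:

```python
def render_metrics(samples):
    """Render per-process samples in Prometheus text exposition format.

    `samples` maps (pid, comm) -> bytes of GPU memory in use.
    The metric and label names here are illustrative assumptions,
    not gpuxray's real output.
    """
    lines = [
        "# HELP gpuxray_process_memory_used_bytes GPU memory allocated per process",
        "# TYPE gpuxray_process_memory_used_bytes gauge",
    ]
    for (pid, comm), used in sorted(samples.items()):
        lines.append(
            f'gpuxray_process_memory_used_bytes{{pid="{pid}",comm="{comm}"}} {used}'
        )
    return "\n".join(lines) + "\n"

# Two hypothetical training/inference processes holding GPU memory.
print(render_metrics({(4242, "train.py"): 3200000000, (4243, "infer.py"): 1100000000}))
```

Prometheus scrapes text like this from an HTTP endpoint, and each labeled series can then be graphed per PID in Grafana.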
We deployed gpuxray in a Kubernetes cluster as a DaemonSet on all GPU nodes that need to be monitored.
> kubectl -n kube-operators get daemonset/gpuxray
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
gpuxray   2         2         2       2            2           node.k8s.cluster/gpu=exists   20d
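A DaemonSet manifest along these lines could pin gpuxray to GPU nodes. Only the namespace and node selector label are taken from the output above; the image name, port, and security settings are assumptions for illustration:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpuxray
  namespace: kube-operators
spec:
  selector:
    matchLabels:
      app: gpuxray
  template:
    metadata:
      labels:
        app: gpuxray
    spec:
      nodeSelector:
        node.k8s.cluster/gpu: exists   # schedule only on GPU nodes
      hostPID: true                    # host PID visibility for tracing (assumption)
      containers:
        - name: gpuxray
          image: vuvietnguyenit/gpuxray:latest   # illustrative image name
          securityContext:
            privileged: true           # needed to load eBPF programs (assumption)
          ports:
            - containerPort: 9101      # hypothetical Prometheus metrics port
```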
With the setup described here, we can easily enable per-process GPU observability.
gpuxray achieves high performance while consuming minimal resources, because it is built on eBPF to trace GPU memory-related events. This is powerful: eBPF lets us observe what is happening at the kernel level for a specific use case; here, we create probes that attach to the CUDA API. The project is built on a solid codebase, making it easy to extend in the future. If you have ideas, feel free to open a discussion or a pull request.
Design and Architecture
Now, I will describe the architecture of gpuxray to help you understand how it works.
Basically, the userspace code handles the main logic and is written in Go. The eBPF program is attached to CUDA API calls; when these APIs are invoked, events are captured. The eBPF program performs lightweight processing at the kernel level, updates eBPF maps, and sends events to the ring buffer. The userspace code then consumes events from the ring buffer, processes them, and produces the final metrics output.
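As a rough model of the userspace side, the loop below aggregates raw (pid, op, size) events into per-PID memory counters, the way a consumer draining the ring buffer might. The event shape and field names are assumptions; the real gpuxray code is written in Go and reads from an eBPF ring buffer:

```python
from collections import defaultdict

def consume(events):
    """Aggregate raw allocation events into per-PID memory usage.

    Each event is (pid, op, size) where op is "malloc" or "free".
    This mimics, in spirit, what the userspace side does with events
    drained from the eBPF ring buffer (field names are assumptions).
    """
    usage = defaultdict(int)  # pid -> bytes currently allocated
    for pid, op, size in events:
        if op == "malloc":
            usage[pid] += size
        elif op == "free":
            usage[pid] -= size
    return dict(usage)

# Two processes allocate; the first frees part of its memory again.
events = [
    (4242, "malloc", 1 << 20),
    (4243, "malloc", 2 << 20),
    (4242, "malloc", 1 << 20),
    (4242, "free", 1 << 20),
]
print(consume(events))  # {4242: 1048576, 4243: 2097152}
```

Snapshots of this per-PID state are what ultimately become the Prometheus gauges.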
Performance and Resource Usage
With the mon option, gpuxray consumes almost no resources on the GPU server.
When tracing memory leaks for a specific PID using the memtrace option, I used a Python script to generate more than 2,000 malloc/free calls per second on the GPU and observed resource usage: gpuxray consumed only about 8% of a single CPU core (on a server with 32 CPU cores and 125 GB of RAM).
This is impressive, because ~2,000 malloc/free operations per second is far heavier than a typical real-world workload. As a result, we don't need to worry about performance or resource overhead when using gpuxray.
Feel free to explore the project, try it out, and contribute your ideas: https://github.com/vuvietnguyenit/gpuxray
DEV Community
https://dev.to/vuvietnguyenit/i-couldnt-debug-my-aiml-gpu-incident-so-i-built-my-own-tool-33n9