[P] PhAIL (phail.ai) – an open benchmark for robot AI on real hardware. Best model: 5% of human throughput, needs help every 4 minutes.
I spent the last year trying to answer a simple question: how good are VLA models on real commercial tasks? Not demos, not simulation, not success rates on 10 tries. Actual production metrics on real hardware. I couldn't find honest numbers anywhere, so I built a benchmark.

Setup: DROID platform, bin-to-bin order picking – one of the most common warehouse and industrial operations. Four models fine-tuned on the same real-robot dataset, evaluated blind (the operator doesn't know which model is running). We measure Units Per Hour (UPH) and Mean Time Between Failures (MTBF) – the metrics operations people actually use.

Results (full data with video and telemetry for every run at phail.ai):

| Model | UPH | MTBF |
|---|---|---|
| OpenPI (pi0.5) | 65 | 4.0 min |
| GR00T | 60 | 3.5 min |
| ACT | 44 | 2.8 min |
| SmolVLA | 18 | 1.2 min |
| Teleop / Fi | | |
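For readers unfamiliar with the two metrics, here is a minimal sketch of how UPH and MTBF can be computed from per-run logs. It is not PhAIL's actual pipeline: the `Run` record, its field names, and the choice to count every human intervention as one failure are assumptions made for illustration, and the sample numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One evaluation run. All field names are hypothetical."""
    duration_s: float   # operating time of the run, in seconds
    units_picked: int   # items successfully moved from bin to bin
    interventions: int  # times a human had to step in (assumed = failures)


def units_per_hour(runs):
    """Total completed units divided by total operating hours."""
    total_s = sum(r.duration_s for r in runs)
    if total_s <= 0:
        raise ValueError("total operating time must be positive")
    total_units = sum(r.units_picked for r in runs)
    return total_units / (total_s / 3600.0)


def mtbf_minutes(runs):
    """Total operating minutes divided by the number of failures."""
    total_s = sum(r.duration_s for r in runs)
    failures = sum(r.interventions for r in runs)
    if failures == 0:
        return float("inf")  # no failures observed in these runs
    return (total_s / 60.0) / failures


# Invented sample data, not taken from the benchmark.
runs = [Run(600, 10, 2), Run(1200, 20, 4)]
print(f"UPH:  {units_per_hour(runs):.1f}")
print(f"MTBF: {mtbf_minutes(runs):.1f} min")
```

Pooling time and counts across runs before dividing, rather than averaging per-run ratios, keeps short runs from being over-weighted; whether the benchmark aggregates this way is not stated in the post.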
Could not retrieve the full article text.
Read on Reddit r/MachineLearning: https://www.reddit.com/r/MachineLearning/comments/1sajdwr/p_phail_phailai_an_open_benchmark_for_robot_ai_on/


