Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in The Wild
Abstract: Reliable evaluation of AI agents operating in complex, real-world environments requires methodologies that are robust, transparent, and contextually aligned with the tasks agents are intended to perform. This study identifies persistent shortcomings in existing AI agent evaluation practices that are particularly acute in web agent evaluation, as exemplified by our audit of WebVoyager, including task-framing ambiguity and operational variability that hinder meaningful and reproducible performance comparisons. To address these challenges, we introduce Emergence WebVoyager, an enhanced version of the WebVoyager benchmark that standardizes evaluation methodology through clear guidelines for task instantiation, failure handling, annotation, and reporting. Emergence WebVoyager achieves an inter-annotator agreement of 95.9%, indicating improved clarity and reliability in both task formulation and evaluation. Applying this framework to evaluate OpenAI Operator reveals substantial performance variation across domains and task types, with an overall success rate of 68.6%, substantially lower than the 87% previously reported by OpenAI, demonstrating the utility of our approach for more rigorous and comparable web agent evaluation.
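The two headline figures in the abstract (inter-annotator agreement and success rate, overall and per domain) are simple ratios. The sketch below shows one way such figures could be computed. It assumes plain percent agreement between two annotators, which the abstract does not confirm as the paper's metric; the function names, labels, domains, and data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def percent_agreement(labels_a, labels_b):
    """Share of tasks on which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def success_rate_by_domain(results):
    """Map each domain to its success rate, given (domain, succeeded) pairs."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for domain, succeeded in results:
        totals[domain] += 1
        wins[domain] += int(succeeded)
    return {domain: wins[domain] / totals[domain] for domain in totals}


def overall_success_rate(results):
    """Success rate pooled over all tasks, regardless of domain."""
    results = list(results)
    if not results:
        raise ValueError("results must be non-empty")
    return sum(int(succeeded) for _, succeeded in results) / len(results)


# Hypothetical toy data for illustration only (not from the paper).
annotator_1 = ["success", "failure", "success", "success"]
annotator_2 = ["success", "failure", "failure", "success"]
runs = [("shopping", True), ("shopping", False), ("maps", True), ("maps", True)]

agreement = percent_agreement(annotator_1, annotator_2)
by_domain = success_rate_by_domain(runs)
overall = overall_success_rate(runs)
print(f"agreement={agreement:.1%} overall={overall:.1%} by_domain={by_domain}")
```

Reporting the per-domain breakdown alongside the pooled rate matters because a single overall number can hide the kind of variation across domains that the abstract describes. A chance-corrected statistic such as Cohen's kappa would be a different computation from the percent agreement assumed here.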
Subjects:
Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.29020 [cs.AI]
(or arXiv:2603.29020v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.29020
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Mowafak Allaham
[v1] Mon, 30 Mar 2026 21:27:28 UTC (377 KB)