
Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in The Wild

ArXiv CS.AI · Deepak Akkil, Mowafak Allaham, Amal Raj, Tamer Abuelsaad, Ravi Kokku · April 1, 2026

arXiv:2603.29020v1 Announce Type: new


Abstract: Reliable evaluation of AI agents operating in complex, real-world environments requires methodologies that are robust, transparent, and contextually aligned with the tasks agents are intended to perform. This study identifies persistent shortcomings in existing AI agent evaluation practices that are particularly acute in web agent evaluation, as exemplified by our audit of WebVoyager, including task-framing ambiguity and operational variability that hinder meaningful and reproducible performance comparisons. To address these challenges, we introduce Emergence WebVoyager, an enhanced version of the WebVoyager benchmark that standardizes evaluation methodology through clear guidelines for task instantiation, failure handling, annotation, and reporting. Emergence WebVoyager achieves an inter-annotator agreement of 95.9%, indicating improved clarity and reliability in both task formulation and evaluation. Applying this framework to evaluate OpenAI Operator reveals substantial performance variation across domains and task types, with an overall success rate of 68.6%, substantially lower than the 87% previously reported by OpenAI, demonstrating the utility of our approach for more rigorous and comparable web agent evaluation.
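The abstract reports two headline metrics: inter-annotator agreement (95.9%) and an overall task success rate (68.6%). As a rough illustration of what such figures measure, here is a minimal sketch computing simple percent agreement between two annotators and a success rate over pass/fail task judgments. The labels below are invented for demonstration; the paper does not specify its exact agreement statistic, so plain percent agreement is an assumption here.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assign the same label."""
    assert labels_a and len(labels_a) == len(labels_b)
    return sum(x == y for x, y in zip(labels_a, labels_b)) / len(labels_a)

def success_rate(labels):
    """Fraction of task runs judged successful (1 = pass, 0 = fail)."""
    return sum(labels) / len(labels)

# Hypothetical pass/fail judgments for eight web-agent task runs.
annotator_1 = [1, 1, 0, 1, 0, 1, 1, 0]
annotator_2 = [1, 1, 0, 1, 1, 1, 1, 0]

print(f"agreement: {percent_agreement(annotator_1, annotator_2):.1%}")
print(f"success rate (annotator 1): {success_rate(annotator_1):.1%}")
```

In practice a chance-corrected statistic such as Cohen's kappa is often preferred over raw percent agreement, since agreement by chance inflates the raw figure when one label dominates.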

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.29020 [cs.AI]

(or arXiv:2603.29020v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.29020

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Mowafak Allaham [view email] [v1] Mon, 30 Mar 2026 21:27:28 UTC (377 KB)
