Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
Authors: Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang
Abstract: Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development that spans three levels: static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.26648 [cs.SE]
(or arXiv:2603.26648v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2603.26648
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Zehai He
[v1] Fri, 27 Mar 2026 17:50:45 UTC (25,879 KB)