SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents
arXiv:2603.29139v1 Announce Type: new
Authors: Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Nathaniel Gorski, Jianxin Sun, Guoxi Liu, Helgi I. Ingolfsson, David Lenz, Hanqi Guo, Hongfeng Yu, Teja Leburu, Michael Molash, Bei Wang, Tom Peterka, Chaoli Wang, Shusen Liu
Abstract: Recent advances in large language models (LLMs) have enabled agentic systems that translate natural language intent into executable scientific visualization (SciVis) tasks. Despite rapid progress, the community lacks a principled and reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi-step analysis settings. We present SciVisAgentBench, a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. Our benchmark is grounded in a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently comprises 108 expert-crafted cases covering diverse SciVis scenarios. To enable reliable assessment, we introduce a multimodal outcome-centric evaluation pipeline that combines LLM-based judging with deterministic evaluators, including image-based metrics, code checkers, rule-based verifiers, and case-specific evaluators. We also conduct a validity study with 12 SciVis experts to examine the agreement between human and LLM judges. Using this framework, we evaluate representative SciVis agents and general-purpose coding agents to establish initial baselines and reveal capability gaps. SciVisAgentBench is designed as a living benchmark to support systematic comparison, diagnose failure modes, and drive progress in agentic SciVis. The benchmark is available at this https URL.
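The abstract describes an outcome-centric pipeline that blends deterministic evaluators (image-based metrics, code checkers, rule-based verifiers) with an LLM judge. A minimal sketch of how such scores might be combined is given below; all class, field, and function names (`CaseResult`, `aggregate`, `w_det`) are illustrative assumptions, not the benchmark's actual API, and the weighting scheme is a placeholder rather than the paper's scoring rule.

```python
from dataclasses import dataclass

# Hypothetical sketch: combining deterministic evaluators with an
# LLM-judge score for one benchmark case. Names and weights are
# illustrative only.

@dataclass
class CaseResult:
    image_score: float  # e.g. an image metric vs. a reference rendering, in [0, 1]
    code_ok: bool       # did the generated script pass the code checker?
    rules_ok: bool      # did rule-based verifiers pass?
    llm_score: float    # LLM-judge rating normalized to [0, 1]

def aggregate(result: CaseResult, w_det: float = 0.5) -> float:
    """Blend deterministic checks with the LLM judge.

    Hard failures (code or rule violations) zero out the deterministic
    component; otherwise the image metric carries it. w_det weights
    deterministic evidence against the LLM judge.
    """
    deterministic = result.image_score if (result.code_ok and result.rules_ok) else 0.0
    return w_det * deterministic + (1.0 - w_det) * result.llm_score

# Example: a case that passes all checks with a strong rendering match.
score = aggregate(CaseResult(image_score=0.9, code_ok=True, rules_ok=True, llm_score=0.8))
print(round(score, 2))  # 0.85
```

One design point this sketch highlights: treating checker and verifier failures as gates (zeroing the deterministic term) rather than as additive penalties keeps a plausible-looking but broken output from scoring well on image similarity alone.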
Subjects:
Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Cite as: arXiv:2603.29139 [cs.AI]
(or arXiv:2603.29139v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.29139
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Kuangshi Ai [v1] Tue, 31 Mar 2026 01:41:28 UTC (38,545 KB)

