SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents
arXiv:2603.29139v1 Announce Type: new
Authors: Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Nathaniel Gorski, Jianxin Sun, Guoxi Liu, Helgi I. Ingolfsson, David Lenz, Hanqi Guo, Hongfeng Yu, Teja Leburu, Michael Molash, Bei Wang, Tom Peterka, Chaoli Wang, Shusen Liu
Abstract: Recent advances in large language models (LLMs) have enabled agentic systems that translate natural language intent into executable scientific visualization (SciVis) tasks. Despite rapid progress, the community lacks a principled and reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi-step analysis settings. We present SciVisAgentBench, a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. Our benchmark is grounded in a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently comprises 108 expert-crafted cases covering diverse SciVis scenarios. To enable reliable assessment, we introduce a multimodal outcome-centric evaluation pipeline that combines LLM-based judging with deterministic evaluators, including image-based metrics, code checkers, rule-based verifiers, and case-specific evaluators. We also conduct a validity study with 12 SciVis experts to examine the agreement between human and LLM judges. Using this framework, we evaluate representative SciVis agents and general-purpose coding agents to establish initial baselines and reveal capability gaps. SciVisAgentBench is designed as a living benchmark to support systematic comparison, diagnose failure modes, and drive progress in agentic SciVis. The benchmark is available at this https URL.
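The abstract describes an outcome-centric pipeline that blends deterministic evaluators (image-based metrics, code checkers, rule-based verifiers) with an LLM judge. A minimal sketch of how such scores might be combined is given below; all class, field, and function names (`CaseResult`, `aggregate`, `w_det`) are illustrative assumptions, not the benchmark's actual API, and the weighting scheme is a placeholder rather than the paper's scoring rule.

```python
from dataclasses import dataclass

# Hypothetical sketch: combining deterministic evaluators with an
# LLM-judge score for one benchmark case. Names and weights are
# illustrative only.

@dataclass
class CaseResult:
    image_score: float  # e.g. an image metric vs. a reference rendering, in [0, 1]
    code_ok: bool       # did the generated script pass the code checker?
    rules_ok: bool      # did rule-based verifiers pass?
    llm_score: float    # LLM-judge rating normalized to [0, 1]

def aggregate(result: CaseResult, w_det: float = 0.5) -> float:
    """Blend deterministic checks with the LLM judge.

    Hard failures (code or rule violations) zero out the deterministic
    component; otherwise the image metric carries it. w_det weights
    deterministic evidence against the LLM judge.
    """
    deterministic = result.image_score if (result.code_ok and result.rules_ok) else 0.0
    return w_det * deterministic + (1.0 - w_det) * result.llm_score

# Example: a case that passes all checks with a strong rendering match.
score = aggregate(CaseResult(image_score=0.9, code_ok=True, rules_ok=True, llm_score=0.8))
print(round(score, 2))  # 0.85
```

One design point this sketch highlights: treating checker and verifier failures as gates (zeroing the deterministic term) rather than as additive penalties keeps a plausible-looking but broken output from scoring well on image similarity alone.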
Subjects:
Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Cite as: arXiv:2603.29139 [cs.AI]
(or arXiv:2603.29139v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.29139
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Kuangshi Ai [v1] Tue, 31 Mar 2026 01:41:28 UTC (38,545 KB)

