Cost-Efficient Estimation of General Abilities Across Benchmarks

arXiv cs.CL · by Michael Krumdick, Adam Wiemerslage, Seth Ebner, Charles Lovering, Chris Tanner · April 4, 2026

Abstract: Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
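The abstract does not spell out the selection rule, so the following is only a minimal sketch of what cost-aware adaptive item selection could look like, not the authors' method. It assumes a standard multidimensional two-parameter logistic (2PL) IRT model, a D-optimality criterion (gain in the log-determinant of the Fisher information), and a hypothetical `discount` exponent that divides the gain by the item's token cost. All names (`item_information`, `select_item`, `discount`) and the toy item parameters are illustrative assumptions.

```python
import numpy as np


def item_information(theta, a, b):
    """Fisher information of one item under a multidimensional 2PL model.

    Assumed model: P(correct) = sigmoid(a . theta + b),
    which gives information p * (1 - p) * a a^T.
    """
    p = 1.0 / (1.0 + np.exp(-(a @ theta + b)))
    return p * (1.0 - p) * np.outer(a, a)


def select_item(theta, A, B, costs, info, asked, discount=1.0):
    """Pick the unasked item with the best D-optimality gain per unit cost.

    discount=0 ignores cost entirely; larger values penalize expensive items.
    """
    _, base = np.linalg.slogdet(info)
    best, best_score = None, -np.inf
    for j in range(len(B)):
        if j in asked:
            continue
        candidate = info + item_information(theta, A[j], B[j])
        _, logdet = np.linalg.slogdet(candidate)
        score = (logdet - base) / costs[j] ** discount
        if score > best_score:
            best, best_score = j, score
    return best


# Toy setup (hypothetical values): 3 items, 2 latent abilities.
theta = np.zeros(2)                    # current ability estimate
A = np.array([[2.0, 0.0],              # item discrimination vectors
              [0.0, 1.0],
              [3.0, 0.0]])
B = np.zeros(3)                        # item intercepts
costs = np.array([1.0, 1.0, 50.0])     # token cost of each item
prior_info = np.eye(2)                 # assumed prior precision on theta

pick_info_only = select_item(theta, A, B, costs, prior_info, set(), discount=0.0)
pick_cost_aware = select_item(theta, A, B, costs, prior_info, set(), discount=1.0)
print(pick_info_only, pick_cost_aware)  # prints "2 0"
```

In this toy case the most informative item (index 2) is also 50 times more expensive, so the cost-aware rule prefers the cheaper item 0. As a side check on the reported figures, 1 − 22,000/141,000 ≈ 0.844, which is consistent with the quoted 85% reduction if the token counts are rounded.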

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2604.01418 [cs.CL]

(or arXiv:2604.01418v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.01418

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Michael Krumdick
[v1] Wed, 1 Apr 2026 21:34:33 UTC (1,001 KB)
