AI News Hub · by Eigenvector

Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation

arXiv cs.CL · by Subhadip Mitra · April 1, 2026

arXiv:2603.28769v1 Announce Type: cross


Abstract: Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions of samples. This scale is common when assessing model behavior across diverse domains or conducting comprehensive regression testing. We present Spark-LLM-Eval, a distributed evaluation framework built natively on Apache Spark. The system treats evaluation as a data-parallel problem, partitioning examples across executors and aggregating results with proper statistical accounting. Beyond raw throughput, we emphasize statistical rigor: every reported metric includes bootstrap confidence intervals, and model comparisons come with appropriate significance tests (paired t-tests, McNemar's test, or Wilcoxon signed-rank, depending on the metric type). The framework also addresses the cost problem inherent in LLM evaluation through content-addressable response caching backed by Delta Lake, which allows iterating on metric definitions without re-running inference. We describe the system architecture and the statistical methodology, and report benchmark results showing linear scaling with cluster size. The framework and all evaluation code are available as open source.
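The abstract's core design idea, evaluation as a data-parallel problem, can be illustrated outside of Spark. The sketch below is a hypothetical plain-Python stand-in for what the framework would run on executors: each partition is scored independently, and only commutative partial counts are merged, so partition order does not affect the result. All names (`score_partition`, `evaluate`) are illustrative, not the paper's API.

```python
# Hypothetical sketch of the data-parallel evaluation pattern described in
# the abstract: partition examples, score each partition independently,
# then merge partial (correct, total) counts. Spark-LLM-Eval does this on
# Spark executors (e.g. via mapPartitions); plain Python stands in here.
from typing import Callable, Iterable, List, Tuple

def score_partition(examples: Iterable[Tuple[str, str]],
                    model: Callable[[str], str]) -> Tuple[int, int]:
    """Return (num_correct, num_total) for one partition."""
    correct = total = 0
    for prompt, reference in examples:
        correct += int(model(prompt) == reference)
        total += 1
    return correct, total

def evaluate(examples: List[Tuple[str, str]],
             model: Callable[[str], str],
             num_partitions: int = 4) -> float:
    # Split the dataset into roughly equal partitions (Spark would do
    # this with repartition); score each one, then aggregate.
    parts = [examples[i::num_partitions] for i in range(num_partitions)]
    partials = [score_partition(p, model) for p in parts]
    # Sums commute, so the merge is independent of partition order.
    correct = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return correct / total

echo = lambda p: p.upper()  # toy "model" for the demo
data = [("a", "A"), ("b", "B"), ("c", "X"), ("d", "D")]
print(evaluate(data, echo))  # 3 of 4 match -> 0.75
```

Because the per-partition outputs are just count pairs, the driver-side aggregation stays cheap regardless of dataset size, which is what lets this pattern scale linearly with the cluster.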
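The abstract promises bootstrap confidence intervals on every metric. A minimal percentile-bootstrap sketch, assuming per-example 0/1 scores, looks like the following; the paper's exact procedure (number of resamples, interval type) may differ.

```python
# Percentile bootstrap over per-example scores: resample with replacement,
# recompute the mean, and take empirical quantiles of the resampled means.
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(scores) / n, (lo, hi)

point, (lo, hi) = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
print(point)  # 0.7
```

Reporting the interval rather than the point estimate alone is what makes comparisons at this scale meaningful: two accuracies that differ by less than the interval width are not distinguishable.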
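The abstract matches each metric type to a significance test (paired t-test, McNemar's test, or Wilcoxon signed-rank). For binary per-example correctness, McNemar's test needs only the two discordant counts; here is a minimal version with the standard continuity correction, using the chi-squared(1) tail `p = erfc(sqrt(stat/2))`. This is a generic sketch of the textbook test, not the framework's implementation.

```python
# McNemar's test for comparing two models on the same examples.
# b = examples model A got right and model B got wrong; c = the reverse.
import math

def mcnemar(a_correct, b_correct):
    """a_correct, b_correct: paired 0/1 outcomes for two models."""
    b = sum(1 for x, y in zip(a_correct, b_correct) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(a_correct, b_correct) if x == 0 and y == 1)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected statistic
    # Chi-squared(1) survival function: P(X > stat) = erfc(sqrt(stat / 2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

stat, p = mcnemar([1, 1, 1, 1, 1, 0, 1, 1],
                  [0, 0, 0, 0, 0, 1, 1, 1])
print(stat)  # (|5 - 1| - 1)**2 / 6 = 1.5
```

Note that only the discordant pairs enter the statistic: examples both models get right (or both wrong) carry no information about which model is better, which is exactly why a paired test is appropriate here.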
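The caching idea is also easy to sketch: a content-addressable cache keys each response on a hash of everything that determines it (model id, prompt, decoding parameters), so editing a metric definition never invalidates cached generations. The real system persists entries in Delta Lake; a dict stands in below, and the field names are illustrative assumptions.

```python
# Content-addressable response caching: the key is a digest of the
# generation inputs, so identical requests hit the cache and inference
# is paid for only once per unique (model, prompt, params) triple.
import hashlib
import json

def cache_key(model_id, prompt, params):
    # Canonical JSON (sorted keys) so logically equal requests hash equally.
    payload = json.dumps(
        {"model": model_id, "prompt": prompt, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

cache = {}  # Delta Lake table in the real system

def generate_cached(model_id, prompt, params, generate_fn):
    key = cache_key(model_id, prompt, params)
    if key not in cache:
        cache[key] = generate_fn(prompt)  # only pay for inference on a miss
    return cache[key]

calls = []
def fake_generate(prompt):
    calls.append(prompt)   # count how many real "inference" calls happen
    return prompt[::-1]

r1 = generate_cached("model-x", "abc", {"temperature": 0.0}, fake_generate)
r2 = generate_cached("model-x", "abc", {"temperature": 0.0}, fake_generate)
print(r1, len(calls))  # cba 1  (second call is a cache hit)
```

This is what enables the workflow the abstract highlights: metric code can be rewritten and re-run against cached responses without touching the expensive inference step.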

Comments: 16 pages, 2 figures, 6 tables. Open source: this https URL. Cross-list requested: cs.CL, cs.LG

Subjects:

Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)

ACM classes: I.2.7; H.3.4; D.2.8

Cite as: arXiv:2603.28769 [cs.DC]

(or arXiv:2603.28769v1 [cs.DC] for this version)

https://doi.org/10.48550/arXiv.2603.28769

arXiv-issued DOI via DataCite

Submission history

From: Subhadip Mitra [v1] Sun, 18 Jan 2026 04:34:39 UTC (81 KB)
