Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation
Abstract: Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions of samples. This scale is common when assessing model behavior across diverse domains or conducting comprehensive regression testing. We present Spark-LLM-Eval, a distributed evaluation framework built natively on Apache Spark. The system treats evaluation as a data-parallel problem, partitioning examples across executors and aggregating results with proper statistical accounting. Beyond raw throughput, we emphasize statistical rigor: every reported metric includes bootstrap confidence intervals, and model comparisons come with appropriate significance tests (paired t-tests, McNemar's test, or Wilcoxon signed-rank, depending on the metric type). The framework also addresses the cost problem inherent in LLM evaluation through content-addressable response caching backed by Delta Lake, which allows iterating on metric definitions without re-running inference. We describe the system architecture, the statistical methodology, and report benchmark results showing linear scaling with cluster size. The framework and all evaluation code are available as open source.
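For readers unfamiliar with the techniques named in the abstract, the following is a minimal single-machine sketch in plain Python. It is not taken from the Spark-LLM-Eval code base. The function names (`bootstrap_ci`, `mcnemar_exact`, `cache_key`), the percentile bootstrap, the exact binomial form of McNemar's test, and the SHA-256 key layout are all assumptions chosen for illustration. The framework itself may use different variants and runs on Spark with Delta Lake.

```python
import hashlib
import json
import math
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean per-example score.

    Assumption: a plain percentile bootstrap; the paper may use another variant.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample the examples with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lo, hi


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant-pair counts.

    b: examples model A got right and model B got wrong; c: the reverse.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cache_key(model, prompt, params):
    """Content-addressed key: identical requests always hash to the same key.

    Hypothetical layout; a real cache would key on whatever fully
    determines the model response.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    correct = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # toy per-example accuracy
    print(bootstrap_ci(correct))
    print(mcnemar_exact(b=10, c=0))
    print(cache_key("model-a", "What is 2 + 2?", {"temperature": 0.0}))
```

Because the key depends only on the request content, a changed metric definition can be recomputed from cached responses without new inference calls, which is the cost saving the abstract describes.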
Comments: 16 pages, 2 figures, 6 tables. Open source: this https URL. Cross-list requested: cs.CL, cs.LG
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)
ACM classes: I.2.7; H.3.4; D.2.8
Cite as: arXiv:2603.28769 [cs.DC]
(or arXiv:2603.28769v1 [cs.DC] for this version)
https://doi.org/10.48550/arXiv.2603.28769
arXiv-issued DOI via DataCite
Submission history
From: Subhadip Mitra
[v1] Sun, 18 Jan 2026 04:34:39 UTC (81 KB)