AI News Hub · by Eigenvector

Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation

arXiv cs.CL · by Subhadip Mitra · April 1, 2026

arXiv:2603.28769v1 Announce Type: cross


Abstract: Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions of samples. This scale is common when assessing model behavior across diverse domains or conducting comprehensive regression testing. We present Spark-LLM-Eval, a distributed evaluation framework built natively on Apache Spark. The system treats evaluation as a data-parallel problem, partitioning examples across executors and aggregating results with proper statistical accounting. Beyond raw throughput, we emphasize statistical rigor: every reported metric includes bootstrap confidence intervals, and model comparisons come with appropriate significance tests (paired t-tests, McNemar's test, or Wilcoxon signed-rank, depending on the metric type). The framework also addresses the cost problem inherent in LLM evaluation through content-addressable response caching backed by Delta Lake, which allows iterating on metric definitions without re-running inference. We describe the system architecture and the statistical methodology, and report benchmark results showing linear scaling with cluster size. The framework and all evaluation code are available as open source.
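The abstract's core design idea, evaluation as a data-parallel problem, can be illustrated outside of Spark. The sketch below is a hypothetical plain-Python stand-in for what the framework would run on executors: each partition is scored independently, and only commutative partial counts are merged, so partition order does not affect the result. All names (`score_partition`, `evaluate`) are illustrative, not the paper's API.

```python
# Hypothetical sketch of the data-parallel evaluation pattern described in
# the abstract: partition examples, score each partition independently,
# then merge partial (correct, total) counts. Spark-LLM-Eval does this on
# Spark executors (e.g. via mapPartitions); plain Python stands in here.
from typing import Callable, Iterable, List, Tuple

def score_partition(examples: Iterable[Tuple[str, str]],
                    model: Callable[[str], str]) -> Tuple[int, int]:
    """Return (num_correct, num_total) for one partition."""
    correct = total = 0
    for prompt, reference in examples:
        correct += int(model(prompt) == reference)
        total += 1
    return correct, total

def evaluate(examples: List[Tuple[str, str]],
             model: Callable[[str], str],
             num_partitions: int = 4) -> float:
    # Split the dataset into roughly equal partitions (Spark would do
    # this with repartition); score each one, then aggregate.
    parts = [examples[i::num_partitions] for i in range(num_partitions)]
    partials = [score_partition(p, model) for p in parts]
    # Sums commute, so the merge is independent of partition order.
    correct = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return correct / total

echo = lambda p: p.upper()  # toy "model" for the demo
data = [("a", "A"), ("b", "B"), ("c", "X"), ("d", "D")]
print(evaluate(data, echo))  # 3 of 4 match -> 0.75
```

Because the per-partition outputs are just count pairs, the driver-side aggregation stays cheap regardless of dataset size, which is what lets this pattern scale linearly with the cluster.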
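The abstract promises bootstrap confidence intervals on every metric. A minimal percentile-bootstrap sketch, assuming per-example 0/1 scores, looks like the following; the paper's exact procedure (number of resamples, interval type) may differ.

```python
# Percentile bootstrap over per-example scores: resample with replacement,
# recompute the mean, and take empirical quantiles of the resampled means.
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(scores) / n, (lo, hi)

point, (lo, hi) = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
print(point)  # 0.7
```

Reporting the interval rather than the point estimate alone is what makes comparisons at this scale meaningful: two accuracies that differ by less than the interval width are not distinguishable.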
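The abstract matches each metric type to a significance test (paired t-test, McNemar's test, or Wilcoxon signed-rank). For binary per-example correctness, McNemar's test needs only the two discordant counts; here is a minimal version with the standard continuity correction, using the chi-squared(1) tail `p = erfc(sqrt(stat/2))`. This is a generic sketch of the textbook test, not the framework's implementation.

```python
# McNemar's test for comparing two models on the same examples.
# b = examples model A got right and model B got wrong; c = the reverse.
import math

def mcnemar(a_correct, b_correct):
    """a_correct, b_correct: paired 0/1 outcomes for two models."""
    b = sum(1 for x, y in zip(a_correct, b_correct) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(a_correct, b_correct) if x == 0 and y == 1)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected statistic
    # Chi-squared(1) survival function: P(X > stat) = erfc(sqrt(stat / 2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

stat, p = mcnemar([1, 1, 1, 1, 1, 0, 1, 1],
                  [0, 0, 0, 0, 0, 1, 1, 1])
print(stat)  # (|5 - 1| - 1)**2 / 6 = 1.5
```

Note that only the discordant pairs enter the statistic: examples both models get right (or both wrong) carry no information about which model is better, which is exactly why a paired test is appropriate here.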
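The caching idea is also easy to sketch: a content-addressable cache keys each response on a hash of everything that determines it (model id, prompt, decoding parameters), so editing a metric definition never invalidates cached generations. The real system persists entries in Delta Lake; a dict stands in below, and the field names are illustrative assumptions.

```python
# Content-addressable response caching: the key is a digest of the
# generation inputs, so identical requests hit the cache and inference
# is paid for only once per unique (model, prompt, params) triple.
import hashlib
import json

def cache_key(model_id, prompt, params):
    # Canonical JSON (sorted keys) so logically equal requests hash equally.
    payload = json.dumps(
        {"model": model_id, "prompt": prompt, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

cache = {}  # Delta Lake table in the real system

def generate_cached(model_id, prompt, params, generate_fn):
    key = cache_key(model_id, prompt, params)
    if key not in cache:
        cache[key] = generate_fn(prompt)  # only pay for inference on a miss
    return cache[key]

calls = []
def fake_generate(prompt):
    calls.append(prompt)   # count how many real "inference" calls happen
    return prompt[::-1]

r1 = generate_cached("model-x", "abc", {"temperature": 0.0}, fake_generate)
r2 = generate_cached("model-x", "abc", {"temperature": 0.0}, fake_generate)
print(r1, len(calls))  # cba 1  (second call is a cache hit)
```

This is what enables the workflow the abstract highlights: metric code can be rewritten and re-run against cached responses without touching the expensive inference step.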

Comments: 16 pages, 2 figures, 6 tables. Open source: this https URL. Cross-list requested: cs.CL, cs.LG

Subjects:

Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)

ACM classes: I.2.7; H.3.4; D.2.8

Cite as: arXiv:2603.28769 [cs.DC]

(or arXiv:2603.28769v1 [cs.DC] for this version)

https://doi.org/10.48550/arXiv.2603.28769

arXiv-issued DOI via DataCite

Submission history

From: Subhadip Mitra [v1] Sun, 18 Jan 2026 04:34:39 UTC (81 KB)
