New Google Framework Challenges AI Benchmarking Norms
A new study from Google Research is challenging long-standing assumptions around how AI systems are evaluated, arguing that current benchmarking may be overlooking a critical factor: human disagreement.
In the research, Google researchers outline a new evaluation framework for ML models based on “gold” ratings data. The framework is designed to optimise the balance between the number of items being assessed and the number of human raters assigned to each item, with the aim of improving reproducibility in AI benchmarking.
Reproducibility, a cornerstone of machine learning research, refers to how easily an experiment can be repeated using the same code, data, and parameters to achieve the same results. High reproducibility allows teams to validate findings and build on each other's work. However, the study highlights a persistent challenge: human-labelled data, which underpins many AI evaluation processes, is inherently subjective.
“The challenge with reproducibility is that ground truth data usually relies on humans; and humans, unlike machines, approach all problems from a variety of perspectives and often disagree on the result,” the researchers state. Despite this, much of the industry continues to rely on simplified approaches that collapse multiple human responses into a single “correct” label, often using plurality voting.
According to the paper, this approach masks important variation in human judgement. Two datasets may share the same majority label while reflecting very different levels of agreement among raters, leading to potential distortions in how model performance is assessed.
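To make the concern concrete, the sketch below (a hypothetical example in Python, not data from the study) shows how plurality voting assigns two items the same "gold" label even though raters are far more divided on one of them:

from collections import Counter

def plurality_label(ratings):
    # Collapse a list of rater responses into a single "gold" label.
    return Counter(ratings).most_common(1)[0][0]

def agreement(ratings):
    # Fraction of raters who chose the plurality label.
    label, votes = Counter(ratings).most_common(1)[0]
    return votes / len(ratings)

# Hypothetical ratings: both items collapse to "toxic", but the second
# item is far more contested than the first.
item_a = ["toxic"] * 9 + ["not_toxic"] * 1
item_b = ["toxic"] * 6 + ["not_toxic"] * 4

print(plurality_label(item_a), agreement(item_a))  # toxic 0.9
print(plurality_label(item_b), agreement(item_b))  # toxic 0.6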
Rethinking the ‘forest vs tree’ trade-off
The research, titled Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation, explores whether it is more effective to gather ratings from a small number of individuals across many items, or from a larger group of people assessing fewer items. The authors frame this as a trade-off between breadth and depth.
Historically, AI benchmarking has favoured the “forest” approach, with most studies relying on between one and five raters per item. This is typically assumed to be sufficient to establish a single ground truth. However, the findings suggest that this standard may fall short when it comes to capturing the full range of human perspectives.
“Our research suggests this standard is often insufficient at capturing natural disagreement, and we provide a roadmap for building more reliable and cost efficient AI benchmarks,” the authors explain.
To investigate the issue, the team developed a simulation framework designed to model how different evaluation strategies perform under varying budget constraints. The simulator adjusts two key variables: the number of items being rated and the number of raters assigned to each item. By testing thousands of combinations, the researchers were able to identify configurations that produced statistically reliable and reproducible results.
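The article does not detail the simulator's internals, but the underlying idea can be sketched as a resampling experiment: assume reproducibility is measured by whether two independently drawn groups of raters produce the same plurality labels, and vary the number of items (N) and the number of raters per item (K). The function below is an illustrative simplification, not the authors' released tool:

import random
from collections import Counter

def simulate_reproducibility(pool, n_items, k_raters, trials=200, seed=0):
    # `pool` maps each item to its list of available human ratings.
    # Each trial draws `n_items` items, splits 2 * `k_raters` of their
    # ratings into two disjoint groups, and checks whether the two groups'
    # plurality labels match. Returns the average agreement rate.
    rng = random.Random(seed)
    items = list(pool)
    agreement_rates = []
    for _ in range(trials):
        chosen = rng.sample(items, n_items)
        matches = 0
        for item in chosen:
            ratings = rng.sample(pool[item], 2 * k_raters)
            first, second = ratings[:k_raters], ratings[k_raters:]
            if (Counter(first).most_common(1)[0][0]
                    == Counter(second).most_common(1)[0][0]):
                matches += 1
        agreement_rates.append(matches / n_items)
    return sum(agreement_rates) / trials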
The experiments drew on several real-world datasets, including a toxicity dataset comprising over 107,000 social media comments labelled by more than 17,000 raters, as well as conversational AI safety data, cross-cultural offensiveness datasets, and job-related social media content. These datasets allowed the researchers to test how different evaluation strategies perform across a range of subjective tasks.
The study also examined how “messy” data conditions, such as highly imbalanced datasets or multiple classification categories, influence the optimal balance between scale and depth.
Key findings challenge AI evaluation norms
The findings point to three key conclusions. First, the widely used benchmark of three to five raters per item is often inadequate. “Our results show that the common practice of using 1, 3 or 5 raters per item is often insufficient,” the researchers note, adding that “practitioners often need more than 10 raters per item” to achieve reliable results that reflect human nuance.
Second, the optimal evaluation strategy depends on the metric being measured. For tasks focused on accuracy and majority agreement, increasing the number of items may be more effective. However, for tasks that require capturing the full range of human opinion, increasing the number of raters per item becomes essential.
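A loose way to see that distinction (with hypothetical metric names, not the paper's exact measures): an accuracy-style metric only needs the plurality label for each item, whereas a distribution-style metric needs enough raters per item to estimate how opinion is actually split.

from collections import Counter

def majority_accuracy(model_preds, human_ratings):
    # Share of items where the model's label matches the plurality label.
    # Stabilises quickly with more items, even at low raters per item.
    correct = 0
    for item, pred in model_preds.items():
        gold = Counter(human_ratings[item]).most_common(1)[0][0]
        correct += int(pred == gold)
    return correct / len(model_preds)

def mean_distribution_gap(model_probs, human_ratings, positive="toxic"):
    # Average gap between the model's P(positive) and the observed rate of
    # "positive" votes. Only meaningful once each item has enough raters to
    # estimate that rate, i.e. it rewards depth over breadth.
    gaps = []
    for item, prob in model_probs.items():
        ratings = human_ratings[item]
        human_rate = sum(r == positive for r in ratings) / len(ratings)
        gaps.append(abs(prob - human_rate))
    return sum(gaps) / len(gaps)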
Finally, the research suggests that improved reproducibility does not necessarily require significantly larger budgets. By carefully selecting the ratio between items and raters, the study finds that “one can achieve highly reproducible results with a modest budget of around 1,000 total annotations.” Conversely, poorly chosen configurations can undermine reliability, even with increased spending.
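As a rough illustration of that ratio argument, one could hold the annotation budget near 1,000 labels and sweep (N, K) splits using the simulate_reproducibility sketch above on a toy pool of hypothetical ratings; the pool, the label names, and the exact numbers below are all illustrative assumptions rather than figures from the study.

import random

# Toy pool purely for illustration: 1,000 items, each with 40 hypothetical
# ratings whose level of disagreement varies from item to item.
rng = random.Random(1)
pool = {
    i: ["toxic" if rng.random() < p else "not_toxic" for _ in range(40)]
    for i, p in enumerate(rng.uniform(0.3, 0.9) for _ in range(1000))
}

BUDGET = 1000  # total annotations, matching the study's "modest budget" figure
for k in (1, 2, 5, 10, 20):
    n = BUDGET // k
    score = simulate_reproducibility(pool, n_items=n, k_raters=k)
    print(f"items={n:4d}  raters/item={k:2d}  reproducibility={score:.2f}")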
Moving beyond the ‘single truth’ paradigm
The implications extend beyond technical benchmarking into broader questions about how AI systems interpret and respond to subjective human concepts. The authors argue that the long-standing assumption of a single “correct” label is increasingly untenable, particularly as AI systems are applied to areas such as online safety, ethics, and social interaction.
“For years, the field has operated under a ‘single truth’ paradigm – the idea that for every input, there is one ‘right’ label,” the paper states. “But even when there’s a single ground-truth it may not be possible to measure it.”
Instead, the researchers advocate for a shift towards evaluation methods that explicitly account for disagreement, capturing both consensus and variation in human judgement. “By moving away from the ‘forest’ and embracing the ‘tree’, we can build benchmarks that actually reflect the complexity and different perspectives that lead to the natural disagreement found in the human world,” they conclude.
The research team has also open-sourced its simulation tool on GitHub, with the aim of supporting wider adoption of more robust evaluation practices.
The work was conducted in collaboration with PhD student Deepak Pandita and Professor Christopher Homan at the Rochester Institute of Technology.