New Google Framework Challenges AI Benchmarking Norms
A new study from Google Research is challenging long-standing assumptions around how AI systems are evaluated, arguing that current benchmarking may be overlooking a critical factor: human disagreement.
In the research, Google researchers outline a new evaluation framework for ML models based on “gold” ratings data. The framework is designed to optimise the balance between the number of items being assessed and the number of human raters assigned to each item, with the aim of improving reproducibility in AI benchmarking.
Reproducibility, a cornerstone of machine learning research, refers to how easily an experiment can be repeated using the same code, data, and parameters to achieve the same results. High reproducibility allows teams to validate findings and build on each other's work. However, the study highlights a persistent challenge: human-labelled data, which underpins many AI evaluation processes, is inherently subjective.
“The challenge with reproducibility is that ground truth data usually relies on humans; and humans, unlike machines, approach all problems from a variety of perspectives and often disagree on the result,” the researchers state. Despite this, much of the industry continues to rely on simplified approaches that collapse multiple human responses into a single “correct” label, often using plurality voting.
According to the paper, this approach masks important variation in human judgement. Two datasets may share the same majority label while reflecting very different levels of agreement among raters, leading to potential distortions in how model performance is assessed.
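To make the concern concrete, the sketch below (a hypothetical example in Python, not data from the study) shows how plurality voting assigns two items the same "gold" label even though raters are far more divided on one of them:

from collections import Counter

def plurality_label(ratings):
    # Collapse a list of rater responses into a single "gold" label.
    return Counter(ratings).most_common(1)[0][0]

def agreement(ratings):
    # Fraction of raters who chose the plurality label.
    label, votes = Counter(ratings).most_common(1)[0]
    return votes / len(ratings)

# Hypothetical ratings: both items collapse to "toxic", but the second
# item is far more contested than the first.
item_a = ["toxic"] * 9 + ["not_toxic"] * 1
item_b = ["toxic"] * 6 + ["not_toxic"] * 4

print(plurality_label(item_a), agreement(item_a))  # toxic 0.9
print(plurality_label(item_b), agreement(item_b))  # toxic 0.6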
Rethinking the ‘forest vs tree’ trade-off
The research, titled Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation, explores whether it is more effective to gather ratings from a small number of individuals across many items, or from a larger group of people assessing fewer items. The authors frame this as a trade-off between breadth and depth.
Historically, AI benchmarking has favoured the “forest” approach, with most studies relying on between one and five raters per item. This is typically assumed to be sufficient to establish a single ground truth. However, the findings suggest that this standard may fall short when it comes to capturing the full range of human perspectives.
“Our research suggests this standard is often insufficient at capturing natural disagreement, and we provide a roadmap for building more reliable and cost efficient AI benchmarks,” the authors explain.
To investigate the issue, the team developed a simulation framework designed to model how different evaluation strategies perform under varying budget constraints. The simulator adjusts two key variables: the number of items being rated and the number of raters assigned to each item. By testing thousands of combinations, the researchers were able to identify configurations that produced statistically reliable and reproducible results.
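The article does not detail the simulator's internals, but the underlying idea can be sketched as a resampling experiment: assume reproducibility is measured by whether two independently drawn groups of raters produce the same plurality labels, and vary the number of items (N) and the number of raters per item (K). The function below is an illustrative simplification, not the authors' released tool:

import random
from collections import Counter

def simulate_reproducibility(pool, n_items, k_raters, trials=200, seed=0):
    # `pool` maps each item to its list of available human ratings.
    # Each trial draws `n_items` items, splits 2 * `k_raters` of their
    # ratings into two disjoint groups, and checks whether the two groups'
    # plurality labels match. Returns the average agreement rate.
    rng = random.Random(seed)
    items = list(pool)
    agreement_rates = []
    for _ in range(trials):
        chosen = rng.sample(items, n_items)
        matches = 0
        for item in chosen:
            ratings = rng.sample(pool[item], 2 * k_raters)
            first, second = ratings[:k_raters], ratings[k_raters:]
            if (Counter(first).most_common(1)[0][0]
                    == Counter(second).most_common(1)[0][0]):
                matches += 1
        agreement_rates.append(matches / n_items)
    return sum(agreement_rates) / trials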
The experiments drew on several real-world datasets, including a toxicity dataset comprising over 107,000 social media comments labelled by more than 17,000 raters, as well as conversational AI safety data, cross-cultural offensiveness datasets, and job-related social media content. These datasets allowed the researchers to test how different evaluation strategies perform across a range of subjective tasks.
The study also examined how “messy” data conditions, such as highly imbalanced datasets or multiple classification categories, influence the optimal balance between scale and depth.
Key findings challenge AI evaluation norms
The findings point to three key conclusions. First, the widely used benchmark of three to five raters per item is often inadequate. “Our results show that the common practice of using 1, 3 or 5 raters per item is often insufficient,” the researchers note, adding that “practitioners often need more than 10 raters per item” to achieve reliable results that reflect human nuance.
Second, the optimal evaluation strategy depends on the metric being measured. For tasks focused on accuracy and majority agreement, increasing the number of items may be more effective. However, for tasks that require capturing the full range of human opinion, increasing the number of raters per item becomes essential.
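A loose way to see that distinction (with hypothetical metric names, not the paper's exact measures): an accuracy-style metric only needs the plurality label for each item, whereas a distribution-style metric needs enough raters per item to estimate how opinion is actually split.

from collections import Counter

def majority_accuracy(model_preds, human_ratings):
    # Share of items where the model's label matches the plurality label.
    # Stabilises quickly with more items, even at low raters per item.
    correct = 0
    for item, pred in model_preds.items():
        gold = Counter(human_ratings[item]).most_common(1)[0][0]
        correct += int(pred == gold)
    return correct / len(model_preds)

def mean_distribution_gap(model_probs, human_ratings, positive="toxic"):
    # Average gap between the model's P(positive) and the observed rate of
    # "positive" votes. Only meaningful once each item has enough raters to
    # estimate that rate, i.e. it rewards depth over breadth.
    gaps = []
    for item, prob in model_probs.items():
        ratings = human_ratings[item]
        human_rate = sum(r == positive for r in ratings) / len(ratings)
        gaps.append(abs(prob - human_rate))
    return sum(gaps) / len(gaps)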
Finally, the research suggests that improved reproducibility does not necessarily require significantly larger budgets. By carefully selecting the ratio between items and raters, the study finds that “one can achieve highly reproducible results with a modest budget of around 1,000 total annotations.” Conversely, poorly chosen configurations can undermine reliability, even with increased spending.
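As a rough illustration of that ratio argument, one could hold the annotation budget near 1,000 labels and sweep (N, K) splits using the simulate_reproducibility sketch above on a toy pool of hypothetical ratings; the pool, the label names, and the exact numbers below are all illustrative assumptions rather than figures from the study.

import random

# Toy pool purely for illustration: 1,000 items, each with 40 hypothetical
# ratings whose level of disagreement varies from item to item.
rng = random.Random(1)
pool = {
    i: ["toxic" if rng.random() < p else "not_toxic" for _ in range(40)]
    for i, p in enumerate(rng.uniform(0.3, 0.9) for _ in range(1000))
}

BUDGET = 1000  # total annotations, matching the study's "modest budget" figure
for k in (1, 2, 5, 10, 20):
    n = BUDGET // k
    score = simulate_reproducibility(pool, n_items=n, k_raters=k)
    print(f"items={n:4d}  raters/item={k:2d}  reproducibility={score:.2f}")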
Moving beyond the ‘single truth’ paradigm
The implications extend beyond technical benchmarking into broader questions about how AI systems interpret and respond to subjective human concepts. The authors argue that the long-standing assumption of a single “correct” label is increasingly untenable, particularly as AI systems are applied to areas such as online safety, ethics, and social interaction.
“For years, the field has operated under a ‘single truth’ paradigm – the idea that for every input, there is one ‘right’ label,” the paper states. “But even when there’s a single ground-truth it may not be possible to measure it.”
Instead, the researchers advocate for a shift towards evaluation methods that explicitly account for disagreement, capturing both consensus and variation in human judgement. “By moving away from the ‘forest’ and embracing the ‘tree’, we can build benchmarks that actually reflect the complexity and different perspectives that lead to the natural disagreement found in the human world,” they conclude.
The research team has also open-sourced its simulation tool on GitHub, with the aim of supporting wider adoption of more robust evaluation practices.
The work was conducted in collaboration with PhD student Deepak Pandita and Professor Christopher Homan at the Rochester Institute of Technology.