
AI benchmarks systematically ignore how humans disagree, Google study finds

The Decoder · by Jonathan Kemper · April 5, 2026 · 1 min read

A Google study finds that the standard three to five human raters per test example often aren't enough for reliable AI benchmarks, and that splitting your annotation budget the right way matters just as much as the budget itself.

How many evaluators does a good AI benchmark actually need? New research shows that the standard three to five raters per test example often aren't enough, and that how you allocate your annotation budget matters just as much as how big it is.

When AI models go head-to-head, human evaluations often decide which one comes out on top. Evaluators rate things like whether a comment is toxic or whether a chatbot response is safe.

The problem is that people frequently disagree on these calls. Standard practice in AI research is to collect three to five ratings per example and pick a single "correct" answer by majority vote. That approach systematically throws out the diversity of human opinion.
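A minimal sketch of how majority-vote labeling discards disagreement (the label names and vote counts are illustrative, not from the study):

```python
from collections import Counter

def majority_label(ratings):
    """Collapse per-example ratings into a single label by majority vote."""
    return Counter(ratings).most_common(1)[0][0]

# Two examples with the same majority label but very different agreement:
unanimous = ["toxic", "toxic", "toxic", "toxic", "toxic"]
contested = ["toxic", "toxic", "toxic", "not_toxic", "not_toxic"]

print(majority_label(unanimous))  # "toxic"
print(majority_label(contested))  # "toxic" -- the 2/5 dissent is discarded
```

Both examples enter the benchmark as identical "toxic" rows, even though the second one split the raters 3–2.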

Both comments get the same "Toxic" label by majority vote, even though evaluators in the second case disagree significantly. Standard benchmarks ignore this difference entirely. | Image: Google

Researchers from Google Research and the Rochester Institute of Technology wanted to find a smarter way to spend a limited rating budget. The key question: Is it better to evaluate as many test examples as possible or to have fewer examples rated by a lot more people?

The researchers frame the dilemma with a simple restaurant analogy. Imagine asking 1,000 guests to each sample a single dish: you'd get a broad but shallow snapshot. Now imagine asking 20 diners to rate the same 50 dishes. You'd walk away with a far richer picture of what's actually good and what isn't. Today's AI benchmarks overwhelmingly follow the first model, casting a wide net across test examples while collecting only a thin layer of human judgment for each one.

Stress-testing thousands of budget splits

To find the sweet spot, the team built a simulator that replicates human rating patterns using real datasets. The simulator generates synthetic evaluation data for two models, with one performing worse than the other in a controlled way. This setup makes it possible to test which conditions let you reliably detect the difference between models.
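The setup above can be sketched as a toy simulation. The accuracies, rater-agreement rate, and counts below are illustrative assumptions, not the paper's calibrated parameters:

```python
import random

def majority(votes):
    """True if a strict majority of the boolean votes are True."""
    return sum(votes) * 2 > len(votes)

def run_trial(rng, n_examples, n_raters, acc_a=0.75, acc_b=0.65):
    """One synthetic benchmark run in which model A really is better than B.
    Returns True if the benchmark correctly ranks A above B."""
    score_a = score_b = 0
    for _ in range(n_examples):
        truth = rng.random() < 0.5  # latent label for this example
        # Each simulated rater agrees with the latent label 80% of the time.
        votes = [truth if rng.random() < 0.8 else not truth
                 for _ in range(n_raters)]
        gold = majority(votes)      # the benchmark's majority-vote label
        pred_a = truth if rng.random() < acc_a else not truth
        pred_b = truth if rng.random() < acc_b else not truth
        score_a += pred_a == gold
        score_b += pred_b == gold
    return score_a > score_b

def detection_rate(n_examples, n_raters, trials=300, seed=0):
    """Fraction of repeated benchmark runs that rank the two models correctly."""
    rng = random.Random(seed)
    return sum(run_trial(rng, n_examples, n_raters)
               for _ in range(trials)) / trials

# Same total budget of 1,000 annotations, split two ways:
print(detection_rate(n_examples=200, n_raters=5))   # wide: many examples, few raters
print(detection_rate(n_examples=50,  n_raters=20))  # deep: few examples, many raters
```

Repeating many such runs shows how often a given budget split recovers the true ranking, which is exactly the reliability question the researchers probe at scale.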

When comparing models, both AI systems and human raters evaluate the same text. A metric then determines which model gets closer to the human judgments. | Image: Google

The team calibrated the simulator against five real datasets covering toxicity detection, chatbot safety, and cross-cultural offensiveness assessment. All told, they tested thousands of combinations across different total budgets and rater counts per example.

Fewer than ten raters per example isn't cutting it

The results call current practice into question. The typical three to five raters per test example often aren't enough to make model comparisons reproducible, according to the study. For statistically reliable results that actually capture the range of human opinion, you generally need more than ten raters per example.

More raters per example means more reliable detection of differences between models. The effect is especially strong with smaller budgets. | Image: Google

The experiments also show that reliable results can often be achieved with around 1,000 total annotations, but only if the budget is split correctly between test examples and raters. Get the balance wrong, and you can end up with unreliable conclusions even on a much larger budget, the researchers say.

What you measure should dictate how you spend

The biggest takeaway is that there's no one-size-fits-all ratio. The right strategy depends entirely on what you're trying to measure.

If you're using accuracy—checking whether a model agrees with the evaluators' majority vote—a wide approach works best: as many test examples as possible with just a few raters each. Accuracy only looks at the most common answer, so extra raters barely move the needle.

But if you want to capture the full spread of human responses—using a metric like total variation, for instance—you need the opposite playbook. Fewer test examples, but way more raters per example. That's the only way to map how much evaluators actually agree or disagree.

Different examples can get the same majority-vote label yet have very different response distributions underneath. In the experiments, this distribution-aware metric also needed the smallest overall budget to produce reliable results.
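Total variation distance is a standard way to compare two discrete distributions. A minimal sketch of how a distribution-aware metric separates cases that majority vote conflates (the labels and probabilities are illustrative):

```python
def total_variation(p, q):
    """Total variation distance between two discrete label distributions,
    given as dicts mapping label -> probability. Ranges from 0 to 1."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)

# Same majority label ("toxic"), very different underlying distributions:
unanimous = {"toxic": 1.0}
contested = {"toxic": 0.6, "not_toxic": 0.4}
model     = {"toxic": 0.6, "not_toxic": 0.4}  # a model's predicted distribution

print(total_variation(unanimous, model))  # 0.4 -- model misses the consensus
print(total_variation(contested, model))  # 0.0 -- model matches the split exactly
```

Accuracy against the majority vote would score the model identically on both examples; total variation rewards it for reproducing how much humans actually disagreed, which is why estimating it demands more raters per example.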
