Live
Black Hat USADark ReadingBlack Hat AsiaAI BusinessJPMorgan CEO Jamie Dimon in annual letter cites risks in geopolitics, AI and private marketsCNBC TechnologyAI shutdown controls may not work as expected, new study suggests - ComputerworldGoogle News: Generative AIOpenAI Advocates Electric Grid, Safety Net Spending for New AI EraBloomberg Technology27 questions to ask when choosing an LLM - InfoWorldGoogle News: LLMJapan, driven by labor shortages, is increasingly adopting robotics and physical AI, with a hybrid model where startups innovate and corporations provide scale (Kate Park/TechCrunch)TechmemeAnthropic tells OpenClaw users to pay up - The Rundown AIGoogle News: ClaudeDeepSeek V4 points to growing use of Huawei chips in AI models - Tech Wire AsiaGoogle News: Generative AIANALYSIS: Q1 IPOs ‘Forge’ Ahead as OpenAI, SpaceX Look to Debuts - Bloomberg Law NewsGoogle News: OpenAINew track in artificial intelligence added to Arkansas Tech University curriculum - River Valley Democrat-GazetteGoogle News: AIDeepMind Calls for New Safeguards Against AI Agent Exploitation - The420.inGoogle News: DeepMindChatGPT web service hit by brief disruption, OpenAI investigates - news.cgtn.comGoogle News: ChatGPTAgile Robots and Google DeepMind Partner on AI-Driven Industrial Robotics - ARC AdvisoryGoogle News: DeepMindBlack Hat USADark ReadingBlack Hat AsiaAI BusinessJPMorgan CEO Jamie Dimon in annual letter cites risks in geopolitics, AI and private marketsCNBC TechnologyAI shutdown controls may not work as expected, new study suggests - ComputerworldGoogle News: Generative AIOpenAI Advocates Electric Grid, Safety Net Spending for New AI EraBloomberg Technology27 questions to ask when choosing an LLM - InfoWorldGoogle News: LLMJapan, driven by labor shortages, is increasingly adopting robotics and physical AI, with a hybrid model where startups innovate and corporations provide scale (Kate Park/TechCrunch)TechmemeAnthropic tells OpenClaw users to pay up - The Rundown AIGoogle News: ClaudeDeepSeek V4 points to growing use of Huawei chips in AI models - Tech Wire AsiaGoogle News: Generative AIANALYSIS: Q1 IPOs ‘Forge’ Ahead as OpenAI, SpaceX Look to Debuts - Bloomberg Law NewsGoogle News: OpenAINew track in artificial intelligence added to Arkansas Tech University curriculum - River Valley Democrat-GazetteGoogle News: AIDeepMind Calls for New Safeguards Against AI Agent Exploitation - The420.inGoogle News: DeepMindChatGPT web service hit by brief disruption, OpenAI investigates - news.cgtn.comGoogle News: ChatGPTAgile Robots and Google DeepMind Partner on AI-Driven Industrial Robotics - ARC AdvisoryGoogle News: DeepMind
AI NEWS HUBbyEIGENVECTOREigenvector

Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite

arXiv cs.IRby Klaudia Thellmann, Bernhard Stadler, Michael F\"arberApril 3, 20261 min read0 views
Source Quiz

arXiv:2604.01957v1 Announce Type: cross Abstract: Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datase

View PDF HTML (experimental)

Abstract:Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review -- complementing, not replacing, human gold standards.

Comments: Accepted at LREC 2026

Subjects:

Computation and Language (cs.CL); Information Retrieval (cs.IR)

Cite as: arXiv:2604.01957 [cs.CL]

(or arXiv:2604.01957v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.01957

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Michael Färber [view email] [v1] Thu, 2 Apr 2026 12:20:16 UTC (456 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

benchmarkreleaseannounce

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Diagnosing …benchmarkreleaseannounceversionservicetrendarXiv cs.IR

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 225 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Models