
Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite

arXiv cs.IR | by Klaudia Thellmann, Bernhard Stadler, Michael Färber | April 3, 2026



Abstract: Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review, complementing rather than replacing human gold standards.
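Step (i) of the approach, the structural corpus audit, might look roughly like the sketch below. It is illustrative only and is not the authors' released code: the `Item` schema, the function names `audit_pair` and `audit_corpus`, and the specific checks are all assumptions, chosen for a generic multiple-choice benchmark item.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Item:
    """One multiple-choice benchmark item (hypothetical schema, not EU20's)."""
    question: str
    choices: List[str]
    answer_index: int


def audit_pair(source: Item, translated: Item) -> List[str]:
    """Return the structural issues found in a translated item (empty = clean)."""
    issues = []
    if not translated.question.strip():
        issues.append("empty_question")
    if len(translated.choices) != len(source.choices):
        issues.append("choice_count_mismatch")
    if any(not choice.strip() for choice in translated.choices):
        issues.append("empty_choice")
    if translated.answer_index != source.answer_index:
        issues.append("answer_index_changed")
    if not 0 <= translated.answer_index < len(translated.choices):
        issues.append("answer_index_out_of_range")
    return issues


def audit_corpus(pairs: Iterable[Tuple[Item, Item]]) -> Dict[str, int]:
    """Count each issue type across (source, translated) item pairs."""
    counts: Dict[str, int] = {}
    for source, translated in pairs:
        for issue in audit_pair(source, translated):
            counts[issue] = counts.get(issue, 0) + 1
    return counts
```

Counts like these only flag loss of structure. They say nothing about translation accuracy, which the abstract assigns to the COMET profiling and the span-level error analysis.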

Comments: Accepted at LREC 2026

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Cite as: arXiv:2604.01957 [cs.CL]

(or arXiv:2604.01957v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.01957

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Michael Färber
[v1] Thu, 2 Apr 2026 12:20:16 UTC (456 KB)

Original source

arXiv cs.IR

https://arxiv.org/abs/2604.01957
