
English to Central Kurdish Speech Translation: Corpus Creation, Evaluation, and Orthographic Standardization

arXiv cs.CL · [Submitted on 1 Apr 2026]


Abstract: We present KUTED, a speech-to-text translation (S2TT) dataset for Central Kurdish, derived from TED and TEDx talks. The corpus comprises 91,000 sentence pairs, including 170 hours of English audio, 1.65 million English tokens, and 1.40 million Central Kurdish tokens. We evaluate KUTED on the S2TT task and find that orthographic variation significantly degrades Kurdish translation performance, producing nonstandard outputs. To address this, we propose a systematic text standardization approach that yields substantial performance gains and more consistent translations. On a test set separated from TED talks, a fine-tuned Seamless model achieves 15.18 BLEU, and we improve the Seamless baseline by 3.0 BLEU on the FLEURS benchmark. We also train a Transformer model from scratch and evaluate a cascaded system that combines Seamless (ASR) with NLLB (MT).
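The abstract does not spell out the standardization rules, but Central Kurdish (Sorani) text commonly mixes visually similar Arabic-script code points (e.g. Arabic kaf vs. Kurdish keheh, Arabic yeh vs. Farsi yeh). A minimal sketch of character-level normalization under that assumption, with an illustrative (not the paper's) mapping table:

```python
# Illustrative character-level orthographic standardization for Central
# Kurdish (Sorani). The paper does not publish its exact rules; the
# mappings below are common Arabic-script normalizations and are
# assumptions, not the authors' method.

NORMALIZATION_MAP = {
    "\u0643": "\u06A9",  # Arabic KAF          -> Kurdish KEHEH (ك -> ک)
    "\u064A": "\u06CC",  # Arabic YEH          -> Farsi YEH     (ي -> ی)
    "\u0649": "\u06CC",  # Arabic ALEF MAKSURA -> Farsi YEH     (ى -> ی)
    "\u0629": "\u06D5",  # Arabic TEH MARBUTA  -> Kurdish AE    (ة -> ە)
    "\u0640": "",        # TATWEEL (kashida) removed
}

def standardize(text: str) -> str:
    """Map nonstandard Arabic-script variants to standard Sorani forms."""
    return text.translate(str.maketrans(NORMALIZATION_MAP))
```

Applied over both the training references and model outputs, this kind of normalization collapses spelling variants of the same word into one form, which is one plausible route to the more consistent translations the abstract reports.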

Subjects:

Computation and Language (cs.CL)

Cite as: arXiv:2604.00613 [cs.CL]

(or arXiv:2604.00613v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.00613

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Daban Q. Jaff [v1] Wed, 1 Apr 2026 08:14:25 UTC (134 KB)
