
The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

arXiv cs.CL | by Hillary Mutisya, John Mugane, Gavin Nyamboga, Brian Chege, Maryruth Gathoni | April 1, 2026



Abstract: We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings across nine languages, collected through a dedicated community data collection platform involving over 100 contributors. The Thiomi platform collected data for nine languages; Swahili data was supplemented with existing Common Voice recordings. A multi-tier quality assurance pipeline achieves 86-100% text approval rates for the six primary languages. To validate the dataset's utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all ten languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), reducing prior academic SOTA from 8.3% to 3.24% (5.1 percentage point absolute, 61% relative reduction), and 4.3% WER on Somali. The dataset will be published on HuggingFace. We describe the collection platform, quality assurance workflows, and baseline experiments, and discuss implications for African language technology infrastructure.
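The WER figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' evaluation code: `word_error_rate` and `wer_reduction` are hypothetical helper names, the WER definition used is the standard word-level edit distance divided by reference length, and the abstract does not say which text normalization or scoring tool the authors applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: whitespace tokenization with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(baseline: float, new: float) -> tuple[float, float]:
    """Return (absolute reduction in points, relative reduction as a fraction)."""
    absolute = baseline - new
    relative = absolute / baseline
    return absolute, relative


# Figures quoted in the abstract for Swahili (Common Voice), in percent.
absolute, relative = wer_reduction(8.3, 3.24)
print(f"{absolute:.2f} points absolute, {relative:.0%} relative")
```

The absolute drop of 5.06 points rounds to the 5.1 quoted in the abstract, and 5.06 / 8.3 gives the 61% relative reduction.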

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as: arXiv:2603.29244 [cs.CL]

(or arXiv:2603.29244v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.29244

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Hillary Mutisya
[v1] Tue, 31 Mar 2026 04:14:41 UTC (13 KB)
