The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

arXiv cs.CL | by Hillary Mutisya, John Mugane, Gavin Nyamboga, Brian Chege, Maryruth Gathoni | April 1, 2026

Abstract: We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings across nine languages, collected through a dedicated community data collection platform involving over 100 contributors. The Thiomi platform collected data for nine languages; Swahili data was supplemented with existing Common Voice recordings. A multi-tier quality assurance pipeline achieves 86-100% text approval rates for the six primary languages. To validate the dataset's utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all ten languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), reducing the prior academic SOTA from 8.3% to 3.24% (a 5.1 percentage point absolute and 61% relative reduction), and 4.3% WER on Somali. The dataset will be published on HuggingFace. We describe the collection platform, quality assurance workflows, and baseline experiments, and discuss implications for African language technology infrastructure.
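The WER figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' evaluation code: it assumes the standard definition of word error rate (word-level edit distance divided by the number of reference words), and the function names `word_error_rate` and `wer_reduction` are hypothetical. The only values taken from the abstract are the two Swahili WER numbers, 8.3% and 3.24%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes whitespace tokenization; real evaluations usually also
    normalize case and punctuation, which is not done here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(baseline_wer: float, new_wer: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction as a fraction)."""
    absolute = baseline_wer - new_wer
    relative = absolute / baseline_wer
    return absolute, relative


# Swahili figures from the abstract: prior SOTA 8.3% WER, new system 3.24% WER.
absolute, relative = wer_reduction(8.3, 3.24)
print(f"{absolute:.1f} points absolute, {relative:.0%} relative")
```

With the abstract's numbers, the absolute reduction is 8.3 - 3.24 = 5.06 points, which rounds to the reported 5.1, and the relative reduction is 5.06 / 8.3, roughly 0.61, matching the reported 61%.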

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as: arXiv:2603.29244 [cs.CL]

(or arXiv:2603.29244v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.29244

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Hillary Mutisya
[v1] Tue, 31 Mar 2026 04:14:41 UTC (13 KB)

Original source

arXiv cs.CL

https://arxiv.org/abs/2603.29244
