GhanaNLP Parallel Corpora: Comprehensive Multilingual Resources for Low-Resource Ghanaian Languages
arXiv:2603.13793v2 Announce Type: replace-cross Abstract: Low resource languages present unique challenges for natural language processing due to the limited availability of digitized and well structured linguistic data. To address this gap, the GhanaNLP initiative has developed and curated 41,513 parallel sentence pairs for the Twi, Fante, Ewe, Ga, and Kusaal languages, which are widely spoken across Ghana yet remain underrepresented in digital spaces. Each dataset consists of carefully aligned sentence pairs between a local language and English. The data were collected, translated, and annot — Lawrence Adu Gyamfi, Paul Azunre, Stephen Edward Moore, Joel Budu, Akwasi Asare, Mich-Seth Owusu, Jonathan Ofori Asiamah
View PDF
Abstract:Low resource languages present unique challenges for natural language processing due to the limited availability of digitized and well structured linguistic data. To address this gap, the GhanaNLP initiative has developed and curated 41,513 parallel sentence pairs for the Twi, Fante, Ewe, Ga, and Kusaal languages, which are widely spoken across Ghana yet remain underrepresented in digital spaces. Each dataset consists of carefully aligned sentence pairs between a local language and English. The data were collected, translated, and annotated by human professionals and enriched with standard structural metadata to ensure consistency and usability. These corpora are designed to support research, educational, and commercial applications, including machine translation, speech technologies, and language preservation. This paper documents the dataset creation methodology, structure, intended use cases, and evaluation, as well as their deployment in real world applications such as the Khaya AI translation engine. Overall, this work contributes to broader efforts to democratize AI by enabling inclusive and accessible language technologies for African languages.
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.13793 [cs.CL]
(or arXiv:2603.13793v2 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.13793
arXiv-issued DOI via DataCite
Submission history
From: Akwasi Asare [view email] [v1] Sat, 14 Mar 2026 06:49:05 UTC (2,097 KB) [v2] Mon, 30 Mar 2026 15:19:36 UTC (2,097 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
researchpaperarxivWhy do I believe preserving structure is enough?
There's a lot even our best neuroscientists don't know about the human brain. How can we have any reasonable hope for preservation given those unknowns? What if there are crucial memory mechanisms that are so poorly understood, we don't even know to check whether our methods preserve them? As it turns out, there's some interesting empirical evidence about the general shape , and limits, of those unknowns. In Ted Chiang's short story Exhalation , a race of aliens have brains which run on compressed air, performing computations and storing information in elaborate arrangements of hinged gold-foil leaves. The leaves are held in position by a constant stream of air flowing through the brain's tubules, encoding alien thoughts and memories. That ephemeral suspension pattern is the whole self—any
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Research Papers

Label-free pathological subtyping of non-small cell lung cancer using deep classification and virtual immunohistochemical staining
npj Digital Medicine, Published online: 03 April 2026; doi:10.1038/s41746-026-02557-x Label-free pathological subtyping of non-small cell lung cancer using deep classification and virtual immunohistochemical staining





Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!