Separate Before You Compress: The WWHO Tokenization Architecture
Kusal Darshana
Abstract: Current Large Language Models (LLMs) mostly use tokenizers based on Byte Pair Encoding (BPE), which are very effective for structurally simple Latin-script languages such as English. However, standard BPE tokenizers struggle with Abugida scripts because of their structural complexity. These tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic orthographic structures at inference time, and it raises inference costs, resulting in a significant "Token Tax" for the Global South. We propose a new three-layer architecture, WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated it on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token-to-Word Ratio (TWR) of 1.274 with 4.83 characters per token, a 61.7 percent reduction in tokens compared to OpenAI's o200k_base. For Hindi, it achieves a TWR of 1.181, a 27.0 percent reduction relative to o200k_base. On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, representing token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k_base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while providing a Linguistic Zero-Breakage Guarantee: no valid syllable is ever split across multiple tokens.
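The metrics quoted in the abstract are related by simple arithmetic. The following is a minimal sketch, assuming that TWR is the token count divided by the word count and that a "reduction" compares token counts from two tokenizers on the same text. The function names are illustrative and are not taken from the paper's source code.

```python
def token_to_word_ratio(num_tokens: int, num_words: int) -> float:
    """Average number of tokens emitted per word (assumed TWR definition)."""
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return num_tokens / num_words


def token_reduction(baseline_tokens: int, new_tokens: int) -> float:
    """Fraction of tokens saved relative to a baseline tokenizer."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - new_tokens / baseline_tokens


def context_multiplier(reduction: float) -> float:
    """How much more text fits in a fixed token budget after a reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # A 61.7 percent reduction expressed as a context-window multiplier.
    print(round(context_multiplier(0.617), 2))  # 2.61
    # The reduction implied by a 4.38x multiplier under the same formula.
    print(round(1.0 - 1.0 / 4.38, 3))  # 0.772
```

Under these assumed definitions, the 61.7 percent Sinhala reduction corresponds to about 2.61 times as much text per context window, and a 4.38-times extension corresponds to a reduction of about 77.2 percent. The abstract does not say which baseline yields the 4.38 figure.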
Comments: 17 pages, 1 figure, 8 tables. Tokenization Architecture including formal DFA definitions and regular expressions for Sinhala and Devanagari syllabification. Evaluation includes comparisons with OpenAI o200k-base, Llama-4-Scout, and DeepSeek-V3. Source code and datasets: this https URL
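To illustrate what regular-expression syllabification can look like, here is a deliberately simplified sketch for Devanagari. It is not the paper's regular expression or DFA: the character classes and the `syllabify` helper are assumptions made for illustration, Sinhala is not covered, and ZWJ/ZWNJ, Vedic signs, and extended vowel signs are ignored. The idea is that text is first cut into orthographic syllables, so that any later pair-merging step only ever joins whole syllables.

```python
import re

# Simplified Devanagari character classes (illustrative, not exhaustive).
CONS = "\u0915-\u0939\u0958-\u095F"   # consonants, incl. precomposed nukta forms
NUKTA = "\u093C"                      # nukta sign
HALANT = "\u094D"                     # virama, joins consonants into conjuncts
MATRA = "\u093E-\u094C"               # dependent vowel signs
MODS = "\u0901-\u0903"                # candrabindu, anusvara, visarga
IVOWEL = "\u0904-\u0914"              # independent vowels

SYLLABLE = re.compile(
    # Zero or more "consonant + halant" links, then a base consonant,
    # then an optional trailing halant or vowel sign, then modifiers.
    f"(?:[{CONS}]{NUKTA}?{HALANT})*[{CONS}]{NUKTA}?"
    f"(?:{HALANT}|[{MATRA}])?[{MODS}]*"
    # An independent vowel with optional modifiers.
    f"|[{IVOWEL}][{MODS}]*"
    # Anything else (Latin letters, digits, spaces) passes through one
    # character at a time, so no input is ever dropped.
    "|.",
    re.DOTALL,
)


def syllabify(text: str) -> list[str]:
    """Split text into simplified orthographic syllables."""
    return SYLLABLE.findall(text)


if __name__ == "__main__":
    word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"  # kshatriya
    units = syllabify(word)
    print(len(units))                 # 3
    print("".join(units) == word)     # True
```

Because every alternative consumes at least one character and the final branch matches any character, concatenating the output always reproduces the input. The conjunct at the start of the example word stays in a single unit instead of being split at the halant.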
Subjects: Computation and Language (cs.CL)
MSC classes: 68Q45, 68T50, 68P30, 68W32
ACM classes: F.4; F.2.2; I.2.7
Cite as: arXiv:2603.25309 [cs.CL]
(or arXiv:2603.25309v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.25309
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Kusal Darshana
[v1] Thu, 26 Mar 2026 10:56:48 UTC (1,400 KB)