
The Degree of Language Diacriticity and Its Effect on Tasks

arXiv, March 31, 2026

Authors: Adi Cohen, Yuval Pinter


Abstract: Diacritics are orthographic marks that clarify pronunciation, distinguish similar words, or alter meaning. They play a central role in many writing systems, yet their impact on language technology has not been systematically quantified across scripts. While prior work has examined diacritics in individual languages, there's no cross-linguistic, data-driven framework for measuring the degree to which writing systems rely on them and how this affects downstream tasks. We propose a data-driven framework for quantifying diacritic complexity using corpus-level, information-theoretic metrics that capture the frequency, ambiguity, and structural diversity of character-diacritic combinations. We compute these metrics over 24 corpora in 15 languages, spanning both single- and multi-diacritic scripts. We then examine how diacritic complexity correlates with performance on the task of diacritics restoration, evaluating BERT- and RNN-based models. We find that across languages, higher diacritic complexity is strongly associated with lower restoration accuracy. In single-diacritic scripts, where character-diacritic combinations are more predictable, frequency-based and structural measures largely align. In multi-diacritic scripts, however, structural complexity exhibits the strongest association with performance, surpassing frequency-based measures. These findings show that measurable properties of diacritic usage influence the performance of diacritic restoration models, demonstrating that orthographic complexity is not only descriptive but functionally relevant for modeling.
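The abstract does not spell out the metrics themselves, so the following is only a rough sketch of what corpus-level measures of frequency, ambiguity, and structural diversity could look like. It assumes text is decomposed with Unicode NFD and that every character in a Mark category counts as a diacritic. The function names (`split_clusters`, `diacritic_stats`) and the three statistics are illustrative assumptions, not the authors' definitions.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def split_clusters(text):
    """Yield (base, marks) pairs from NFD-decomposed text.

    `marks` is a tuple of the combining marks attached to `base`.
    Assumption: any character whose Unicode category starts with "M"
    is treated as a diacritic.
    """
    base, marks = None, []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):
            if base is not None:
                marks.append(ch)
        else:
            if base is not None:
                yield base, tuple(marks)
            base, marks = ch, []
    if base is not None:
        yield base, tuple(marks)


def diacritic_stats(text):
    """Hypothetical corpus-level statistics (not the paper's metrics).

    rate: share of alphabetic base characters carrying at least one mark.
    cond_entropy_bits: H(marks | base), a proxy for ambiguity.
    distinct_combinations: number of distinct marked (base, marks) pairs,
        a proxy for structural diversity.
    """
    per_base = defaultdict(Counter)
    for base, marks in split_clusters(text):
        if base.isalpha():
            per_base[base][marks] += 1

    total = sum(sum(c.values()) for c in per_base.values())
    if total == 0:
        return {"rate": 0.0, "cond_entropy_bits": 0.0,
                "distinct_combinations": 0}

    marked = sum(n for c in per_base.values() for m, n in c.items() if m)

    entropy = 0.0
    for counts in per_base.values():
        n_base = sum(counts.values())
        h = -sum((n / n_base) * math.log2(n / n_base)
                 for n in counts.values())
        entropy += (n_base / total) * h  # weight by base frequency

    distinct = sum(1 for c in per_base.values() for m in c if m)
    return {"rate": marked / total, "cond_entropy_bits": entropy,
            "distinct_combinations": distinct}
```

Calling `diacritic_stats` on a corpus string returns the three numbers as a dictionary. Under these assumptions, a script in which each base letter takes at most one predictable mark would score low on the entropy term, while a script that stacks several marks per letter would tend to yield more distinct combinations.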

Comments: Accepted to CAWL 2026

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2603.27653 [cs.CL] (or arXiv:2603.27653v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.27653

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Adi Cohen
[v1] Sun, 29 Mar 2026 12:04:38 UTC (33 KB)
