Goodness-of-pronunciation without phoneme time alignment
Jeremy H. M. Wong, Nancy F. Chen
Abstract: In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, which hinders feature extraction for speech evaluation. This paper proposes ways to overcome these incompatibilities when extracting features with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Word-level, rather than phoneme-level, speaking rate and duration are used. Phoneme and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. The proposed approach performs comparably with standard frame-synchronous features on the English speechocean762 dataset and on low-resource Tamil datasets.
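As an illustration of how cross-attention can pool frame-level features per phoneme without time boundaries, here is a minimal sketch. It is not the authors' architecture: the function names, the toy vectors, and the absence of learned query, key, and value projections are all simplifying assumptions made for this example. Each phoneme vector acts as a query over every frame, so no phoneme-to-frame alignment is needed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(phoneme_queries, frame_features):
    """Pool frame features for each phoneme query (hypothetical helper).

    Frames serve as both keys and values; a real model would apply
    learned projections first. Returns the pooled vectors and the
    attention weights, one row per phoneme.
    """
    dim = len(frame_features[0])
    scale = 1.0 / math.sqrt(dim)  # scaled dot-product attention
    pooled, all_weights = [], []
    for query in phoneme_queries:
        scores = [
            scale * sum(q * f for q, f in zip(query, frame))
            for frame in frame_features
        ]
        weights = softmax(scores)
        pooled.append([
            sum(w * frame[d] for w, frame in zip(weights, frame_features))
            for d in range(dim)
        ])
        all_weights.append(weights)
    return pooled, all_weights


# Toy data (assumed): two phoneme embeddings, three acoustic frames.
phoneme_queries = [[1.0, 0.0], [0.0, 1.0]]
frame_features = [[4.0, 0.0], [3.0, 1.0], [0.0, 5.0]]
pooled, weights = cross_attention(phoneme_queries, frame_features)
```

In this toy case the first phoneme puts most of its weight on the first frames and the second phoneme on the last frame, which stands in for a soft alignment learned from data rather than one imposed by a forced aligner.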
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Cite as: arXiv:2603.25150 [cs.CL]
(or arXiv:2603.25150v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.25150
arXiv-issued DOI via DataCite
Submission history
From: Jeremy Heng Meng Wong
[v1] Thu, 26 Mar 2026 08:12:19 UTC (199 KB)