LLM Probe: Evaluating LLMs for Low-Resource Languages
Abstract: Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized evaluation frameworks. This paper presents LLM Probe, a lexicon-based assessment framework designed to systematically evaluate the linguistic skills of LLMs in low-resource language environments. The framework analyzes models across four areas of language understanding: lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy. To illustrate the framework, we create a manually annotated benchmark dataset using a low-resource Semitic language as a case study. The dataset comprises bilingual lexicons with linguistic annotations, including part-of-speech tags, grammatical gender, and morphosyntactic features, which demonstrate high inter-annotator agreement to ensure reliable annotations. We test a variety of models, including causal language models and sequence-to-sequence architectures. The results reveal notable differences in performance across linguistic tasks: sequence-to-sequence models generally excel in morphosyntactic analysis and translation quality, whereas causal models demonstrate strong performance in lexical alignment but exhibit weaker translation accuracy. Our results emphasize the need for linguistically grounded evaluation to better understand LLM limitations in low-resource settings. We release LLM Probe and the accompanying benchmark dataset as open-source tools to promote reproducible benchmarking and to support the development of more inclusive multilingual language technologies.
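The abstract reports high inter-annotator agreement on the benchmark's part-of-speech and morphosyntactic annotations. The paper does not specify which agreement statistic is used; a common choice for two annotators is Cohen's kappa, sketched below on hypothetical POS labels (the `cohens_kappa` function and the example annotations are illustrative, not from the paper).

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa between two annotators' label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is agreement expected by chance from the label frequencies.
    """
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical POS annotations for ten lexicon entries from two annotators.
a = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB", "NOUN", "NOUN", "ADJ", "VERB"]
b = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "NOUN", "NOUN", "NOUN", "ADJ", "VERB"]
print(round(cohens_kappa(a, b), 3))  # → 0.833
```

Values above roughly 0.8 are conventionally read as near-perfect agreement, which is the kind of threshold a claim of "reliable annotations" typically rests on.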
Comments: 11 pages, 6 tables
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2603.29517 [cs.CL]
(or arXiv:2603.29517v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.29517
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Hailay Kidu Teklehaymanot [view email] [v1] Tue, 31 Mar 2026 10:03:38 UTC (104 KB)