LLM Probe: Evaluating LLMs for Low-Resource Languages
Abstract: Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized evaluation frameworks. This paper presents LLM Probe, a lexicon-based assessment framework designed to systematically evaluate the linguistic skills of LLMs in low-resource language environments. The framework analyzes models across four areas of language understanding: lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy. To illustrate the framework, we create a manually annotated benchmark dataset using a low-resource Semitic language as a case study. The dataset comprises bilingual lexicons with linguistic annotations, including part-of-speech tags, grammatical gender, and morphosyntactic features, which demonstrate high inter-annotator agreement to ensure reliable annotations. We test a variety of models, including causal language models and sequence-to-sequence architectures. The results reveal notable differences in performance across linguistic tasks: sequence-to-sequence models generally excel in morphosyntactic analysis and translation quality, whereas causal models demonstrate strong performance in lexical alignment but exhibit weaker translation accuracy. Our results emphasize the need for linguistically grounded evaluation to better understand LLM limitations in low-resource settings. We release LLM Probe and the accompanying benchmark dataset as open-source tools to promote reproducible benchmarking and to support the development of more inclusive multilingual language technologies.
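The abstract reports high inter-annotator agreement on the benchmark's part-of-speech and morphosyntactic annotations. The paper does not specify which agreement statistic is used; a common choice for two annotators is Cohen's kappa, sketched below on hypothetical POS labels (the `cohens_kappa` function and the example annotations are illustrative, not from the paper).

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa between two annotators' label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is agreement expected by chance from the label frequencies.
    """
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical POS annotations for ten lexicon entries from two annotators.
a = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB", "NOUN", "NOUN", "ADJ", "VERB"]
b = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "NOUN", "NOUN", "NOUN", "ADJ", "VERB"]
print(round(cohens_kappa(a, b), 3))  # → 0.833
```

Values above roughly 0.8 are conventionally read as near-perfect agreement, which is the kind of threshold a claim of "reliable annotations" typically rests on.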
Comments: 11 pages, 6 tables
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2603.29517 [cs.CL]
(or arXiv:2603.29517v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.29517
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Hailay Kidu Teklehaymanot [view email] [v1] Tue, 31 Mar 2026 10:03:38 UTC (104 KB)