
LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

arXiv, March 31, 2026

arXiv:2510.24434v2 Announce Type: replace

Authors: Julian Valline, Cedric Lothritz, Siwen Guo, Jordi Cabot


Abstract: The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its demonstrated proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach, retaining 227,507 high-quality instruction-answer pairs. To investigate the practical utility of the dataset, we fine-tune 14 smaller-scale LLMs ($\leq$15B parameters) on LuxIT and evaluate them on standardized Luxembourgish proficiency exams and five downstream NLP tasks. Training on LuxIT yields a mean accuracy change of +5.37 percentage points on language exams across all 14 models, with 12 of 14 showing improvement. On NLP downstream tasks, 9 of 14 models improve in macro-averaged F1, though gains on the two benchmarks do not systematically correlate. These results underscore the feasibility of leveraging monolingual synthetic data to improve LLM capabilities in low-resource languages, while highlighting the multi-faceted nature of language proficiency.
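The two summary metrics named in the abstract, mean accuracy change in percentage points and macro-averaged F1, can be illustrated with a minimal Python sketch. The function names `macro_f1` and `mean_pp_change` and all input values are hypothetical and chosen only for illustration; they are not the authors' evaluation code or results.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_pp_change(base_acc, tuned_acc):
    """Mean accuracy change in percentage points, plus count of improved models.

    Accuracies are fractions in [0, 1], one entry per model.
    """
    deltas = [(t - b) * 100 for b, t in zip(base_acc, tuned_acc)]
    improved = sum(1 for d in deltas if d > 0)
    return sum(deltas) / len(deltas), improved


# Toy inputs (illustrative only, not taken from the paper).
toy_f1 = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
toy_mean, toy_improved = mean_pp_change([0.50, 0.60, 0.70], [0.55, 0.58, 0.80])
print(f"macro F1: {toy_f1:.3f}")
print(f"mean change: {toy_mean:+.2f} pp, improved: {toy_improved} of 3")
```

Because macro averaging weights every class equally, a model can gain accuracy on one benchmark while its macro F1 on another stays flat or drops, which is consistent with the abstract's remark that gains on the two benchmarks do not systematically correlate.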

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2510.24434 [cs.CL]

(or arXiv:2510.24434v2 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2510.24434

arXiv-issued DOI via DataCite

Submission history

From: Julian Valline
[v1] Tue, 28 Oct 2025 14:02:55 UTC (157 KB)
[v2] Mon, 30 Mar 2026 15:27:32 UTC (2,607 KB)
