AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese

arXiv · March 30, 2026

Authors: Afonso Simplício, Gonçalo Vinagre, Miguel Moura Ramos, Diogo Tavares, Rafael Ferreira, Giuseppe Attanasio, Duarte M. Alves, Inês Calvo, Inês Vieira, Rui Guerra, James Furtado, Beatriz Canaverde, Iago Paulo, Vasco Ramos, Diogo Glória-Silva, Miguel Faria, Marcos Treviso, Daniel Gomes, Pedro Gomes, David Semedo, André Martins, João Magalhães

Abstract: Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant's linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.

Comments: PROPOR 2026 - The 17th International Conference on Computational Processing of Portuguese

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

ACM classes: I.2.7

Cite as: arXiv:2603.26511 [cs.CL]

(or arXiv:2603.26511v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.26511

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Rafael Ferreira
[v1] Fri, 27 Mar 2026 15:22:33 UTC (137 KB)
