
Approaches to Analysing Historical Newspapers Using LLMs

arXiv | March 30, 2026 | 10 min read

arXiv:2603.25051v2 Announce Type: replace

Authors: Filip Dobranić, Tina Munda, Oliver Pejić, Vojko Gorjanc, Uroš Šmajdek, David Bordon, Jakob Lenardič, Tjaša Konovšek, Kristina Pahor de Maiti Tekavčič, Ciril Bohak, Darja Fišer


Abstract: This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then create NER graphs to explore the relationships between collective identities and places. We apply a mixed-methods approach to analyse the named entity graphs, combining quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socionomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.
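The abstract mentions NER graphs linking collective identities to places but does not describe how they are built. As a rough illustration only, and not the authors' method, the sketch below counts article-level co-occurrences between group and place mentions and sums edge weights per node. The input format, the function names, and the placeholder labels (`group_A`, `place_X`, and so on) are all assumptions made for this example; no corpus data is used.

```python
from collections import Counter
from itertools import product

# Hypothetical input: one record per article, with entities already
# extracted by some NER step (not shown). Labels are placeholders.
articles = [
    {"groups": ["group_A", "group_B"], "places": ["place_X"]},
    {"groups": ["group_A"], "places": ["place_X", "place_Y"]},
    {"groups": ["group_B"], "places": ["place_Y"]},
]


def build_cooccurrence(records):
    """Count in how many articles each (group, place) pair appears together."""
    edges = Counter()
    for rec in records:
        # set() ensures a pair is counted at most once per article.
        for pair in product(set(rec["groups"]), set(rec["places"])):
            edges[pair] += 1
    return edges


def weighted_degree(edges):
    """Sum the weights of all edges touching each node."""
    degree = Counter()
    for (group, place), weight in edges.items():
        degree[group] += weight
        degree[place] += weight
    return degree


edges = build_cooccurrence(articles)
degree = weighted_degree(edges)

for (group, place), weight in sorted(edges.items()):
    print(f"{group} -- {place}: {weight}")
```

Counting once per article rather than once per mention is a design choice of this sketch; a study working with OCR-degraded text might prefer a different co-occurrence window or normalisation.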

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2603.25051 [cs.CL]

(or arXiv:2603.25051v2 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2603.25051

arXiv-issued DOI via DataCite

Submission history

From: Ciril Bohak

[v1] Thu, 26 Mar 2026 05:38:30 UTC (3,483 KB)

[v2] Fri, 27 Mar 2026 09:09:00 UTC (3,476 KB)

Original source

arXiv

https://arxiv.org/abs/2603.25051
