Quid est VERITAS? A Modular Framework for Archival Document Analysis

arXiv · March 31, 2026

Authors: Leonardo Bassanini, Ludovico Biancardi, Alfio Ferrara, Andrea Gamberini, Sergio Picascia, Folco Vaglienti


Abstract: The digitisation of historical documents has traditionally been conceived as a process limited to character-level transcription, producing flat text that lacks the structural and semantic information necessary for substantive computational analysis. We present VERITAS (Vision-Enhanced Reading, Interpretation, and Transcription of Archival Sources), a modular, model-agnostic framework that reconceptualises digitisation as an integrated workflow encompassing transcription, layout analysis, and semantic enrichment. The pipeline is organised into four stages (Preprocessing, Extraction, Refinement, and Enrichment) and employs a schema-driven architecture that allows researchers to declaratively specify their extraction objectives. We evaluate VERITAS on the critical edition of Bernardino Corio's Storia di Milano, a Renaissance chronicle of over 1,600 pages. Results demonstrate that the pipeline achieves a 67.6% relative reduction in word error rate compared to a commercial OCR baseline, with a threefold reduction in end-to-end processing time when accounting for manual correction. We further illustrate the downstream utility of the pipeline's output by querying the transcribed corpus through a retrieval-augmented generation system, demonstrating its capacity to support historical inquiry.
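To make the headline metric concrete, the following is a minimal sketch of how a relative reduction in word error rate (WER) is typically computed. It is not the authors' evaluation code: the function names, the whitespace tokenisation, and the toy strings are assumptions made for illustration, and the paper's absolute WER values are not stated in the abstract. WER is taken here as the word-level edit distance divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; no normalisation
    (case folding, punctuation stripping) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline error removed by the system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Toy example with invented strings (not taken from the Corio corpus).
reference = "the duke entered the city of milan"
baseline_output = "the duke entred the citty of milan"   # two wrong words
system_output = "the duke entered the city of milano"    # one wrong word

baseline_wer = word_error_rate(reference, baseline_output)
system_wer = word_error_rate(reference, system_output)
reduction = relative_reduction(baseline_wer, system_wer)
print(f"baseline={baseline_wer:.3f} system={system_wer:.3f} reduction={reduction:.1%}")
```

Under this definition, a 67.6% relative reduction means the pipeline's WER is 32.4% of the baseline's WER, whatever the absolute values are.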

Comments: to be published in: LLMs4SSH: Shaping Multilingual, Multimodal AI for the Social Sciences and Humanities, organized within the 15th Language Resource and Evaluation Conference (2026)

Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Cite as: arXiv:2603.28108 [cs.DL]

(or arXiv:2603.28108v1 [cs.DL] for this version)

https://doi.org/10.48550/arXiv.2603.28108

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Sergio Picascia
[v1] Mon, 30 Mar 2026 07:14:51 UTC (3,663 KB)
