
LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval

arXiv, March 31, 2026

Seonok Kim


Abstract: Retrieving relevant evidence from visually rich documents such as textbooks, technical reports, and manuals is challenging due to long context, complex layouts, and weak lexical overlap between user questions and supporting pages. We propose LITTA, a query-expansion-centric retrieval framework for evidence page retrieval that improves multimodal document retrieval without retriever retraining. Given a user query, LITTA generates complementary query variants using a large language model and retrieves candidate pages for each variant using a frozen vision retriever with late-interaction scoring. Candidates from expanded queries are then aggregated through reciprocal rank fusion to improve evidence coverage and reduce sensitivity to any single phrasing. This simple test-time strategy significantly improves retrieval robustness while remaining compatible with existing multimodal embedding indices. We evaluate LITTA on visually grounded document retrieval tasks across three domains: computer science, pharmaceuticals, and industrial manuals. Multi-query retrieval consistently improves top-k accuracy, recall, and MRR compared to single-query retrieval, with particularly large gains in domains with high visual and semantic variability. Moreover, the accuracy-efficiency trade-off is directly controllable by the number of query variants, making LITTA practical for deployment under latency constraints. These results demonstrate that query expansion provides a simple yet effective mechanism for improving visually grounded multimodal retrieval.
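The aggregation step described in the abstract can be illustrated with a minimal sketch of reciprocal rank fusion over per-variant rankings. This is not the paper's implementation: the function name, the page identifiers, and the smoothing constant k = 60 (a commonly used default for reciprocal rank fusion) are all assumptions, and the rankings are hard-coded stand-ins for what a frozen late-interaction retriever would return for each query variant.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]],
    k: int = 60,  # assumed smoothing constant; the abstract does not state one
) -> List[Tuple[Hashable, float]]:
    """Fuse several ranked lists into one list sorted by RRF score.

    Each page receives 1 / (k + rank) from every ranking it appears in,
    where rank is 1-based. Pages missing from a ranking get nothing from it.
    """
    scores: Dict[Hashable, float] = {}
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] = scores.get(page_id, 0.0) + 1.0 / (k + rank)
    # sorted() is stable, so ties keep first-seen order.
    return sorted(scores.items(), key=lambda item: -item[1])


# Hypothetical retrieval results: one ranked list of page IDs per query
# variant (original query plus two LLM-generated rewrites).
rankings_per_variant = [
    ["p12", "p03", "p40"],  # original query
    ["p03", "p12", "p07"],  # variant 1
    ["p03", "p40", "p12"],  # variant 2
]

fused = reciprocal_rank_fusion(rankings_per_variant)
for page_id, score in fused:
    print(f"{page_id}\t{score:.5f}")
```

Because each variant contributes one more retrieval call and one more ranked list, the length of `rankings_per_variant` is the knob the abstract refers to when it says the accuracy-efficiency trade-off is controlled by the number of query variants.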

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.26683 [cs.IR]

(or arXiv:2603.26683v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2603.26683

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Seonok Kim
[v1] Tue, 10 Mar 2026 13:25:39 UTC (982 KB)
