
Calibrated Fusion for Heterogeneous Graph-Vector Retrieval in Multi-Hop QA

arXiv cs.IR · by Andre Bacellar · April 1, 2026 · 1 min read

arXiv:2603.28886v1 Announce Type: new


Abstract: Graph-augmented retrieval combines dense similarity with graph-based relevance signals such as Personalized PageRank (PPR), but these scores have different distributions and are not directly comparable. We study this as a score calibration problem for heterogeneous retrieval fusion in multi-hop question answering. Our method, PhaseGraph, maps vector and graph scores to a common unit-free scale using percentile-rank normalization (PIT) before fusion, enabling stable combination without discarding magnitude information. Across MuSiQue and 2WikiMultiHopQA, calibrated fusion improves held-out last-hop retrieval on HippoRAG2-style benchmarks: LastHop@5 increases from 75.1% to 76.5% on MuSiQue (8W/1L, p=0.039) and from 51.7% to 53.6% on 2WikiMultiHopQA (11W/2L, p=0.023), both on independent held-out test splits. A theory-driven ablation shows that percentile-based calibration is directionally more robust than min-max normalization on both tune and test splits (1W/6L, p=0.125), while Boltzmann weighting performs comparably to linear fusion after calibration (0W/3L, p=0.25). These results suggest that score commensuration is a robust design choice, and the exact post-calibration operator appears to matter less on these benchmarks.
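The core idea — mapping both score families onto a common percentile scale before fusing — can be illustrated with a minimal sketch. This is not the paper's PhaseGraph implementation; the empirical-CDF calibration, the linear fusion, and the `alpha` mixing weight below are illustrative assumptions chosen to show why cosine similarities and PPR mass become comparable after rank normalization.

```python
import numpy as np

def percentile_rank(scores: np.ndarray) -> np.ndarray:
    """Map raw scores to empirical percentile ranks in (0, 1].

    A probability-integral-transform-style calibration: each score is
    replaced by its rank among the candidate set, divided by the set
    size, so wildly different scales become unit-free. Ties are broken
    arbitrarily by argsort order.
    """
    ranks = scores.argsort().argsort()   # rank of each score, 0..n-1
    return (ranks + 1) / len(scores)

def fuse(vector_scores, graph_scores, alpha=0.5):
    """Linearly fuse calibrated vector and graph (e.g. PPR) scores.

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    v = percentile_rank(np.asarray(vector_scores, dtype=float))
    g = percentile_rank(np.asarray(graph_scores, dtype=float))
    return alpha * v + (1 - alpha) * g

# Example: cosine similarities (~0.4-0.8) vs. PPR mass (~1e-4 to 3e-2)
# live on incomparable scales, but their percentile ranks are fusable.
vec = [0.83, 0.41, 0.77, 0.65]
ppr = [3e-2, 1e-4, 2e-2, 5e-3]
fused = fuse(vec, ppr)
top = int(np.argmax(fused))  # candidate 0 ranks first under both signals
```

Fusing the raw scores directly would let the vector scores dominate simply because their magnitudes are larger; after percentile-rank calibration both signals contribute on an equal footing, which is the commensuration property the abstract credits for the stability of the fused ranking.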

Comments: 10 pages, 5 figures

Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)

ACM classes: H.3.3

Cite as: arXiv:2603.28886 [cs.IR]

(or arXiv:2603.28886v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2603.28886

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Andre Bacellar [v1] Mon, 30 Mar 2026 18:13:01 UTC (123 KB)
