PAR$^2$-RAG: Planned Active Retrieval and Reasoning for Multi-Hop Question Answering
Abstract: Large language models (LLMs) remain brittle on multi-hop question answering (MHQA), where answering requires combining evidence across documents through retrieval and reasoning. Iterative retrieval systems can fail by locking onto an early low-recall trajectory and amplifying downstream errors, while planning-only approaches may produce static query sets that cannot adapt when intermediate evidence changes. We propose \textbf{Planned Active Retrieval and Reasoning RAG (PAR$^2$-RAG)}, a two-stage framework that separates \emph{coverage} from \emph{commitment}. PAR$^2$-RAG first performs breadth-first anchoring to build a high-recall evidence frontier, then applies depth-first refinement with evidence sufficiency control in an iterative loop. Across four MHQA benchmarks, PAR$^2$-RAG consistently outperforms state-of-the-art baselines; compared with IRCoT, it achieves up to \textbf{23.5\%} higher accuracy, with retrieval gains of up to \textbf{10.5\%} in NDCG.
Comments: 11 pages, 2 figures
Subjects:
Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.29085 [cs.AI]
(or arXiv:2603.29085v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.29085
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Xingyu Li [view email] [v1] Mon, 30 Mar 2026 23:52:54 UTC (704 KB)
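The two-stage loop described in the abstract can be sketched in miniature. The sketch below is an illustration only, not the authors' implementation: the toy corpus, the lexical retriever, and the `plan_subqueries`, `evidence_sufficient`, and `refine_query` stand-ins (which a real system would back with an LLM) are all assumptions made for the example.

```python
# Hedged sketch of a PAR^2-RAG-style two-stage loop: breadth-first anchoring
# to build a high-recall evidence frontier, then depth-first refinement under
# an evidence sufficiency check. All components are illustrative stand-ins.

TOY_CORPUS = {
    "d1": "Marie Curie was born in Warsaw",
    "d2": "Warsaw is the capital of Poland",
    "d3": "Pierre Curie was a French physicist",
}

def retrieve(query, k=2):
    """Toy lexical retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(TOY_CORPUS.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    return [doc_id for doc_id, _ in scored[:k]]

def plan_subqueries(question):
    """Stand-in for breadth-first anchoring: decompose into anchor queries."""
    return [question, "Warsaw capital"]  # a real system would plan with an LLM

def evidence_sufficient(evidence):
    """Stand-in sufficiency check: stop once two distinct documents are held."""
    return len(evidence) >= 2

def refine_query(question, evidence):
    """Stand-in depth-first refinement: issue a targeted follow-up query."""
    return "capital of Poland"

def par2_rag(question, max_steps=3):
    # Stage 1: breadth-first anchoring builds the evidence frontier (coverage).
    frontier = set()
    for sq in plan_subqueries(question):
        frontier.update(retrieve(sq))
    # Stage 2: iterative depth-first refinement until evidence is sufficient
    # or the step budget is exhausted (commitment).
    for _ in range(max_steps):
        if evidence_sufficient(frontier):
            break
        frontier.update(retrieve(refine_query(question, frontier)))
    return sorted(frontier)

print(par2_rag("In which country was Marie Curie born?"))
```

The separation of concerns is the point of the sketch: stage 1 casts a wide net so the loop never commits to a single early trajectory, and stage 2 only refines when the sufficiency check says the frontier is still incomplete.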