PAR$^2$-RAG: Planned Active Retrieval and Reasoning for Multi-Hop Question Answering
Abstract: Large language models (LLMs) remain brittle on multi-hop question answering (MHQA), where answering requires combining evidence across documents through retrieval and reasoning. Iterative retrieval systems can fail by locking onto an early low-recall trajectory and amplifying downstream errors, while planning-only approaches may produce static query sets that cannot adapt when intermediate evidence changes. We propose \textbf{Planned Active Retrieval and Reasoning RAG (PAR$^2$-RAG)}, a two-stage framework that separates \emph{coverage} from \emph{commitment}. PAR$^2$-RAG first performs breadth-first anchoring to build a high-recall evidence frontier, then applies depth-first refinement with evidence sufficiency control in an iterative loop. Across four MHQA benchmarks, PAR$^2$-RAG consistently outperforms state-of-the-art baselines; compared with IRCoT, it achieves up to \textbf{23.5\%} higher accuracy, with retrieval gains of up to \textbf{10.5\%} in NDCG.
Comments: 11 pages, 2 figures
Subjects:
Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.29085 [cs.AI]
(or arXiv:2603.29085v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.29085
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Xingyu Li [view email] [v1] Mon, 30 Mar 2026 23:52:54 UTC (704 KB)
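The two-stage loop described in the abstract can be sketched in miniature. The sketch below is an illustration only, not the authors' implementation: the toy corpus, the lexical retriever, and the `plan_subqueries`, `evidence_sufficient`, and `refine_query` stand-ins (which a real system would back with an LLM) are all assumptions made for the example.

```python
# Hedged sketch of a PAR^2-RAG-style two-stage loop: breadth-first anchoring
# to build a high-recall evidence frontier, then depth-first refinement under
# an evidence sufficiency check. All components are illustrative stand-ins.

TOY_CORPUS = {
    "d1": "Marie Curie was born in Warsaw",
    "d2": "Warsaw is the capital of Poland",
    "d3": "Pierre Curie was a French physicist",
}

def retrieve(query, k=2):
    """Toy lexical retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(TOY_CORPUS.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    return [doc_id for doc_id, _ in scored[:k]]

def plan_subqueries(question):
    """Stand-in for breadth-first anchoring: decompose into anchor queries."""
    return [question, "Warsaw capital"]  # a real system would plan with an LLM

def evidence_sufficient(evidence):
    """Stand-in sufficiency check: stop once two distinct documents are held."""
    return len(evidence) >= 2

def refine_query(question, evidence):
    """Stand-in depth-first refinement: issue a targeted follow-up query."""
    return "capital of Poland"

def par2_rag(question, max_steps=3):
    # Stage 1: breadth-first anchoring builds the evidence frontier (coverage).
    frontier = set()
    for sq in plan_subqueries(question):
        frontier.update(retrieve(sq))
    # Stage 2: iterative depth-first refinement until evidence is sufficient
    # or the step budget is exhausted (commitment).
    for _ in range(max_steps):
        if evidence_sufficient(frontier):
            break
        frontier.update(retrieve(refine_query(question, frontier)))
    return sorted(frontier)

print(par2_rag("In which country was Marie Curie born?"))
```

The separation of concerns is the point of the sketch: stage 1 casts a wide net so the loop never commits to a single early trajectory, and stage 2 only refines when the sufficiency check says the frontier is still incomplete.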