Research Papers research paper arxiv large language models scientific reasoning end-to-end reproduction

PRBench: End-to-end Paper Reproduction in Physics Research

HuggingFace PapersMarch 29, 20268 min read0 views

PRBench evaluates AI agents' ability to reproduce scientific research by requiring them to implement algorithms from published papers and match original results, revealing significant challenges in formula implementation, debugging, and data accuracy. (7 upvotes on HuggingFace)

Published on Mar 29

Authors:

Abstract

AI-generated summary

AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.

View arXiv page View PDF Project page GitHub 2 Add to collection

Get this paper in your agent:

hf papers read 2603.27646

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2603.27646 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2603.27646 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.27646 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.

Original source

HuggingFace Papers

https://huggingface.co/papers/2603.27646

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Research Papers

Exclusive | OpenAI’s Former Research Chief Aims to Automate Manufacturing With AI - wsj.com

Exclusive | OpenAI’s Former Research Chief Aims to Automate Manufacturing With AI wsj.com

GNews AI manufacturing

1m30 days ago

ReleasesLive

Local Node Differential Privacy

arXiv:2602.15802v2 Announce Type: replace Abstract: We initiate an investigation of node differential privacy for graphs in the local model of private data analysis. In our model, dubbed LNDP*, each node sees its own edge list and releases the output of a local randomizer on this input. These outputs are aggregated by an untrusted server to obtain a final output. We develop a novel algorithmic framework for this setting that allows us to accurately answer arbitrary linear queries about the input graph's degree distribution. Our framework is based on a new object, called the blurry degree distribution, which closely approximates the degree distribution and has lower sensitivity. Instead of answering queries about the degree distribution directly, our algorithms answer queries about the blur

arXiv cs.DS

2mabout 2 hours ago

Research PapersLive

Instance-Optimality in PageRank Computation

arXiv:2512.16087v2 Announce Type: replace Abstract: We study the problem of estimating a vertex's PageRank within a constant relative error, with constant probability. We prove that an adaptive variant of a simple, classic algorithm is instance-optimal up to a polylogarithmic factor for all directed graphs of order $n$ whose maximum in- and out-degrees are at most a constant fraction of $n$. In other words, there is no correct algorithm that can be faster than our algorithm on any such graph by more than a polylogarithmic factor. We further extend the instance-optimality to all graphs in which at most a polylogarithmic number of vertices have unbounded degrees. This covers all sparse graphs with $\tilde{O}(n)$ edges. Finally, we provide a counterexample showing that our algorithm is not in

arXiv cs.DS

1mabout 2 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 226 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research Papers

Exclusive | OpenAI’s Former Research Chief Aims to Automate Manufacturing With AI - wsj.com

Exclusive | OpenAI’s Former Research Chief Aims to Automate Manufacturing With AI wsj.com

GNews AI manufacturing

1m30 days ago

Research PapersLive

Instance-Optimality in PageRank Computation

arXiv cs.DS

1mabout 2 hours ago

Research PapersLive

Near-Optimal Distributed Ruling Sets for Trees and High-Girth Graphs

arXiv:2504.21777v2 Announce Type: replace Abstract: Given a graph $G=(V,E)$, a $\beta$-ruling set is a subset $S\subseteq V$ that is i) independent, and ii) every node $v\in V$ has a node of $S$ within distance $\beta$. In this paper we present almost optimal distributed algorithms for finding ruling sets in trees and high girth graphs in the classic LOCAL model. As our first contribution we present an $O(\log\log n)$-round randomized algorithm for computing $2$-ruling sets on trees, almost matching the $\Omega(\log\log n/\log\log\log n)$ lower bound given by Balliu et al. [FOCS'20]. Second, we show that $2$-ruling sets can be solved in $\widetilde{O}(\log^{5/3}\log n)$ rounds in high-girth graphs. Lastly, we show that $O(\log\log\log n)$-ruling sets can be computed in $\widetilde{O}(\log\

arXiv cs.DS

1mabout 2 hours ago

Research PapersLive

Branch-and-Bound Algorithms as Polynomial-time Approximation Schemes

arXiv:2504.15885v3 Announce Type: replace Abstract: Branch-and-bound algorithms (B&B) and polynomial-time approximation schemes (PTAS) are two seemingly distant areas of combinatorial optimization. We intend to (partially) bridge the gap between them while expanding the boundary of theoretical knowledge on the B\&B framework. Branch-and-bound algorithms typically guarantee that an optimal solution is eventually found. However, we show that the standard implementation of branch-and-bound for certain knapsack and scheduling problems also exhibits PTAS-like behavior, yielding increasingly better solutions within polynomial time. Our findings are supported by computational experiments and comparisons with benchmark methods. This paper is an extended version of a paper accepted at ICALP 2025

arXiv cs.DS

1mabout 2 hours ago